

# Synthetic dataset
<a name="clarify-online-explainability-create-endpoint-synthetic"></a>

SageMaker Clarify uses the Kernel SHAP algorithm. Given a record (also called a sample or an instance) and the SHAP configuration, the explainer first generates a synthetic dataset. SageMaker Clarify then queries the model container for predictions on the synthetic dataset, and computes and returns the feature attributions. The size of the synthetic dataset affects the runtime of the Clarify explainer: larger synthetic datasets take longer to obtain model predictions than smaller ones.

The synthetic dataset size is determined by the following formula:

```
Synthetic dataset size = SHAP baseline size * n_samples
```

The SHAP baseline size is the number of records in the SHAP baseline data. This information is taken from the `ShapBaselineConfig`.

The value of `n_samples` is determined by the `NumberOfSamples` parameter in the explainer configuration and by the number of features. If the number of features is `n_features`, then `n_samples` is the following:

```
n_samples = MIN(NumberOfSamples, 2^n_features - 2)
```

If `NumberOfSamples` is not provided, `n_samples` is determined as follows:

```
n_samples = MIN(2*n_features + 2^11, 2^n_features - 2)
```

For example, consider a tabular record with 10 features and a SHAP baseline that contains a single record (a SHAP baseline size of 1). If `NumberOfSamples` is not provided, then `n_samples = MIN(2*10 + 2^11, 2^10 - 2) = 1022`, so the synthetic dataset contains 1022 records. If the record has 20 features instead, the `2^20 - 2` cap no longer applies, so `n_samples = 2*20 + 2^11 = 2088` and the synthetic dataset contains 2088 records.
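The preceding formulas can be sketched in Python. This is a minimal illustration of the arithmetic only; the function and argument names (`synthetic_dataset_size`, `shap_baseline_size`, and so on) are placeholders, not SDK parameters:

```python
def synthetic_dataset_size(shap_baseline_size, n_features, number_of_samples=None):
    """Estimate the number of records in the Clarify synthetic dataset."""
    if number_of_samples is None:
        # Default used when NumberOfSamples is not provided
        number_of_samples = 2 * n_features + 2**11
    # n_samples is capped at 2^n_features - 2
    n_samples = min(number_of_samples, 2**n_features - 2)
    return shap_baseline_size * n_samples

# Baseline of 1 record, NumberOfSamples not provided
print(synthetic_dataset_size(1, 10))  # 1022
print(synthetic_dataset_size(1, 20))  # 2088
```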

For NLP problems, `n_features` is equal to the number of non-text features plus the number of text units, where the number of text units depends on the granularity (such as tokens, sentences, or paragraphs) set in the text configuration.

**Note**  
The `InvokeEndpoint` API has a request timeout limit. If the synthetic dataset is too large, the explainer may not be able to complete the computation within this limit. If necessary, use the preceding formulas to estimate the synthetic dataset size, and reduce the SHAP baseline size or `NumberOfSamples` accordingly. If your model container is set up to handle batch requests, then you can also adjust the value of `MaxRecordCount`.
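These values are set in the explainer configuration when you create the endpoint configuration. The sketch below builds such a configuration using the `ClarifyExplainerConfig` structure from the `CreateEndpointConfig` API; the baseline record, sample count, and record count shown are placeholder values, and the baseline string format must match your model's input format:

```python
# Explainer configuration for CreateEndpointConfig. Lowering
# NumberOfSamples shrinks the synthetic dataset; MaxRecordCount
# controls how many records are sent to the model container per
# batch request.
explainer_config = {
    "ClarifyExplainerConfig": {
        "ShapConfig": {
            "ShapBaselineConfig": {
                # Inline baseline with a single record keeps the
                # SHAP baseline size at 1 (placeholder values)
                "ShapBaseline": "0.5,0.5,0.5",
            },
            "NumberOfSamples": 512,  # placeholder; lower values reduce runtime
        },
        "InferenceConfig": {
            "MaxRecordCount": 50,  # placeholder batch size
        },
    }
}

# Attach it when creating the endpoint configuration, for example:
# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_endpoint_config(
#     EndpointConfigName="my-endpoint-config",  # placeholder name
#     ProductionVariants=[...],                 # your variant definition
#     ExplainerConfig=explainer_config,
# )
```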