

# Fairness, model explainability and bias detection with SageMaker Clarify
<a name="clarify-configure-processing-jobs"></a>

You can use Amazon SageMaker Clarify to understand fairness and model explainability and to explain and detect bias in your models. You can configure a SageMaker Clarify processing job to compute bias metrics and feature attributions and generate reports for model explainability. SageMaker Clarify processing jobs are implemented using a specialized SageMaker Clarify container image. The following page describes how SageMaker Clarify works and how to get started with an analysis.

## What is fairness and model explainability for machine learning predictions?
<a name="clarify-fairness-and-explainability"></a>

Machine learning (ML) models are helping make decisions in domains including financial services, healthcare, education, and human resources. Policymakers, regulators, and advocates have raised awareness about the ethical and policy challenges posed by ML and data-driven systems. Amazon SageMaker Clarify can help you understand why your ML model made a specific prediction and whether bias impacts that prediction during training or inference. SageMaker Clarify also provides tools that can help you build less biased and more understandable machine learning models. SageMaker Clarify can also generate model governance reports that you can provide to risk and compliance teams and external regulators. With SageMaker Clarify, you can do the following:
+ Detect bias in and help explain your model predictions.
+ Identify types of bias in pre-training data.
+ Identify types of bias in post-training data that can emerge during training or when your model is in production.

SageMaker Clarify helps explain how your models make predictions using feature attributions. It can also monitor inference models that are in production for both bias and feature attribution drift. This information can help you in the following areas:
+ **Regulatory** – Policymakers and other regulators can have concerns about discriminatory impacts of decisions that use output from ML models. For example, an ML model may encode bias and influence an automated decision.
+ **Business** – Regulated domains may need reliable explanations for how ML models make predictions. Model explainability may be particularly important to industries that depend on reliability, safety, and compliance. These can include financial services, human resources, healthcare, and automated transportation. For example, lending applications may need to provide explanations about how ML models made certain predictions to loan officers, forecasters, and customers.
+ **Data Science** – Data scientists and ML engineers can debug and improve ML models when they can determine if a model is making inferences based on noisy or irrelevant features. They can also understand the limitations of their models and failure modes that their models may encounter.

For a blog post that shows how to architect and build a complete machine learning model for fraudulent automobile claims that integrates SageMaker Clarify into a SageMaker AI pipeline, see the [Architect and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/) demo. This blog post discusses how to assess and mitigate pre-training and post-training bias, and how features impact model predictions. The blog post contains links to example code for each task in the ML lifecycle.

### Best practices to evaluate fairness and explainability in the ML lifecycle
<a name="clarify-fairness-and-explainability-best-practices"></a>

**Fairness as a process** – Notions of bias and fairness depend on their application. The measurement of bias and the choice of the bias metrics may be guided by social, legal, and other non-technical considerations. The successful adoption of fairness-aware ML approaches includes building consensus and achieving collaboration across key stakeholders. These may include product, policy, legal, engineering, AI/ML teams, end users, and communities.

**Fairness and explainability by design in the ML lifecycle** – Consider fairness and explainability during each stage of the ML lifecycle. These stages include problem formation, dataset construction, algorithm selection, the model training process, the testing process, deployment, and monitoring and feedback. It is important to have the right tools to do this analysis. We recommend asking the following questions during the ML lifecycle:
+ Does the model encourage feedback loops that can produce increasingly unfair outcomes?
+ Is an algorithm an ethical solution to the problem?
+ Is the training data representative of different groups?
+ Are there biases in labels or features?
+ Does the data need to be modified to mitigate bias?
+ Do fairness constraints need to be included in the objective function?
+ Has the model been evaluated using relevant fairness metrics?
+ Are there unequal effects across users?
+ Is the model deployed on a population for which it was not trained or evaluated?

![\[Best practices for the process of evaluating fairness and model explainability.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify-best-practices-image.png)


### Guide to the SageMaker AI explanations and bias documentation
<a name="clarify-fairness-and-explainability-toc"></a>

Bias can occur and be measured in the data both before and after training a model. SageMaker Clarify can provide explanations for model predictions after training and for models deployed to production. SageMaker Clarify can also monitor models in production for any drift in their baseline explanatory attributions, and calculate baselines when needed. The documentation for explaining and detecting bias using SageMaker Clarify is structured as follows:
+ For information on setting up a processing job for bias and explainability, see [Configure a SageMaker Clarify Processing Job](clarify-processing-job-configure-parameters.md).
+ For information on detecting bias in pre-processing data before it's used to train a model, see [Pre-training Data Bias](clarify-detect-data-bias.md).
+ For information on detecting post-training data and model bias, see [Post-training Data and Model Bias](clarify-detect-post-training-bias.md).
+ For information on the model-agnostic feature attribution approach to explain model predictions after training, see [Model Explainability](clarify-model-explainability.md).
+ For information on monitoring for feature contribution drift away from the baseline that was established during model training, see [Feature attribution drift for models in production](clarify-model-monitor-feature-attribution-drift.md).
+ For information about monitoring models that are in production for baseline drift, see [Bias drift for models in production](clarify-model-monitor-bias-drift.md).
+ For information about obtaining explanations in real time from a SageMaker AI endpoint, see [Online explainability with SageMaker Clarify](clarify-online-explainability.md).

## How SageMaker Clarify Processing Jobs Work
<a name="clarify-processing-job-configure-how-it-works"></a>

You can use SageMaker Clarify to analyze your datasets and models for explainability and bias. A SageMaker Clarify processing job uses the SageMaker Clarify processing container to interact with an Amazon S3 bucket containing your input datasets. You can also use SageMaker Clarify to analyze a customer model that is deployed to a SageMaker AI inference endpoint.

The following graphic shows how a SageMaker Clarify processing job interacts with your input data and optionally, with a customer model. This interaction depends on the specific type of analysis being performed. The SageMaker Clarify processing container obtains the input dataset and configuration for analysis from an S3 bucket. For certain analysis types, including feature analysis, the SageMaker Clarify processing container must send requests to the model container. Then it retrieves the model predictions from the response that the model container sends. After that, the SageMaker Clarify processing container computes and saves analysis results to the S3 bucket.

![\[SageMaker Clarify can analyze your data or a customer model for explainability and bias.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/clarify-processing-job.png)


You can run a SageMaker Clarify processing job at multiple stages in the lifecycle of the machine learning workflow. SageMaker Clarify can help you compute the following analysis types:
+ Pre-training bias metrics. These metrics can help you understand the bias in your data so that you can address it and train your model on a fairer dataset. For information about pre-training bias metrics, see [Pre-training Bias Metrics](clarify-measure-data-bias.md). To run a job to analyze pre-training bias metrics, you must provide the dataset and a JSON analysis configuration file. For more information, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
+ Post-training bias metrics. These metrics can help you understand any bias introduced by an algorithm, hyperparameter choices, or any bias that wasn't apparent earlier in the flow. For more information about post-training bias metrics, see [Post-training Data and Model Bias Metrics](clarify-measure-post-training-bias.md). SageMaker Clarify uses the model predictions in addition to the data and labels to identify bias. To run a job to analyze post-training bias metrics, you must provide the dataset and a JSON analysis configuration file. The configuration should include the model or endpoint name.
+ Shapley values, which can help you understand what impact each feature has on your model's predictions. For more information about Shapley values, see [Feature Attributions that Use Shapley Values](clarify-shapley-values.md). This feature requires a trained model.
+ Partial dependence plots (PDPs), which can help you understand how much your predicted target variable would change if you varied the value of one feature. For more information about PDPs, see [Partial dependence plots (PDPs) analysis](clarify-processing-job-analysis-results.md#clarify-processing-job-analysis-results-pdp). This feature requires a trained model.

SageMaker Clarify needs model predictions to compute post-training bias metrics and feature attributions. You can either provide an endpoint, or SageMaker Clarify can create an ephemeral endpoint using your model name, known as a *shadow endpoint*. The SageMaker Clarify container deletes the shadow endpoint after the computations are completed. At a high level, the SageMaker Clarify container completes the following steps:

1. Validates inputs and parameters.

1. Creates the shadow endpoint (if a model name is provided).

1. Loads the input dataset into a data frame.

1. Obtains model predictions from the endpoint, if necessary.

1. Computes bias metrics and feature attributions.

1. Deletes the shadow endpoint.

1. Generates the analysis results.

After the SageMaker Clarify processing job is complete, the analysis results will be saved in the output location that you specified in the processing output parameter of the job. These results include a JSON file with bias metrics and global feature attributions, a visual report, and additional files for local feature attributions. You can download the results from the output location and view them.
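For example, the following sketch uses the AWS SDK for Python to retrieve the JSON results after the job completes. The bucket and prefix are placeholders for the output location that you specified, and the result file name `analysis.json` is assumed:

```
import boto3

s3_client = boto3.client("s3")

# Placeholder output location; match the S3Uri from your ProcessingOutputConfig.
bucket = "your-bucket"
prefix = "result"

# analysis.json holds the bias metrics and global feature attributions.
s3_client.download_file(bucket, f"{prefix}/analysis.json", "analysis.json")
```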

For additional information about bias metrics, explainability, and how to interpret them, see [Learn How Amazon SageMaker Clarify Helps Detect Bias](https://aws.amazon.com/blogs/machine-learning/learn-how-amazon-sagemaker-clarify-helps-detect-bias), [Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf), and the [Amazon AI Fairness and Explainability Whitepaper](https://pages.awscloud.com/rs/112-TZM-766/images/Amazon.AI.Fairness.and.Explainability.Whitepaper.pdf).

# Configure a SageMaker Clarify Processing Job
<a name="clarify-processing-job-configure-parameters"></a>

To analyze your data and models for bias and explainability using SageMaker Clarify, you must configure a SageMaker Clarify processing job. This guide shows how to specify the input dataset name, analysis configuration file name, and output location for a processing job. To configure the processing container, job inputs, outputs, resources, and other parameters, you have two options. You can either use the SageMaker AI `CreateProcessingJob` API, or use the `SageMakerClarifyProcessor` class in the SageMaker Python SDK.

For information about parameters that are common to all processing jobs, see [Amazon SageMaker API Reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html?icmpid=docs_sagemaker_lp).

## Configure a SageMaker Clarify processing job using the SageMaker API
<a name="clarify-processing-job-configure-parameters-API"></a>

The following instructions show how to provide each portion of the SageMaker Clarify specific configuration using the `CreateProcessingJob` API.

1. Input the uniform resource identifier (URI) of a SageMaker Clarify container image inside the `AppSpecification` parameter, as shown in the following code example.

   ```
   {
       "ImageUri": "the-clarify-container-image-uri"
   }
   ```
**Note**  
The URI must identify a pre-built SageMaker Clarify container image. `ContainerEntrypoint` and `ContainerArguments` are not supported. For more information about SageMaker Clarify container images, see [Prebuilt SageMaker Clarify Containers](clarify-processing-job-configure-container.md).

1. Specify both the configuration for your analysis and parameters for your input dataset inside the `ProcessingInputs` parameter.

   1. Specify the location of the JSON analysis configuration file, which includes the parameters for bias analysis and explainability analysis. The `InputName` parameter of the `ProcessingInput` object must be **analysis\_config**, as shown in the following code example.

      ```
      {
          "InputName": "analysis_config",
          "S3Input": {
              "S3Uri": "s3://your-bucket/analysis_config.json",
              "S3DataType": "S3Prefix",
              "S3InputMode": "File",
              "LocalPath": "/opt/ml/processing/input/config"
          }
      }
      ```

      For more information about the schema of the analysis configuration file, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

   1. Specify the location of the input dataset. The `InputName` parameter of the `ProcessingInput` object must be `dataset`. This parameter is optional if you provided `dataset_uri` in the analysis configuration file. The following values are required in the `S3Input` configuration.

      1. `S3Uri` can be either an Amazon S3 object or an S3 prefix.

      1. `S3InputMode` must be of type **File**.

      1. `S3CompressionType` must be of type `None` (the default value).

      1. `S3DataDistributionType` must be of type `FullyReplicated` (the default value).

      1. `S3DataType` can be either `S3Prefix` or `ManifestFile`. To use `ManifestFile`, the `S3Uri` parameter should specify the location of a manifest file that follows the schema from the SageMaker API Reference section [S3Uri](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html#sagemaker-Type-S3DataSource-S3Uri). This manifest file must list the S3 objects that contain the input data for the job.

      The following code shows an example of an input configuration.

      ```
      {
          "InputName": "dataset",
          "S3Input": {
              "S3Uri": "s3://your-bucket/your-dataset.csv",
              "S3DataType": "S3Prefix",
              "S3InputMode": "File",
              "LocalPath": "/opt/ml/processing/input/data"
          }
      }
      ```

1. Specify the configuration for the output of the processing job inside the `ProcessingOutputConfig` parameter. A single `ProcessingOutput` object is required in the `Outputs` configuration. The following are required of the output configuration:

   1. `OutputName` must be **analysis\_result**.

   1. `S3Uri` must be an S3 prefix to the output location.

   1. `S3UploadMode` must be set to **EndOfJob**.

   The following code shows an example of an output configuration.

   ```
   {
       "Outputs": [{ 
           "OutputName": "analysis_result",
           "S3Output": { 
               "S3Uri": "s3://your-bucket/result/",
               "S3UploadMode": "EndOfJob",
               "LocalPath": "/opt/ml/processing/output"
            }
        }]
   }
   ```

1. Specify the configuration `ClusterConfig` for the resources that you use in your processing job inside the `ProcessingResources` parameter. The following parameters are required inside the `ClusterConfig` object.

   1. `InstanceCount` specifies the number of compute instances in the cluster that runs the processing job. Specify a value greater than `1` to activate distributed processing.

   1. `InstanceType` specifies the type of compute instance that runs your processing job. Because SageMaker Clarify SHAP analysis is compute-intensive, choosing an instance type that is optimized for compute can reduce the analysis runtime. The SageMaker Clarify processing job doesn't use GPUs.

   The following code shows an example of resource configuration.

   ```
   {
       "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 20
        }
   }
   ```

1. Specify the configuration of the network that you use in your processing job inside the `NetworkConfig` object. The following values are required in the configuration.

   1. `EnableNetworkIsolation` must be set to `False` (default) so that SageMaker Clarify can invoke an endpoint, if necessary, for predictions.

   1. If the model or endpoint that you provided to the SageMaker Clarify job is within an Amazon Virtual Private Cloud (Amazon VPC), then the SageMaker Clarify job must also be in the same VPC. Specify the VPC using [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html). Additionally, the VPC must have endpoints for Amazon S3, the SageMaker AI service, and the SageMaker AI Runtime service.

      If distributed processing is activated, you must also allow communication between different instances in the same processing job. Configure a rule for your security group that allows inbound connections between members of the same security group. For more information, see [Give Amazon SageMaker Clarify Jobs Access to Resources in Your Amazon VPC](clarify-vpc.md). 

   The following code gives an example of a network configuration.

   ```
   {
       "EnableNetworkIsolation": False,
       "VpcConfig": {
           ...
       }
   }
   ```

1. Set the maximum time that the job will run using the `StoppingCondition` parameter. The longest that a SageMaker Clarify job can run is `7` days or `604800` seconds. If the job cannot be completed within this time limit, it will be stopped and no analysis results will be provided. As an example, the following configuration limits the maximum time that the job can run to 3600 seconds.

   ```
   {
       "MaxRuntimeInSeconds": 3600
   }
   ```

1. Specify an IAM role for the `RoleArn` parameter. The role must have a trust relationship with Amazon SageMaker AI and permissions to perform the SageMaker API operations listed in the following table. We recommend using the AmazonSageMakerFullAccess managed policy, which grants full access to SageMaker AI. For more information on this policy, see [AWS managed policy: AmazonSageMakerFullAccess](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSageMakerFullAccess). If you have concerns about granting full access, the minimal permissions required depend on whether you provide a model or an endpoint name. Using an endpoint name allows for granting fewer permissions to SageMaker AI.

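   For reference, the trust relationship that allows SageMaker AI to assume your role is a standard IAM trust policy, such as the following:

   ```
   {
       "Version": "2012-10-17",
       "Statement": [{
           "Effect": "Allow",
           "Principal": { "Service": "sagemaker.amazonaws.com" },
           "Action": "sts:AssumeRole"
       }]
   }
   ```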
   The following table contains API operations used by the SageMaker Clarify processing job. An **X** under **Model name** and **Endpoint name** indicates the API operation that is required for each input.
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-parameters.html)

   For more information about required permissions, see [Amazon SageMaker AI API Permissions: Actions, Permissions, and Resources Reference](api-permissions-reference.md).

   For more information about passing roles to SageMaker AI, see [Passing Roles](sagemaker-roles.md#sagemaker-roles-pass-role).

   After you have the individual pieces of the processing job configuration, combine them to configure the job.

## Configure a SageMaker Clarify processing job using the AWS SDK for Python
<a name="clarify-processing-job-configure-parameters-SDK"></a>

The following code example shows how to launch a SageMaker Clarify processing job using the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/).

```
import boto3

# Create a SageMaker service client.
sagemaker_client = boto3.client("sagemaker")

sagemaker_client.create_processing_job(
    ProcessingJobName="your-clarify-job-name",
    AppSpecification={
        "ImageUri": "the-clarify-container-image-uri",
    },
    ProcessingInputs=[{
            "InputName": "analysis_config",
            "S3Input": {
                "S3Uri": "s3://your-bucket/analysis_config.json",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "LocalPath": "/opt/ml/processing/input/config",
            },
        }, {
            "InputName": "dataset",
            "S3Input": {
                "S3Uri": "s3://your-bucket/your-dataset.csv",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "LocalPath": "/opt/ml/processing/input/data",
            },
        },
    ],
    ProcessingOutputConfig={
        "Outputs": [{ 
            "OutputName": "analysis_result",
            "S3Output": { 
               "S3Uri": "s3://your-bucket/result/",
               "S3UploadMode": "EndOfJob",
               "LocalPath": "/opt/ml/processing/output",
            },   
        }],
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 20,
        },
    },
    NetworkConfig={
        "EnableNetworkIsolation": False,
        "VpcConfig": {
            ...
        },
    },
    StoppingCondition={
        "MaxRuntimeInSeconds": 3600,
    },
    RoleArn="arn:aws:iam::<your-account-id>:role/service-role/AmazonSageMaker-ExecutionRole",
)
```

For an example notebook with instructions for running a SageMaker Clarify processing job using AWS SDK for Python, see [Fairness and Explainability with SageMaker Clarify using AWS SDK for Python](http://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_boto3.ipynb). Any S3 bucket used in the notebook must be in the same AWS Region as the notebook instance that accesses it.

## Configure a SageMaker Clarify processing job using the SageMaker Python SDK
<a name="clarify-processing-job-configure-parameters-SM-SDK"></a>

You can also configure a SageMaker Clarify processing job using the [SageMakerClarifyProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor) class in the SageMaker Python SDK API. For more information, see [Run SageMaker Clarify Processing Jobs for Bias Analysis and Explainability](clarify-processing-job-run.md).
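As a brief sketch of that approach, the following code runs a pre-training bias analysis. The role ARN, bucket, column names, and facet are placeholders; the SDK generates the analysis configuration file and launches the processing job for you:

```
from sagemaker import Session, clarify

session = Session()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::<your-account-id>:role/service-role/AmazonSageMaker-ExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Describe the dataset: input location, output location, and schema.
data_config = clarify.DataConfig(
    s3_data_input_path="s3://your-bucket/your-dataset.csv",
    s3_output_path="s3://your-bucket/result/",
    label="target_label",
    headers=["feature1", "feature2", "feature3", "target_label"],
    dataset_type="text/csv",
)

# Describe the bias analysis: the positive outcome and the sensitive attribute.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="feature1",
)

# Launch the processing job; the SDK writes the analysis configuration for you.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)
```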

**Topics**
+ [Prebuilt SageMaker Clarify Containers](clarify-processing-job-configure-container.md)
+ [Analysis Configuration Files](clarify-processing-job-configure-analysis.md)
+ [Data Format Compatibility Guide](clarify-processing-job-data-format.md)

# Prebuilt SageMaker Clarify Containers
<a name="clarify-processing-job-configure-container"></a>

Amazon SageMaker AI provides prebuilt SageMaker Clarify container images that include the libraries and other dependencies needed to compute bias metrics and feature attributions for explainability. These images are capable of running SageMaker Clarify [processing jobs](processing-job.md) in your account.

The image URIs for the containers are in the following form:

```
<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/sagemaker-clarify-processing:1.0
```

For example:

```
111122223333.dkr.ecr.us-east-1.amazonaws.com/sagemaker-clarify-processing:1.0
```
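If you use the SageMaker Python SDK, you can also resolve the image URI for your Region programmatically instead of copying it from the table below; a minimal sketch, assuming a recent version of the SDK:

```
from sagemaker import image_uris

# Look up the SageMaker Clarify image URI for a given Region.
uri = image_uris.retrieve(framework="clarify", region="us-east-1", version="1.0")
print(uri)
```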

The following table lists the addresses by AWS Region.

Docker Images for SageMaker Clarify Processing Jobs


| Region | Image address | 
| --- | --- | 
| US East (N. Virginia) | 205585389593.dkr.ecr.us-east-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| US East (Ohio) | 211330385671.dkr.ecr.us-east-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| US West (N. California) | 740489534195.dkr.ecr.us-west-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| US West (Oregon) | 306415355426.dkr.ecr.us-west-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Hong Kong) | 098760798382.dkr.ecr.ap-east-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Mumbai) | 452307495513.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Jakarta) | 705930551576.dkr.ecr.ap-southeast-3.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Tokyo) | 377024640650.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Seoul) | 263625296855.dkr.ecr.ap-northeast-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Osaka) | 912233562940.dkr.ecr.ap-northeast-3.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Singapore) | 834264404009.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Asia Pacific (Sydney) | 007051062584.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Canada (Central) | 675030665977.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Frankfurt) | 017069133835.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Zurich) | 730335477804.dkr.ecr.eu-central-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Ireland) | 131013547314.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (London) | 440796970383.dkr.ecr.eu-west-2.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Paris) | 341593696636.dkr.ecr.eu-west-3.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Stockholm) | 763603941244.dkr.ecr.eu-north-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Middle East (Bahrain) | 835444307964.dkr.ecr.me-south-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| South America (São Paulo) | 520018980103.dkr.ecr.sa-east-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Africa (Cape Town) | 811711786498.dkr.ecr.af-south-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| Europe (Milan) | 638885417683.dkr.ecr.eu-south-1.amazonaws.com/sagemaker-clarify-processing:1.0 | 
| China (Beijing) | 122526803553.dkr.ecr.cn-north-1.amazonaws.com.cn/sagemaker-clarify-processing:1.0 | 
| China (Ningxia) | 122578899357.dkr.ecr.cn-northwest-1.amazonaws.com.cn/sagemaker-clarify-processing:1.0 | 

# Analysis Configuration Files
<a name="clarify-processing-job-configure-analysis"></a>

To analyze your data and models for explainability and bias using SageMaker Clarify, you must configure a processing job. Part of the configuration for this processing job includes the configuration of an analysis file. The analysis file specifies the parameters for bias analysis and explainability. See [Configure a SageMaker Clarify Processing Job](clarify-processing-job-configure-parameters.md) to learn how to configure a processing job and analysis file.

This guide describes the schema and parameters for this analysis configuration file. This guide also includes examples of analysis configuration files for computing bias metrics for a tabular dataset, and generating explanations for natural language processing (NLP), computer vision (CV), and time series (TS) problems.

You can create the analysis configuration file or use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/) to generate one for you with the [SageMakerClarifyProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor) API. Viewing the file contents can be helpful for understanding the underlying configuration used by the SageMaker Clarify job.

**Topics**
+ [Schema for the analysis configuration file](#clarify-processing-job-configure-schema)
+ [Example analysis configuration files](#clarify-processing-job-configure-analysis-examples)

## Schema for the analysis configuration file
<a name="clarify-processing-job-configure-schema"></a>

The following section describes the schema for the analysis configuration file including requirements and descriptions of parameters.

### Requirements for the analysis configuration file
<a name="clarify-processing-job-configure-schema-requirements"></a>

The SageMaker Clarify processing job expects the analysis configuration file to be structured with the following requirements:
+ The processing input name must be `analysis_config`.
+ The analysis configuration file is in JSON format, and encoded in UTF-8.
+ The analysis configuration file is an Amazon S3 object.

You can specify additional parameters in the analysis configuration file. The following section provides various options to tailor the SageMaker Clarify processing job for your use case and desired types of analysis.
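For orientation before the parameter reference, the following is a minimal analysis configuration that requests all pre-training bias metrics for a CSV dataset. The column names and facet are placeholders:

```
{
    "version": "1.0",
    "dataset_type": "text/csv",
    "headers": ["feature1", "feature2", "feature3", "target_label"],
    "label": "target_label",
    "label_values_or_threshold": [1],
    "facet": [{"name_or_index": "feature1"}],
    "methods": {
        "pre_training_bias": {"methods": "all"}
    }
}
```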

### Parameters for analysis configuration files
<a name="clarify-processing-job-configure-analysis-parameters"></a>

In the analysis configuration file, you can specify the following parameters.
+ **version** – (Optional) The version string of the analysis configuration file schema. If a version is not provided, SageMaker Clarify uses the latest supported version. Currently, the only supported version is `1.0`.
+ **dataset\_type** – The format of the dataset. The input dataset format can be any of the following values:
  + Tabular
    + `text/csv` for CSV
    + `application/jsonlines` for [SageMaker AI JSON Lines dense format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-jsonlines)
    + `application/json` for JSON
    + `application/x-parquet` for Apache Parquet
    + `application/x-image` to activate explainability for computer vision problems
  + Time series forecasting model explanations
    + `application/json` for JSON
+ **dataset\_uri** – (Optional) The uniform resource identifier (URI) of the main dataset. If you provide an S3 URI prefix, the SageMaker Clarify processing job recursively collects all S3 files located under the prefix. You can provide either an S3 URI prefix or an S3 URI to an image manifest file for computer vision problems. If `dataset_uri` is provided, it takes precedence over the dataset processing job input. For any format type except image and time series use cases, the SageMaker Clarify processing job loads the input dataset into a tabular data frame, as a **tabular dataset**. This format allows SageMaker AI to easily manipulate and analyze the input dataset.
+ **headers** – (Optional)
  + **Tabular:** An array of strings containing the column names of a tabular dataset. If a value is not provided for `headers`, the SageMaker Clarify processing job reads the headers from the dataset. If the dataset doesn’t have headers, then the Clarify processing job automatically generates placeholder names based on the zero-based column index. For example, the placeholder names for the first and second columns are **column\_0**, **column\_1**, and so on.
**Note**  
By convention, if `dataset_type` is `application/jsonlines` or `application/json`, then `headers` should contain the following names in order:  
feature names
label name (if `label` is specified)
predicted label name (if `predicted_label` is specified)
An example for `headers` for an `application/jsonlines` dataset type if `label` is specified is: `["feature1","feature2","feature3","target_label"]`.
  + **Time series:** A list of column names in the dataset. If not provided, Clarify generates headers to use internally. For time series explainability cases, provide headers in the following order:

    1. item id

    1. timestamp

    1. target time series

    1. all related time series columns

    1. all static covariate columns
+ **label** – (Optional) A string or a zero-based integer index. If provided, `label` is used to locate the ground truth label, also known as an observed label or target attribute in a tabular dataset. The ground truth label is used to compute bias metrics. The value for `label` is specified depending on the value of the `dataset_type` parameter as follows.
  + If `dataset_type` is **text/csv**, `label` can be specified as either of the following:
    + A valid column name
    + An index that lies within the range of dataset columns
  + If `dataset_type` is **application/x-parquet**, `label` must be a valid column name.
  + If `dataset_type` is **application/jsonlines**, `label` must be a [JMESPath](https://jmespath.org/) expression written to extract the ground truth label from the dataset. By convention, if `headers` is specified, then it should contain the label name.
  + If `dataset_type` is **application/json**, `label` must be a [JMESPath](https://jmespath.org/) expression written to extract the ground truth label for each record in the dataset. This JMESPath expression must produce a list of labels where the ith label correlates to the ith record.
+ **predicted\_label** – (Optional) A string or a zero-based integer index. If provided, `predicted_label` is used to locate the column containing the predicted label in a tabular dataset. The predicted label is used to compute post-training **bias metrics**. The parameter `predicted_label` is optional if the dataset doesn’t include a predicted label. If predicted labels are required for computation, then the SageMaker Clarify processing job gets predictions from the model.

  The value for `predicted_label` is specified depending on the value of the `dataset_type` as follows:
  + If `dataset_type` is **text/csv**, `predicted_label` can be specified as either of the following:
    + A valid column name. If `predicted_label_dataset_uri` is specified, but `predicted_label` is not provided, the default predicted label name is "predicted\_label". 
    + An index that lies within the range of dataset columns. If `predicted_label_dataset_uri` is specified, then the index is used to locate the predicted label column in the predicted label dataset.
  + If `dataset_type` is **application/x-parquet**, `predicted_label` must be a valid column name.
  + If `dataset_type` is **application/jsonlines**, `predicted_label` must be a valid [JMESPath](https://jmespath.org/) expression written to extract the predicted label from the dataset. By convention, if `headers` is specified, then it should contain the predicted label name. 
  + If `dataset_type` is **application/json**, `predicted_label` must be a [JMESPath](https://jmespath.org/) expression written to extract the predicted label for each record in the dataset. The JMESPath expression should produce a list of predicted labels where the ith predicted label is for the ith record.
+ **features** – (Optional) Required for non-time-series use cases if `dataset_type` is `application/jsonlines` or `application/json`. A JMESPath string expression written to locate the features in the input dataset. For `application/jsonlines`, a JMESPath expression will be applied to each line to extract the features for that record. For `application/json`, a JMESPath expression will be applied to the whole input dataset. The JMESPath expression should extract a list of lists, or a 2D array/matrix of features where the ith row contains the features that correlate to the ith record. For a `dataset_type` of `text/csv` or `application/x-parquet`, all columns except for the ground truth label and predicted label columns are automatically assigned to be features.
+ **predicted\_label\_dataset\_uri** – (Optional) Only applicable when `dataset_type` is `text/csv`. The S3 URI for a dataset containing predicted labels used to compute post-training **bias metrics**. The SageMaker Clarify processing job will load the predictions from the provided URI instead of getting predictions from the model. In this case, `predicted_label` is required to locate the predicted label column in the predicted label dataset. If the predicted label dataset or the main dataset is split across multiple files, an identifier column must be specified by `joinsource_name_or_index` to join the two datasets. 
+ **predicted\_label\_headers** – (Optional) Only applicable when `predicted_label_dataset_uri` is specified. An array of strings containing the column names of the predicted label dataset. Besides the predicted label header, `predicted_label_headers` can also contain the header of the identifier column to join the predicted label dataset and the main dataset. For more information, see the following description for the parameter `joinsource_name_or_index`.
+ **joinsource\_name\_or\_index** – (Optional) The name or zero-based index of the column in tabular datasets to be used as an identifier column while performing an inner join. This column is only used as an identifier. It isn't used for any other computations like bias analysis or feature attribution analysis. A value for `joinsource_name_or_index` is needed in the following cases:
  + There are multiple input datasets, and any one is split across multiple files.
  + Distributed processing is activated by setting the SageMaker Clarify processing job [InstanceCount](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingClusterConfig.html#sagemaker-Type-ProcessingClusterConfig-InstanceCount) to a value greater than `1`.
+ **excluded\_columns** – (Optional) An array of names or zero-based indices of columns to be excluded from being sent to the model as input for predictions. The ground truth label and predicted label are automatically excluded. This feature is not supported for time series.
+ **probability\_threshold** – (Optional) A floating point number above which a label or object is selected. The default value is `0.5`. The SageMaker Clarify processing job uses `probability_threshold` in the following cases:
  + In post-training bias analysis, `probability_threshold` converts a numeric model prediction (probability value or score) to a binary label, if the model is a binary classifier. A score greater than the threshold is converted to `1`, whereas a score less than or equal to the threshold is converted to `0`.
  + In computer vision explainability problems, if `model_type` is **OBJECT\_DETECTION**, `probability_threshold` filters out objects detected with confidence scores lower than the threshold value.
+ **label\_values\_or\_threshold** – (Optional) Required for bias analysis. An array of label values or a threshold number, which indicate a positive outcome for ground truth and predicted labels for bias metrics. For more information, see positive label values in [Amazon SageMaker Clarify Terms for Bias and Fairness](clarify-detect-data-bias.md#clarify-bias-and-fairness-terms). If the label is numeric, the threshold is applied as the lower bound to select the positive outcome. To set `label_values_or_threshold` for different problem types, refer to the following examples:
  + For a binary classification problem, the label has two possible values, `0` and `1`. If label value `1` is favorable to a demographic group observed in a sample, then `label_values_or_threshold` should be set to `[1]`.
  + For a multiclass classification problem, the label has three possible values, **bird**, **cat**, and **dog**. If the latter two define a demographic group that bias favors, then `label_values_or_threshold` should be set to `["cat","dog"]`.
  + For a regression problem, the label value is continuous, ranging from `0` to `1`. If a value greater than `0.5` should designate a sample as having a positive result, then `label_values_or_threshold` should be set to `0.5`.
+ **facet** – (Optional) Required for bias analysis. An array of facet objects, which are composed of sensitive attributes against which bias is measured. You can use facets to understand the bias characteristics of your dataset and model even if your model is trained without using sensitive attributes. For more information, see **Facet** in [Amazon SageMaker Clarify Terms for Bias and Fairness](clarify-detect-data-bias.md#clarify-bias-and-fairness-terms). Each facet object includes the following fields:
  + **name\_or\_index** – (Optional) The name or zero-based index of the sensitive attribute column in a tabular dataset. If `facet_dataset_uri` is specified, then the index refers to the facet dataset instead of the main dataset.
  + **value\_or\_threshold** – (Optional) Required if `facet` is numeric. An array of facet values or a threshold number that indicates the sensitive demographic group that bias favors. If the facet data type is numeric, the threshold is applied as the lower bound to select the sensitive group. If the facet data type is categorical and `value_or_threshold` is not provided, bias metrics are computed as one group for every unique value (rather than all values). To set `value_or_threshold` for different `facet` data types, refer to the following examples:
    + For a binary facet data type, the feature has two possible values, `0` and `1`. If you want to compute the bias metrics for each value, then `value_or_threshold` can be either omitted or set to an empty array.
    + For a categorical facet data type, the feature has three possible values **bird**, **cat**, and **dog**. If the first two define a demographic group that bias favors, then `value_or_threshold` should be set to `["bird", "cat"]`. In this example, the dataset samples are split into two demographic groups. The facet in the advantaged group has value **bird** or **cat**, while the facet in the disadvantaged group has value **dog**.
    + For a numeric facet data type, the feature value is continuous, ranging from `0` to `1`. As an example, if a value greater than `0.5` should designate a sample as favored, then `value_or_threshold` should be set to `0.5`. In this example, the dataset samples are split into two demographic groups. The facet in the advantaged group has value greater than `0.5`, while the facet in the disadvantaged group has value less than or equal to `0.5`.
+ **group\_variable** – (Optional) The name or zero-based index of the column that indicates the subgroup to be used for the bias metric [Conditional Demographic Disparity (CDD)](clarify-data-bias-metric-cddl.md) or [Conditional Demographic Disparity in Predicted Labels (CDDPL)](clarify-post-training-bias-metric-cddpl.md).
+ **facet\_dataset\_uri** – (Optional) Only applicable when `dataset_type` is `text/csv`. The S3 URI for a dataset containing sensitive attributes for bias analysis. You can use facets to understand the bias characteristics of your dataset and model even if your model is trained without using sensitive attributes.
**Note**  
If the facet dataset or the main dataset is split across multiple files, an identifier column must be specified by `joinsource_name_or_index` to join the two datasets. You must use the parameter `facet` to identify each facet in the facet dataset.
+ **facet\_headers** – (Optional) Only applicable when `facet_dataset_uri` is specified. An array of strings containing column names for the facet dataset, and optionally, the identifier column header to join the facet dataset and the main dataset. For more information, see `joinsource_name_or_index`.
+ **time\_series\_data\_config** – (Optional) Specifies the configuration to use for data processing of a time series.
  + **item\_id** – A string or a zero-based integer index. This field is used to locate an item id in the shared input dataset.
  + **timestamp** – A string or a zero-based integer index. This field is used to locate a timestamp in the shared input dataset.
  + **dataset\_format** – Possible values are `columns`, `item_records`, or `timestamp_records`. This field is used to describe the format of a JSON dataset, which is the only format supported for time series explainability.
  + **target\_time\_series** – A JMESPath string or a zero-based integer index. This field is used to locate the target time series in the shared input dataset. If this parameter is a string, then all other parameters except `dataset_format` must be strings or lists of strings. If this parameter is an integer, then all other parameters except `dataset_format` must be integers or lists of integers.
  + **related\_time\_series** – (Optional) An array of JMESPath expressions. This field is used to locate all related time series in the shared input dataset, if present.
  + **static\_covariates** – (Optional) An array of JMESPath expressions. This field is used to locate all static covariate fields in the shared input dataset, if present.

  For examples, see [Time series dataset config examples](clarify-processing-job-data-format-time-series.md#clarify-processing-job-data-format-time-series-ex).
+ **methods** – An object containing one or more analysis methods and their parameters. If any method is omitted, it is neither used for analysis nor reported.
  + **pre\_training\_bias** – Include this method if you want to compute pre-training bias metrics. The detailed description of the metrics can be found in [Pre-training Bias Metrics](clarify-measure-data-bias.md). The object has the following parameters:
    + **methods** – An array that contains any of the pre-training bias metrics from the following list that you want to compute. Set `methods` to **all** to compute all pre-training bias metrics. As an example, the array `["CI", "DPL"]` will compute **Class Imbalance** and **Difference in Proportions of Labels**.
      + `CI` for [Class Imbalance (CI)](clarify-bias-metric-class-imbalance.md)
      + `DPL` for [Difference in Proportions of Labels (DPL)](clarify-data-bias-metric-true-label-imbalance.md)
      + `KL` for [Kullback-Leibler Divergence (KL)](clarify-data-bias-metric-kl-divergence.md)
      + `JS` for [Jensen-Shannon Divergence (JS)](clarify-data-bias-metric-jensen-shannon-divergence.md)
      + `LP` for [Lp-norm (LP)](clarify-data-bias-metric-lp-norm.md)
      + `TVD` for [Total Variation Distance (TVD)](clarify-data-bias-metric-total-variation-distance.md)
      + `KS` for [Kolmogorov-Smirnov (KS)](clarify-data-bias-metric-kolmogorov-smirnov.md)
      + `CDDL` for [Conditional Demographic Disparity (CDD)](clarify-data-bias-metric-cddl.md)
  + **post\_training\_bias** – Include this method if you want to compute post-training bias metrics. The detailed description of the metrics can be found in [Post-training Data and Model Bias Metrics](clarify-measure-post-training-bias.md). The `post_training_bias` object has the following parameters.
    + **methods** – An array that contains any of the post-training bias metrics from the following list that you want to compute. Set `methods` to **all** to compute all post-training bias metrics. As an example, the array `["DPPL", "DI"]` computes the **Difference in Positive Proportions in Predicted Labels** and **Disparate Impact**. The available methods are as follows.
      + `DPPL` for [Difference in Positive Proportions in Predicted Labels (DPPL)](clarify-post-training-bias-metric-dppl.md)
      + `DI` for [Disparate Impact (DI)](clarify-post-training-bias-metric-di.md)
      + `DCA` for [Difference in Conditional Acceptance (DCAcc)](clarify-post-training-bias-metric-dcacc.md)
      + `DCR` for [Difference in Conditional Rejection (DCR)](clarify-post-training-bias-metric-dcr.md)
      + `SD` for [Specificity difference (SD)](clarify-post-training-bias-metric-sd.md)
      + `RD` for [Recall Difference (RD)](clarify-post-training-bias-metric-rd.md)
      + `DAR` for [Difference in Acceptance Rates (DAR)](clarify-post-training-bias-metric-dar.md)
      + `DRR` for [Difference in Rejection Rates (DRR)](clarify-post-training-bias-metric-drr.md)
      + `AD` for [Accuracy Difference (AD)](clarify-post-training-bias-metric-ad.md)
      + `TE` for [Treatment Equality (TE)](clarify-post-training-bias-metric-te.md)
      + `CDDPL` for [Conditional Demographic Disparity in Predicted Labels (CDDPL)](clarify-post-training-bias-metric-cddpl.md)
      + `FT` for [Counterfactual Fliptest (FT)](clarify-post-training-bias-metric-ft.md)
      + `GE` for [Generalized entropy (GE)](clarify-post-training-bias-metric-ge.md)
  + **shap** – Include this method if you want to compute SHAP values. The SageMaker Clarify processing job supports the Kernel SHAP algorithm. The `shap` object has the following parameters.
    + **baseline** – (Optional) The SHAP baseline dataset, also known as the background dataset. Additional requirements for the baseline dataset in a tabular dataset or computer vision problem are as follows. For more information about SHAP baselines, see [SHAP Baselines for Explainability](clarify-feature-attribute-shap-baselines.md).
      + For a **tabular** dataset, `baseline` can be either the in-place baseline data or the S3 URI of a baseline file. If `baseline` is not provided, the SageMaker Clarify processing job computes a baseline by clustering the input dataset. The following are required of the baseline:
        + The format must be the same as the dataset format specified by `dataset_type`.
        + The baseline can only contain features that the model can accept as input.
        + The baseline dataset can have one or more instances. The number of baseline instances directly affects the synthetic dataset size and job runtime.
        + If `text_config` is specified, then the baseline value of a text column is a string used to replace the unit of text specified by `granularity`. For example, one common placeholder is "[MASK]", which is used to represent a missing or unknown word or piece of text. 

        The following examples show how to set in-place baseline data for different `dataset_type` parameters:
        + If `dataset_type` is either `text/csv` or `application/x-parquet`, suppose that the model accepts four numeric features, and the baseline has two instances. In this example, if one record has all zero feature values and the other record has all one feature values, then `baseline` should be set to `[[0,0,0,0],[1,1,1,1]]`, without any header.
        + If `dataset_type` is `application/jsonlines`, suppose that `features` is the key to a list of four numeric feature values. In this example, if the baseline has one record of all zero values, then `baseline` should be `[{"features":[0,0,0,0]}]`.
        + If `dataset_type` is `application/json`, the `baseline` dataset should have the same structure and format as the input dataset.
      + For **computer vision** problems, `baseline` can be the S3 URI of an image that is used to mask out features (segments) from the input image. The SageMaker Clarify processing job loads the mask image and resizes it to the same resolution as the input image. If baseline is not provided, the SageMaker Clarify processing job generates a mask image of [white noise](https://en.wikipedia.org/wiki/White_noise) at the same resolution as the input image.
    + **features\_to\_explain** – (Optional) An array of strings or zero-based indices of feature columns to compute SHAP values for. If `features_to_explain` is not provided, SHAP values are computed for all feature columns. These feature columns cannot include the label column or predicted label column. The `features_to_explain` parameter is only supported for tabular datasets with numeric and categorical columns.
    + **num\_clusters** – (Optional) The number of clusters that the dataset is divided into to compute the baseline dataset. Each cluster is used to compute one baseline instance. If `baseline` is not specified, the SageMaker Clarify processing job attempts to compute the baseline dataset by dividing the tabular dataset into an optimal number of clusters between `1` and `12`. The number of baseline instances directly affects the runtime of SHAP analysis.
    + **num\_samples** – (Optional) The number of samples to be used in the Kernel SHAP algorithm. If `num_samples` is not provided, the SageMaker Clarify processing job chooses the number for you. The number of samples directly affects both the synthetic dataset size and job runtime.
    + **seed** – (Optional) An integer used to initialize the pseudo random number generator in the SHAP explainer to generate consistent SHAP values for the same job. If `seed` is not specified, then each time that the same job runs, the model may output slightly different SHAP values.
    + **use\_logit** – (Optional) A Boolean value that indicates that you want the logit function to be applied to the model predictions. Defaults to `false`. If `use_logit` is `true`, then the SHAP values are calculated using the logistic regression coefficients, which can be interpreted as log-odds ratios.
    + **save\_local\_shap\_values** – (Optional) A Boolean value that indicates that you want the local SHAP values of each record in the dataset to be included in the analysis result. Defaults to `false`.

      If the main dataset is split across multiple files or distributed processing is activated, also specify an identifier column using the parameter `joinsource_name_or_index`. The identifier column and the local SHAP values are saved in the analysis result. This way, you can map each record to its local SHAP values.
    + **agg\_method** – (Optional) The method used to aggregate the local SHAP values (the SHAP values for each instance) of all instances to the global SHAP values (the SHAP values for the entire dataset). Defaults to `mean_abs`. The following methods can be used to aggregate SHAP values.
      + **mean\_abs** – The mean of absolute local SHAP values of all instances.
      + **mean\_sq** – The mean of squared local SHAP values of all instances.
      + **median** – The median of local SHAP values of all instances.
    + **text\_config** – Required for natural language processing explainability. Include this configuration if you want to treat text columns as text and have explanations provided for individual units of text. For an example of an analysis configuration for natural language processing explainability, see [Analysis configuration for natural language processing explainability](#clarify-analysis-configure-nlp-example).
      + **granularity** – The unit of granularity for the analysis of text columns. Valid values are `token`, `sentence`, or `paragraph`. **Each unit of text is considered a feature**, and local SHAP values are computed for each unit.
      + **language** – The language of the text columns. Valid values are **chinese**, **danish**, **dutch**, **english**, **french**, **german**, **greek**, **italian**, **japanese**, **lithuanian**, **multi-language**, **norwegian bokmål**, **polish**, **portuguese**, **romanian**, **russian**, **spanish**, **afrikaans**, **albanian**, **arabic**, **armenian**, **basque**, **bengali**, **bulgarian**, **catalan**, **croatian**, **czech**, **estonian**, **finnish**, **gujarati**, **hebrew**, **hindi**, **hungarian**, **icelandic**, **indonesian**, **irish**, **kannada**, **kyrgyz**, **latvian**, **ligurian**, **luxembourgish**, **macedonian**, **malayalam**, **marathi**, **nepali**, **persian**, **sanskrit**, **serbian**, **setswana**, **sinhala**, **slovak**, **slovenian**, **swedish**, **tagalog**, **tamil**, **tatar**, **telugu**, **thai**, **turkish**, **ukrainian**, **urdu**, **vietnamese**, **yoruba**. Enter `multi-language` for a mix of multiple languages.
      + **max\_top\_tokens** – (Optional) The maximum number of top tokens, based on global SHAP values. Defaults to `50`. A token can appear multiple times in the dataset. The SageMaker Clarify processing job aggregates the local SHAP values of each token, and then selects the top tokens based on their global SHAP values. The global SHAP values of the selected top tokens are included in the `global_top_shap_text` section of the analysis.json file.
    + **image\_config** – Required for computer vision explainability. Include this configuration if you have an input dataset consisting of images and you want to analyze them for explainability in a computer vision problem.
      + **model\_type** – The type of the model. Valid values include:
        + `IMAGE_CLASSIFICATION` for an image classification model.
        + `OBJECT_DETECTION` for an object detection model.
      + **max\_objects** – Applicable only when `model_type` is **OBJECT\_DETECTION**. The maximum number of objects, ordered by confidence score, detected by the computer vision model. Any objects ranked lower than the top `max_objects` by confidence score are filtered out. Defaults to `3`.
      + **context** – Applicable only when `model_type` is **OBJECT\_DETECTION**. Indicates whether the area around the bounding box of the detected object is masked by the baseline image. Valid values are `0` to mask everything, or `1` to mask nothing. Defaults to `1`.
      + **iou\_threshold** – Applicable only when `model_type` is **OBJECT\_DETECTION**. The minimum intersection over union (IOU) metric for evaluating predictions against the original detection. A high IOU metric corresponds to a large overlap between the predicted and the ground truth detection box. Defaults to `0.5`.
      + **num\_segments** – (Optional) An integer that determines the approximate number of segments to be labeled in the input image. Each segment of the image is considered a feature, and local SHAP values are computed for each segment. Defaults to `20`.
      + **segment\_compactness** – (Optional) An integer that determines the shape and size of the image segments generated by the [scikit-image slic](https://scikit-image.org/docs/dev/api/skimage.segmentation.html#skimage.segmentation.slic) method. Defaults to `5`.
  + **pdp** – Include this method to compute partial dependence plots (PDPs). For an example of an analysis configuration to generate PDPs, see [Compute partial dependence plots (PDPs)](#clarify-analysis-configure-csv-example-pdp).
    + **features** – Mandatory if the `shap` method is not requested. An array of feature names or indices for which to compute and plot PDPs.
    + **top\_k\_features** – (Optional) Specifies the number of top features used to generate PDPs. If `features` is not provided, but the `shap` method is requested, then the SageMaker Clarify processing job chooses the top features based on their SHAP attributions. Defaults to `10`.
    + **grid\_resolution** – The number of buckets to divide the range of numeric values into. This specifies the granularity of the grid for the PDPs.
  + **asymmetric\_shapley\_value** – Include this method if you want to compute explainability metrics for time-series forecasting models. The SageMaker Clarify processing job supports the asymmetric Shapley values algorithm. Asymmetric Shapley values are a variant of Shapley values that drop the symmetry axiom. For more information, see [Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability](https://arxiv.org/abs/1910.06358). Use these values to determine how features contribute to the forecasting outcome. Asymmetric Shapley values take into account the temporal dependencies of the time series data that forecasting models take as input.

    The algorithm includes the following parameters:
    + **direction** – Available types are `chronological`, `anti_chronological`, and `bidirectional`. The temporal structure can be navigated in chronological or anti-chronological order or both. Chronological explanations are built by iteratively adding information from the first time step onward. Anti-chronological explanations add information starting from the last step and moving backward. The latter order may be more appropriate in the presence of recency bias, such as for forecasting stock prices.
    + **granularity** – The explanation granularity to be used. The available granularity options are shown as follows:
      + **timewise** – `timewise` explanations are inexpensive and provide information about specific time steps only, such as figuring out how much the information of the nth day in the past contributed to the forecast of the mth day in the future. The resulting attributions do not individually explain static covariates and do not differentiate between target and related time series.
      + **fine\_grained** – `fine_grained` explanations are computationally more intensive but provide a full breakdown of all attributions of the input variables. The method computes approximate explanations to reduce runtime. For more information, see the following parameter `num_samples`.
**Note**  
`fine_grained` explanations only support `chronological` order.
    + **num\_samples** – (Optional) This argument is required for `fine_grained` explanations. The higher the number, the more precise the approximation. This number should scale with the dimensionality of the input features. A rule of thumb is to set this variable to *(1 + max(number of related time series, number of static covariates))^2* if the result is not too big. For example, with three related time series and two static covariates, the rule of thumb gives (1 + max(3, 2))^2 = 16 samples.
    + **baseline** – (Optional) The baseline config to replace out-of-coalition values for the corresponding datasets (also known as background data). The following snippet shows an example of a baseline config:

      ```
      {
          "related_time_series": "zero",
          "static_covariates": {
              <item_id_1>: [0, 2],
              <item_id_2>: [-1, 1]
          },
          "target_time_series": "zero"
      }
      ```
      + For temporal data, such as target time series or related time series, the baseline value can be one of the following:
        + `zero` — All out-of-coalition values are replaced with 0.0.
        + `mean` — All out-of-coalition values are replaced with the average of a time series.
      + For static covariates, provide a baseline entry only when the model request takes static covariate values; in that case, this field is required. The baseline should be provided for every item as a list. For example, if you have a dataset with two static covariates, your baseline config could be the following:

        ```
        "static_covariates": {
            <item_id_1>: [1, 1],
            <item_id_2>: [0, 1]
        }
        ```

        In the preceding example, *<item\_id\_1>* and *<item\_id\_2>* are the item IDs from the dataset.
  + **report** – (Optional) Use this object to customize the analysis report. This parameter is not supported for time series explanation jobs. There are three copies of the same report as part of the analysis result: Jupyter Notebook report, HTML report, and PDF report. The object has the following parameters:
    + **name** – File name of the report files. For example, if `name` is **MyReport**, then the report files are `MyReport.ipynb`, `MyReport.html`, and `MyReport.pdf`. Defaults to `report`.
    + **title** – (Optional) Title string for the report. Defaults to **SageMaker AI Analysis Report**.
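
    For example, a `report` configuration like the following sketch produces `MyReport.ipynb`, `MyReport.html`, and `MyReport.pdf` with a custom title. The title string here is illustrative:

    ```
    "report": {
        "name": "MyReport",
        "title": "Loan Model Bias and Explainability Report"
    }
    ```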
+ **predictor** – Required if the analysis requires predictions from the model. For example, when the `shap`, `asymmetric_shapley_value`, `pdp`, or `post_training_bias` method is requested, but predicted labels are not provided as part of the input dataset. The following are parameters to be used in conjunction with `predictor`:
  + **model\_name** – The name of your SageMaker AI model created by the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. If you specify `model_name` instead of `endpoint_name`, the SageMaker Clarify processing job creates an ephemeral endpoint with the model name, known as a **shadow endpoint**, and gets predictions from the endpoint. The job deletes the shadow endpoint after the computations are completed. If the model is multi-model, then the `target_model` parameter must be specified. For more information about multi-model endpoints, see [Multi-model endpoints](multi-model-endpoints.md).
  + **endpoint\$1name\$1prefix** – (Optional) A custom name prefix for the shadow endpoint. Applicable if you provide `model_name` instead of `endpoint_name`. For example, provide `endpoint_name_prefix` if you want to restrict access to the endpoint by endpoint name. The prefix must match the [EndpointName](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html#sagemaker-CreateEndpoint-request-EndpointName) pattern, and its maximum length is `23`. Defaults to `sm-clarify`.
  + **initial\_instance\_count** – Specifies the number of instances for the shadow endpoint. Required if you provide `model_name` instead of `endpoint_name`. The value for `initial_instance_count` can be different from the [InstanceCount](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_ProcessingClusterConfig.html#sagemaker-Type-ProcessingClusterConfig-InstanceCount) of the job, but we recommend a 1:1 ratio.
  + **instance\$1type** – Specifies the instance type for the shadow endpoint. Required if you provide `model_name` instead of `endpoint_name`. As an example, `instance_type` can be set to "ml.m5.large". In some cases, the value specified for `instance_type` can help reduce model inference time. For example, to run efficiently, natural language processing models and computer vision models typically require a graphics processing unit (GPU) instance type.
  + **endpoint\$1name** – The name of your SageMaker AI endpoint created by the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API. If provided, `endpoint_name` takes precedence over the `model_name` parameter. Using an existing endpoint reduces the shadow endpoint bootstrap time, but it can also cause a significant increase in load for that endpoint. Additionally, some analysis methods (such as `shap` and `pdp`) generate synthetic datasets that are sent to the endpoint. This can cause the endpoint's metrics or captured data to be contaminated by synthetic data, which may not accurately reflect real-world usage. For these reasons, it's generally not recommended to use an existing production endpoint for SageMaker Clarify analysis.
  + **target\_model** – The string value that is passed on to the `TargetModel` parameter of the SageMaker AI [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#RequestSyntax) API. Required if your model (specified by the `model_name` parameter) or endpoint (specified by the `endpoint_name` parameter) is multi-model. For more information about multi-model endpoints, see [Multi-model endpoints](multi-model-endpoints.md).
  + **custom\$1attributes** – (Optional) A string that allows you to provide additional information about a request for an inference that is submitted to the endpoint. The string value is passed to the `CustomAttributes` parameter of the SageMaker AI [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#RequestSyntax) API.
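
    For example, the following sketch passes a hypothetical trace identifier that your endpoint can use to correlate Clarify requests in its logs. The key and value are illustrative:

    ```
    "custom_attributes": "trace_id=c000b4f9-df62-4c85-a0bf-7c525f9104a4"
    ```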
  + **content\_type** – The model input format to be used for getting predictions from the endpoint. If provided, it is passed to the `ContentType` parameter of the SageMaker AI [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#RequestSyntax) API.
    + For computer vision explainability, the valid values are **image/jpeg**, **image/png** or **application/x-npy**. If `content_type` is not provided, the default value is **image/jpeg**.
    + For time series forecasting explainability, the valid value is **application/json**.
    + For other types of explainability, the valid values are **text/csv**, **application/jsonlines**, and **application/json**. A value for `content_type` is required if the `dataset_type` is **application/x-parquet**. Otherwise, `content_type` defaults to the value of the `dataset_type` parameter.
  + **accept\$1type** – The model output format to be used for getting predictions from the endpoint. The value for `accept_type` is passed to the `Accept` parameter of the SageMaker AI [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#RequestSyntax) API.
    + For computer vision explainability, if `model_type` is **OBJECT\_DETECTION**, then `accept_type` defaults to **application/json**.
    + For time series forecasting explainability, the valid value is **application/json**.
    + For other types of explainability, the valid values are **text/csv**, **application/jsonlines**, and **application/json**. If a value for `accept_type` is not provided, `accept_type` defaults to the value of the `content_type` parameter.
  + **content\$1template** – A template string used to construct the model input from dataset records. The parameter `content_template` is only used and required if the value of the `content_type` parameter is either `application/jsonlines` or `application/json`. 

    When the `content_type` parameter is `application/jsonlines`, the template should have only one placeholder, `$features`, which is replaced by a features list at runtime. For example, if the template is `"{\"myfeatures\":$features}"`, and if a record has three numeric feature values: `1`, `2` and `3`, then the record will be sent to the model as JSON Line `{"myfeatures":[1,2,3]}`. 

    When the `content_type` is `application/json`, the template can have either the placeholder `$record` or `$records`. If the placeholder is `$record`, a single record is replaced with a record that has the template in `record_template` applied to it. In this case, only a single record is sent to the model at a time. If the placeholder is `$records`, the records are replaced by a list of records, each with a template supplied by `record_template`.
  + **record\$1template** – A template string to be used to construct each record of the model input from dataset instances. It is only used and required when `content_type` is `application/json`. The template string may contain one of the following:
    + A placeholder `$features` that is substituted by an array of feature values. An additional optional placeholder, `$feature_names`, is substituted by an array of feature column header names.
    + Exactly one placeholder `$features_kvp` that is substituted by key-value pairs of feature name and feature value.
    + A feature in the `headers` configuration. For example, a feature name `A`, notated by the placeholder syntax `"${A}"`, is substituted by the feature value for `A`.

    The value for `record_template` is used with `content_template` to construct the model input. A configuration example showing how to construct a model input using a content and record template follows.

    In the following code example, the headers and features are defined as follows.
    + `headers`: `["A", "B"]`
    + `features`: `[[0,1], [3,4]]`

    The example model input is as follows.

    ```
    {
        "instances": [[0, 1], [3, 4]],
        "feature_names": ["A", "B"]
    }
    ```

    The example `content_template` and `record_template` parameter values to construct the previous example model input follow.
    + `content_template: "{\"instances\": $records, \"feature_names\": $feature_names}"`
    + `record_template: "$features"`

    In the following code example, the same headers and features are used. The example model input is as follows.

    ```
    [
        { "A": 0, "B": 1 },
        { "A": 3, "B": 4 }
    ]
    ```

    The example `content_template` and `record_template` parameter values to construct the previous example model input follow.
    + `content_template: "$records"`
    + `record_template: "$features_kvp"`

    An alternate code example to construct the previous example model input follows.
    + `content_template: "$records"`
    + `record_template: "{\"A\": \"${A}\", \"B\": \"${B}\"}"`

    In the following code example, the same headers and features are used. The example model input, a single record, is as follows.

    ```
    { "A": 0, "B": 1 }
    ```

    The example `content_template` and `record_template` parameter values to construct the previous example model input follow.
    + `content_template: "$record"`
    + `record_template: "$features_kvp"`

    For more examples, see [Endpoint requests for time series data](clarify-processing-job-data-format-time-series-request-jsonlines.md).
  + **label** – (Optional) A zero-based integer index or JMESPath expression string used to extract predicted labels from the model output for bias analysis. This parameter is not supported for time series. If the model is multiclass and the `label` parameter extracts all of the predicted labels from the model output, then the following apply.
    + The `probability` parameter is required to get the corresponding probabilities (or scores) from the model output.
    + The predicted label of the highest score is chosen.

    The value for `label` depends on the value of the `accept_type` parameter as follows.
    + If `accept_type` is **text/csv**, then `label` is the index of any predicted labels in the model output.
    + If `accept_type` is **application/jsonlines** or **application/json**, then `label` is a JMESPath expression that's applied to the model output to get the predicted labels.
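
    For example, suppose a model returns the JSON Lines output in the following sketch. Setting `label` to the JMESPath expression `predicted_label` extracts the predicted label `1`. If the same model instead returned the CSV row `1,0.825382471084594`, setting `label` to `0` would extract the predicted label from the first column.

    ```
    {"predicted_label":1,"probability":0.825382471084594}
    ```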
  + **label\$1headers** – (Optional) An array of values that the label can take in the dataset. If bias analysis is requested, then the `probability` parameter is also required to get the corresponding probability values (scores) from model output, and the predicted label of the highest score is chosen. If explainability analysis is requested, the label headers are used to beautify the analysis report. A value for `label_headers` is required for computer vision explainability. For example, for a multiclass classification problem, if the label has three possible values, **bird**, **cat**, and **dog**, then `label_headers` should be set to `["bird","cat","dog"]`.
  + **probability** – (Optional) A zero-based integer index or a JMESPath expression string used to extract probabilities (scores) for explainability analysis (but not for time series explainability), or to choose the predicted label for bias analysis. The value of `probability` depends on the value of the `accept_type` parameter as follows.
    + If `accept_type` is **text/csv**, `probability` is the index of the probabilities (scores) in the model output. If `probability` is not provided, the entire model output is taken as the probabilities (scores).
    + If `accept_type` is JSON data (either **application/jsonlines** or **application/json**), `probability` should be a JMESPath expression that is used to extract the probabilities (scores) from the model output.
  + **time\_series\_predictor\_config** – (Optional) Used only for time series explainability. Instructs the SageMaker Clarify processor how to correctly parse the data passed as an S3 URI in `dataset_uri`.
    + **forecast** – A JMESPath expression used to extract the forecast result.

## Example analysis configuration files
<a name="clarify-processing-job-configure-analysis-examples"></a>

The following sections contain example analysis configuration files for data in CSV format, JSON Lines format, and for natural language processing (NLP), computer vision (CV), and time series (TS) explainability.

### Analysis configuration for a CSV dataset
<a name="clarify-analysis-configure-csv-example"></a>

The following examples show how to configure bias and explainability analysis for a tabular dataset in CSV format. In these examples, the incoming dataset has four feature columns, and one binary label column, `Target`. The contents of the dataset are as follows. A label value of `1` indicates a positive outcome. The dataset is provided to the SageMaker Clarify job by the `dataset` processing input.

```
"Target","Age","Gender","Income","Occupation"
0,25,0,2850,2
1,36,0,6585,0
1,22,1,1759,1
0,48,0,3446,1
...
```

The following sections show how to compute pre-training and post-training bias metrics, SHAP values, and partial dependence plots (PDPs) showing feature importance for a dataset in CSV format. 

#### Compute all of the pre-training bias metrics
<a name="clarify-analysis-configure-csv-example-metrics"></a>

This example configuration shows how to measure whether the previous sample dataset is favorably biased towards samples with a **Gender** value of `0`. The following analysis configuration instructs the SageMaker Clarify processing job to compute all the pre-training bias metrics for the dataset.

```
{
    "dataset_type": "text/csv",
    "label": "Target",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        }
    }
}
```

#### Compute all of the post-training bias metrics
<a name="clarify-analysis-configure-csv-example-postmetrics"></a>

You can compute pre-training bias metrics prior to training. However, you must have a trained model to compute post-training bias metrics. The following example output is from a binary classification model that outputs data in CSV format. In this example output, each row contains two columns. The first column contains the predicted label, and the second column contains the probability value for that label. 

```
0,0.028986845165491
1,0.825382471084594
...
```

The following configuration example instructs the SageMaker Clarify processing job to compute all possible bias metrics using the dataset and the predictions from the model output. In the example, the model is deployed to a SageMaker AI endpoint `your_endpoint`.

**Note**  
In the following example code, the parameters `content_type` and `accept_type` are not set. Therefore, they automatically use the value of the parameter `dataset_type`, which is `text/csv`.

```
{
    "dataset_type": "text/csv",
    "label": "Target",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "label": 0
    }
}
```

#### Compute the SHAP values
<a name="clarify-analysis-configure-csv-example-shap"></a>

The following example analysis configuration instructs the job to compute the SHAP values designating the `Target` column as labels and all other columns as features.

```
{
    "dataset_type": "text/csv",
    "label": "Target",
    "methods": {
        "shap": {
            "num_clusters": 1
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "probability": 1
    }
}
```

In this example, the SHAP `baseline` parameter is omitted and the value of the `num_clusters` parameter is `1`. This instructs the SageMaker Clarify processor to compute one SHAP baseline sample. In this example, `probability` is set to `1`. This instructs the SageMaker Clarify processing job to extract the probability score from the second column of the model output (using zero-based indexing).
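
Instead of relying on `num_clusters`, you can provide the SHAP `baseline` inline as a list of records. The following sketch assumes a single baseline record whose four values are hypothetical stand-ins for typical `Age`, `Gender`, `Income`, and `Occupation` values:

```
"shap": {
    "baseline": [[35, 0, 3500, 1]]
}
```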

#### Compute partial dependence plots (PDPs)
<a name="clarify-analysis-configure-csv-example-pdp"></a>

The following example shows how to view the importance of the `Income` feature on the analysis report using PDPs. The `report` parameter instructs the SageMaker Clarify processing job to generate a report. After the job completes, the generated report is saved as `report.pdf` to the `analysis_result` location. The `grid_resolution` parameter divides the range of the feature values into `10` buckets. Together, the parameters specified in the following example instruct the SageMaker Clarify processing job to generate a report containing a PDP graph for `Income` with `10` segments on the x-axis. The y-axis will show the marginal impact of `Income` on the predictions.

```
{
    "dataset_type": "text/csv",
    "label": "Target",
    "methods": {
        "pdp": {
            "features": ["Income"],
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "probability": 1
    }
}
```

#### Compute both bias metrics and feature importance
<a name="clarify-analysis-configure-csv-example-fi"></a>

You can combine all the methods from the previous configuration examples into a single analysis configuration file and compute them all with a single job. The following example shows an analysis configuration with all steps combined.

In this example, the `probability` parameter is set to `1` to indicate that probabilities are contained in the second column (using zero-based indexing). However, because bias analysis needs a predicted label, the `probability_threshold` parameter is set to `0.5` to convert the probability score into a binary label. For example, a score of `0.825` becomes the label `1`, and a score of `0.029` becomes the label `0`. In this example, the `top_k_features` parameter of the partial dependence plots `pdp` method is set to `2`. This instructs the SageMaker Clarify processing job to compute partial dependence plots (PDPs) for the top `2` features with the largest global SHAP values.

```
{
    "dataset_type": "text/csv",
    "label": "Target",
    "probability_threshold": 0.5,
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        },
        "shap": {
            "num_clusters": 1
        },
        "pdp": {
            "top_k_features": 2,
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "probability": 1
    }
}
```

Instead of deploying the model to an endpoint, you can provide the name of your SageMaker AI model to the SageMaker Clarify processing job using the `model_name` parameter. The following example shows how to specify a model named `your_model`. The SageMaker Clarify processing job will create a shadow endpoint using the configuration.

```
{
     ...
    "predictor": {
        "model_name": "your_model",
        "initial_instance_count": 1,
        "instance_type": "ml.m5.large",
        "probability": 1
    }
}
```

### Analysis configuration for a JSON Lines dataset
<a name="clarify-analysis-configure-JSONLines-example"></a>

The following examples show how to configure bias analysis and explainability analysis for a tabular dataset in JSON Lines format. In these examples, the incoming dataset has the same data as the previous section, but it is in the SageMaker AI JSON Lines dense format. Each line is a valid JSON object. The key `Features` points to an array of feature values, and the key `Label` points to the ground truth label. The dataset is provided to the SageMaker Clarify job by the `dataset` processing input. For more information about JSON Lines, see [JSONLINES request format](cdf-inference.md#cm-jsonlines).

```
{"Features":[25,0,2850,2],"Label":0}
{"Features":[36,0,6585,0],"Label":1}
{"Features":[22,1,1759,1],"Label":1}
{"Features":[48,0,3446,1],"Label":0}
...
```

The following sections show how to compute pre-training and post-training bias metrics, SHAP values, and partial dependence plots (PDPs) showing feature importance for a dataset in JSON Lines format.

#### Compute pre-training bias metrics
<a name="clarify-analysis-configure-JSONLines-pretraining"></a>

Specify the label, features, format, and methods to measure pre-training bias metrics for a `Gender` value of `0`. In the following example, the `headers` parameter provides the feature names first. The label name is provided last. By convention, the last header is the label header. 

The `features` parameter is set to the JMESPath expression `Features` so that the SageMaker Clarify processing job can extract the array of features from each record. The `label` parameter is set to the JMESPath expression `Label` so that the SageMaker Clarify processing job can extract the ground truth label from each record. Use a facet name to specify the sensitive attribute, as follows.

```
{
    "dataset_type": "application/jsonlines",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "Label",
    "features": "Features",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        }
    }
}
```

#### Compute all the bias metrics
<a name="clarify-analysis-configure-JSONLines-bias"></a>

You must have a trained model to compute post-training bias metrics. The following example output is from a binary classification model that outputs data in JSON Lines format. Each row of the model output is a valid JSON object. The key `predicted_label` points to the predicted label, and the key `probability` points to the probability value.

```
{"predicted_label":0,"probability":0.028986845165491}
{"predicted_label":1,"probability":0.825382471084594}
...
```

You can deploy the model to a SageMaker AI endpoint named `your_endpoint`. The following example analysis configuration instructs the SageMaker Clarify processing job to compute all possible bias metrics for both the dataset and the model. In this example, the parameters `content_type` and `accept_type` are not set. Therefore, they are automatically set to use the value of the parameter `dataset_type`, which is `application/jsonlines`. The SageMaker Clarify processing job uses the `content_template` parameter to compose the model input by replacing the `$features` placeholder with an array of features.

```
{
    "dataset_type": "application/jsonlines",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "Label",
    "features": "Features",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "{\"Features\":$features}",
        "label": "predicted_label"
    }
}
```

#### Compute the SHAP values
<a name="clarify-analysis-configure-JSONLines-shap"></a>

Because SHAP analysis doesn’t need a ground truth label, the `label` parameter is omitted. In this example, the `headers` parameter is also omitted. Therefore, the SageMaker Clarify processing job generates placeholders using generic names like `column_0` or `column_1` for feature headers, and `label0` for a label header. You can specify values for `headers` and for a `label` to improve the readability of the analysis result. Because the `probability` parameter is set to the JMESPath expression `probability`, the probability value will be extracted from the model output. The following is an example to calculate SHAP values.

```
{
    "dataset_type": "application/jsonlines",
    "features": "Features",
    "methods": {
        "shap": {
            "num_clusters": 1
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "{\"Features\":$features}",
        "probability": "probability"
    }
}
```

#### Compute partial dependence plots (PDPs)
<a name="clarify-analysis-configure-JSONLines-pdp"></a>

The following example shows how to view the importance of "Income" on PDP. In this example, the feature headers are not provided. Therefore, the `features` parameter of the `pdp` method must use zero-based index to refer to location of the feature column. The `grid_resolution` parameter divides the range of the feature values into `10` buckets. Together, the parameters in the example instruct the SageMaker Clarify processing job to generate a report containing a PDP graph for `Income` with `10` segments on the x-axis. The y-axis will show the marginal impact of `Income` on the predictions.

```
{
    "dataset_type": "application/jsonlines",
    "features": "Features",
    "methods": {
        "pdp": {
            "features": [2],
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "{\"Features\":$features}",
        "probability": "probability"
    }
}
```

#### Compute both bias metrics and feature importance
<a name="clarify-analysis-configure-JSONLines-fi-metrics"></a>

You can combine all previous methods into a single analysis configuration file and compute them all with a single job. The following example shows an analysis configuration with all steps combined. In this example, the `probability` parameter is set. But because bias analysis needs a predicted label, the `probability_threshold` parameter is set to `0.5` to convert the probability score into a binary label. In this example, the `top_k_features` parameter of the `pdp` method is set to `2`. This instructs the SageMaker Clarify processing job to compute PDPs for the top `2` features with the largest global SHAP values.

```
{
    "dataset_type": "application/jsonlines",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "Label",
    "features": "Features",
    "probability_threshold": 0.5,
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        },
        "shap": {
            "num_clusters": 1
        },
        "pdp": {
            "top_k_features": 2,
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "{\"Features\":$features}",
        "probability": "probability"
    }
}
```

### Analysis configuration for a JSON dataset
<a name="clarify-analysis-configure-JSON-example"></a>

The following examples show how to configure bias and explainability analysis for a tabular dataset in JSON format. In these examples, the incoming dataset has the same data as the previous section, but it is in the SageMaker AI JSON dense format. For more information, see [JSONLINES request format](cdf-inference.md#cm-jsonlines).

The whole input request is valid JSON where the outer structure is a list and each element is the data for a record. Within each record, the key `Features` points to an array of feature values, and the key `Label` points to the ground truth label. The dataset is provided to the SageMaker Clarify job by the `dataset` processing input.

```
[
    {"Features":[25,0,2850,2],"Label":0},
    {"Features":[36,0,6585,0],"Label":1},
    {"Features":[22,1,1759,1],"Label":1},
    {"Features":[48,0,3446,1],"Label":0},
    ...
]
```

The following sections show how to compute pre-training and post-training bias metrics, SHAP values, and partial dependence plots (PDPs) that show feature importance for a dataset in JSON format.

#### Compute pre-training bias metrics
<a name="clarify-analysis-configure-JSON-example-pretraining"></a>

Specify the label, features, format, and methods to measure pre-training bias metrics for a `Gender` value of `0`. In the following example, the `headers` parameter provides the feature names first. The label name is provided last. For JSON datasets, the last header is the label header.

The `features` parameter is set to a JMESPath expression that extracts a 2D array or matrix. Each row in this matrix must contain the list of `Features` for each record. The `label` parameter is set to a JMESPath expression that extracts a list of ground truth labels. Each element in this list must contain the label for a record.
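
For example, applied to the preceding dataset, the JMESPath expression `[*].Features` extracts the following feature matrix, and `[*].Label` extracts the label list `[0, 1, 1, 0]`.

```
[
    [25,0,2850,2],
    [36,0,6585,0],
    [22,1,1759,1],
    [48,0,3446,1]
]
```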

Use a facet name to specify the sensitive attribute, as follows.

```
{
    "dataset_type": "application/json",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "[*].Label",
    "features": "[*].Features",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        }
    }
}
```

#### Compute all the bias metrics
<a name="clarify-analysis-configure-JSON-example-bias"></a>

You must have a trained model to compute post-training bias metrics. The following code example is from a binary classification model that outputs JSON data in the example's format. In the example, each element under `predictions` is the prediction output for a record. The example code contains the key `predicted_label`, which points to the predicted label, and the key `probability`, which points to the probability value.

```
{
    "predictions": [
        {"predicted_label":0,"probability":0.028986845165491},
        {"predicted_label":1,"probability":0.825382471084594},
        ...
    ]
}
```

You can deploy the model to a SageMaker AI endpoint named `your_endpoint`. 

In the following example, the parameters `content_type` and `accept_type` are not set. Therefore, `content_type` and `accept_type` are automatically set to use the value of the parameter `dataset_type`, which is `application/json`. The SageMaker Clarify processing job then uses the `content_template` parameter to compose the model input.

In the following example, the model input is composed by replacing the `$records` placeholder with an array of records. Then, the `record_template` parameter composes each record’s JSON structure and replaces the `$features` placeholder with each record’s array of features.

The following example analysis configuration instructs the SageMaker Clarify processing job to compute all possible bias metrics for both the dataset and the model.

```
{
    "dataset_type": "application/json",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "[*].Label",
    "features": "[*].Features",
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "$records",
        "record_template": "{\"Features\":$features}",
        "label": "predictions[*].predicted_label"
    }
}
```

#### Compute the SHAP values
<a name="clarify-analysis-configure-JSON-example-shap"></a>

You don’t need to specify a label for SHAP analysis. In the following example, the `headers` parameter is not specified. Therefore, the SageMaker Clarify processing job will generate placeholders using generic names like `column_0` or `column_1` for feature headers, and `label0` for a label header. You can specify values for `headers` and for a `label` to improve the readability of the analysis result. 

In the following configuration example, which calculates SHAP values, the `probability` parameter is set to a JMESPath expression that extracts the probability from the prediction for each record.

```
{
    "dataset_type": "application/json",
    "features": "[*].Features",
    "methods": {
        "shap": {
            "num_clusters": 1
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "$records",
        "record_template": "{\"Features\":$features}",
        "probability": "predictions[*].probability"
    }
}
```

#### Compute partial dependence plots (PDPs)
<a name="clarify-analysis-configure-JSON-example-pdp"></a>

The following example shows how to view feature importance in a PDP. In the example, the feature headers are not provided. Therefore, the `features` parameter of the `pdp` method must use a zero-based index to refer to the location of the feature column. The `grid_resolution` parameter divides the range of the feature values into `10` buckets.

Together, the parameters in the following example instruct the SageMaker Clarify processing job to generate a report containing a PDP graph for `Income` with `10` segments on the x-axis. The y-axis shows the marginal impact of `Income` on the predictions.

The following configuration example shows how to view the importance of `Income` on PDPs.

```
{
    "dataset_type": "application/json",
    "features": "[*].Features",
    "methods": {
        "pdp": {
            "features": [2],
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "$records",
        "record_template": "{\"Features\":$features}",
        "probability": "predictions[*].probability"
    }
}
```

#### Compute both bias metrics and feature importance
<a name="clarify-analysis-configure-JSON-example-bias-fi"></a>

You can combine all previous configuration methods into a single analysis configuration file and compute them all with a single job. The following example shows an analysis configuration with all steps combined. 

In this example, the `probability` parameter is set. Because bias analysis needs a predicted label, the `probability_threshold` parameter is set to `0.5`, which is used to convert the probability score into a binary label. In this example, the `top_k_features` parameter of the `pdp` method is set to `2`. This instructs the SageMaker Clarify processing job to compute PDPs for the top `2` features with the largest global SHAP values.

```
{
    "dataset_type": "application/json",
    "headers": ["Age","Gender","Income","Occupation","Target"],
    "label": "[*].Label",
    "features": "[*].Features",
    "probability_threshold": 0.5,
    "label_values_or_threshold": [1],
    "facet": [
        {
            "name_or_index": "Gender",
            "value_or_threshold": [0]
        }
    ],
    "methods": {
        "pre_training_bias": {
            "methods": "all"
        },
        "post_training_bias": {
            "methods": "all"
        },
        "shap": {
            "num_clusters": 1
        },
        "pdp": {
            "top_k_features": 2,
            "grid_resolution": 10
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_template": "$records",
        "record_template": "{\"Features\":$features}",
        "probability": "predictions[*].probability"
    }
}
```

### Analysis configuration for natural language processing explainability
<a name="clarify-analysis-configure-nlp-example"></a>

The following example shows an analysis configuration file for computing feature importance for natural language processing (NLP). In this example, the incoming dataset is a tabular dataset in CSV format, with one binary label column and two feature columns, as follows. The dataset is provided to the SageMaker Clarify job by the `dataset` processing input parameter.

```
0,2,"They taste gross"
1,3,"Flavor needs work"
1,5,"Taste is awful"
0,1,"The worst"
...
```

In this example, a binary classification model was trained on the previous dataset. The model accepts CSV data, and it outputs a single score between `0` and `1`, as follows.

```
0.491656005382537
0.569582343101501
...
```

The model is used to create a SageMaker AI model named `your_nlp_model`. The following analysis configuration shows how to run a token-wise explainability analysis using the model and dataset. The `text_config` parameter activates the NLP explainability analysis. The `granularity` parameter indicates that the analysis should parse tokens.

In English, each token is a word. The following example also shows how to provide an inline SHAP `baseline` instance using an average `Rating` of `4`. A special mask token `[MASK]` is used to replace a token (word) in `Comments`. This example also uses a GPU endpoint instance type to speed up inferencing.

```
{
    "dataset_type": "text/csv",
    "headers": ["Target","Rating","Comments"]
    "label": "Target",
    "methods": {
        "shap": {
            "text_config": {
                "granularity": "token",
                "language": "english"
            }
            "baseline": [[4,"[MASK]"]],
        }
    },
    "predictor": {
        "model_name": "your_nlp_model",
        "initial_instance_count": 1,
        "instance_type": "ml.g4dn.xlarge"
    }
}
```

### Analysis configuration for computer vision explainability
<a name="clarify-analysis-configure-computer-vision-example"></a>

The following example shows an analysis configuration file for computing feature importance for computer vision. In this example, the input dataset consists of JPEG images. The dataset is provided to the SageMaker Clarify job by the `dataset` processing input parameter. The example shows how to configure an explainability analysis using a SageMaker AI image classification model. In the example, a model named `your_cv_ic_model` has been trained to classify the animals in the input JPEG images.

```
{
    "dataset_type": "application/x-image",
    "methods": {
        "shap": {
             "image_config": {
                "model_type": "IMAGE_CLASSIFICATION",
                 "num_segments": 20,
                "segment_compactness": 10
             }
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "model_name": "your_cv_ic_model",
        "initial_instance_count": 1,
        "instance_type": "ml.p2.xlarge",
        "label_headers": ["bird","cat","dog"]
    }
}
```

For more information about image classification, see [Image Classification - MXNet](image-classification.md).

In this example, a [SageMaker AI object detection model](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html), `your_cv_od_model`, is trained on the same JPEG images to identify the animals in them. The following example shows how to configure an explainability analysis for the object detection model.

```
{
    "dataset_type": "application/x-image",
    "probability_threshold": 0.5,
    "methods": {
        "shap": {
             "image_config": {
                "model_type": "OBJECT_DETECTION",
                 "max_objects": 3,
                "context": 1.0,
                "iou_threshold": 0.5,
                 "num_segments": 20,
                "segment_compactness": 10
             }
        },
        "report": {
            "name": "report"
        }
    },
    "predictor": {
        "model_name": "your_cv_od_model",
        "initial_instance_count": 1,
        "instance_type": "ml.p2.xlarge",
        "label_headers": ["bird","cat","dog"]
    }
}
```

### Analysis configuration for time series forecast model explainability
<a name="clarify-analysis-configure-time-series-example"></a>

The following example shows an analysis configuration file for computing feature importance for a time series (TS). In this example, the incoming dataset is a time series dataset in JSON format with a set of dynamic and static covariate features. The dataset is provided to the SageMaker Clarify job by the dataset processing input parameter `dataset_uri`.

```
[
    {
        "item_id": "item1",
        "timestamp": "2019-09-11",
        "target_value": 47650.3,
        "dynamic_feature_1": 0.4576,
        "dynamic_feature_2": 0.2164,
        "dynamic_feature_3": 0.1906,
        "static_feature_1": 3,
        "static_feature_2": 4
    },
    {
        "item_id": "item1",
        "timestamp": "2019-09-12",
        "target_value": 47380.3,
        "dynamic_feature_1": 0.4839,
        "dynamic_feature_2": 0.2274,
        "dynamic_feature_3": 0.1889,
        "static_feature_1": 3,
        "static_feature_2": 4
    },
    {
        "item_id": "item2",
        "timestamp": "2020-04-23",
        "target_value": 35601.4,
        "dynamic_feature_1": 0.5264,
        "dynamic_feature_2": 0.3838,
        "dynamic_feature_3": 0.4604,
        "static_feature_1": 1,
        "static_feature_2": 2
    }
]
```

The following sections explain how to compute feature attributions for a forecasting model with the asymmetric Shapley values algorithm for a JSON dataset. 

#### Compute the explanations for time series forecasting models
<a name="clarify-processing-job-configure-analysis-feature-attr"></a>

The following example analysis configuration displays the options used by the job to compute the explanations for time series forecasting models.

```
{
    'dataset_type': 'application/json',
    'dataset_uri': 'DATASET_URI',
    'methods': {
        'asymmetric_shapley_value': {
            'baseline': {
                "related_time_series": "zero",
                "static_covariates": {
                    "item1": [0, 0], "item2": [0, 0]
                },
                "target_time_series": "zero"
            },
            'direction': 'chronological',
            'granularity': 'fine_grained',
            'num_samples': 10
        },
        'report': {'name': 'report', 'title': 'Analysis Report'}
    },
    'predictor': {
        'accept_type': 'application/json',
        'content_template': '{"instances": $records}',
        'endpoint_name': 'ENDPOINT_NAME', 
        'content_type': 'application/json',              
        'record_template': '{
            "start": $start_time, 
            "target": $target_time_series, 
            "dynamic_feat": $related_time_series, 
            "cat": $static_covariates
        }',
        'time_series_predictor_config': {'forecast': 'predictions[*].mean[:2]'}
    },
    'time_series_data_config': {
        'dataset_format': 'timestamp_records',
        'item_id': '[].item_id',
        'related_time_series': ['[].dynamic_feature_1', '[].dynamic_feature_2', '[].dynamic_feature_3'],
        'static_covariates': ['[].static_feature_1', '[].static_feature_2'],
        'target_time_series': '[].target_value',
        'timestamp': '[].timestamp'
    }
}
```

##### Time series explainability configuration
<a name="clarify-processing-job-configure-analysis-feature-attr-tsconfig"></a>

The preceding example uses `asymmetric_shapley_value` in `methods` to define the time series explainability arguments like baseline, direction, granularity, and number of samples. The baseline values are set for all three types of data: related time series, static covariates, and target time series. These fields instruct the SageMaker Clarify processor to compute feature attributions for one item at a time.

##### Predictor configuration
<a name="clarify-processing-job-configure-analysis-feature-attr-predictconfig"></a>

You can fully control the payload structure that the SageMaker Clarify processor sends by using the `content_template` and `record_template` parameters. In the preceding example, the `predictor` config instructs Clarify to aggregate records into `'{"instances": $records}'`, where each record is defined with the arguments given for `record_template` in the example. Note that `$start_time`, `$target_time_series`, `$related_time_series`, and `$static_covariates` are internal tokens used to map dataset values to endpoint request values.
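
For example, with the dataset shown earlier, a composed request for `item1` might look like the following sketch. The exact windowing of the time series depends on the job, so the array contents here are illustrative:

```
{
    "instances": [
        {
            "start": "2019-09-11",
            "target": [47650.3, 47380.3],
            "dynamic_feat": [[0.4576, 0.4839], [0.2164, 0.2274], [0.1906, 0.1889]],
            "cat": [3, 4]
        }
    ]
}
```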

Similarly, the attribute `forecast` in `time_series_predictor_config` is used to extract the model forecast from the endpoint response. For example, your endpoint batch response could be the following:

```
{
    "predictions": [
        {"mean": [13.4, 3.6, 1.0]}, 
        {"mean": [23.0, 4.7, 3.0]}, 
        {"mean": [3.4, 5.6, 2.0]}
    ]
}
```

Suppose you specify the following time series predictor configuration:

```
'time_series_predictor_config': {'forecast': 'predictions[*].mean[:2]'}
```

The forecast values are parsed as follows:

```
[
    [13.4, 3.6],
    [23.0, 4.7],
    [3.4, 5.6]
]
```

##### Data configuration
<a name="clarify-processing-job-configure-analysis-feature-attr-dataconfig"></a>

Use the `time_series_data_config` attribute to instruct the SageMaker Clarify processor to parse data correctly from the data passed as an S3 URI in `dataset_uri`. 

# Data Format Compatibility Guide
<a name="clarify-processing-job-data-format"></a>

This guide describes the data format types that are compatible with SageMaker Clarify processing jobs. The supported data format types include the file extensions, data structure, and specific requirements or restrictions for tabular, image, and time series datasets. This guide also shows how to check if your dataset conforms to these requirements.

At a high level, the SageMaker Clarify processing job follows the input–process–output model to compute bias metrics and feature attributions. Refer to the following examples for details.

The input to the SageMaker Clarify processing job consists of the following:
+ The dataset to be analyzed.
+ The analysis configuration. For more information about how to configure an analysis, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

During the processing stage, SageMaker Clarify computes bias metrics and feature attributions. The SageMaker Clarify processing job completes the following steps in the backend:
+ The SageMaker Clarify processing job parses your analysis configuration and loads your **dataset**.
+ To compute post-training bias metrics and feature attributions, the job requires model predictions from your model. The SageMaker Clarify processing job serializes your data and sends it as a **request** to your model that is deployed on a SageMaker AI real-time inference **endpoint**. After that, the SageMaker Clarify processing job extracts predictions from the **response**.
+ The SageMaker Clarify processing job performs the bias and explainability analysis, and then it outputs the results.

For more information, see [How SageMaker Clarify Processing Jobs Work](clarify-configure-processing-jobs.md#clarify-processing-job-configure-how-it-works).

The parameter that you use to specify the data format depends on where the data is used in the processing flow, as follows:
+ For an **input dataset**, use the `dataset_type` parameter to specify the format or MIME type.
+ For a **request** to an endpoint, use the `content_type` parameter to specify the format.
+ For a **response** from an endpoint, use the `accept_type` parameter to specify the format.

The input dataset, request, and the response to and from the endpoint don't require the same format. For example, you can use a Parquet dataset with a CSV **request** payload and a JSON Lines **response** payload given the following conditions.
+ Your analysis is configured correctly.
+ Your model supports the request and response formats.
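
For example, the following sketch shows the analysis configuration fields that are relevant to this Parquet, CSV, and JSON Lines combination. The endpoint name is a placeholder, and the model itself must accept CSV requests and return JSON Lines responses:

```
{
    "dataset_type": "application/x-parquet",
    "predictor": {
        "endpoint_name": "your_endpoint",
        "content_type": "text/csv",
        "accept_type": "application/jsonlines"
    }
}
```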

**Note**  
If `content_type` or `accept_type` are not provided, then the SageMaker Clarify container infers the `content_type` and `accept_type`.

**Topics**
+ [Tabular data](clarify-processing-job-data-format-tabular.md)
+ [Image data requirements](clarify-processing-job-data-format-image.md)
+ [Time series data](clarify-processing-job-data-format-time-series.md)

# Tabular data
<a name="clarify-processing-job-data-format-tabular"></a>

Tabular data refers to data that can be loaded into a two-dimensional data frame. In the frame, each row represents a record, and each record has one or more columns. The values within each data frame cell can be of numerical, categorical, or text data types.

## Tabular dataset prerequisites
<a name="clarify-processing-job-data-format-tabular-prereq"></a>

Prior to analysis, apply any necessary pre-processing steps to your dataset, such as data cleaning or feature engineering.

You can provide one or multiple datasets. If you provide multiple datasets, use the following to identify them to the SageMaker Clarify processing job.
+ Use either a [ProcessingInput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html) named `dataset` or the analysis configuration `dataset_uri` to specify the main dataset. For more information about `dataset_uri`, see the parameters list in [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
+ Use the `baseline` parameter provided in the analysis configuration file. The baseline dataset is required for SHAP analysis. For more information about the analysis configuration file, including examples, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The following table lists supported data formats, their file extensions, and MIME types.


| Data format | File extension | MIME type | 
| --- | --- | --- | 
|  CSV  |  csv  |  `text/csv`  | 
|  JSON Lines  |  jsonl  |  `application/jsonlines`  | 
|  JSON  |  json  |  `application/json`  | 
|  Parquet  |  parquet  |  `application/x-parquet`  | 

The following sections show example tabular datasets in CSV, JSON Lines, and Apache Parquet formats.

### Tabular dataset prerequisites in CSV format
<a name="clarify-processing-job-data-format-tabular-prereq-csv"></a>

The SageMaker Clarify processing job is designed to load CSV data files in the [csv.excel](https://docs.python.org/3/library/csv.html#csv.excel) dialect. However, it's flexible enough to support other line terminators, including `\n` and `\r`.

For compatibility, all CSV data files provided to the SageMaker Clarify processing job must be encoded in UTF-8.

If your dataset does not contain a header row, do the following:
+ Set the analysis configuration parameter `label` to index `0`. This indicates that the first column contains the ground truth label.
+ If the `headers` parameter is set, you can instead set `label` to the label column header to indicate the location of the label column. All other columns are designated as features.

  The following is an example of a dataset that does not contain a header row.

  ```
  1,5,2.8,2.538,This is a good product
  0,1,0.79,0.475,Bad shopping experience
  ...
  ```

If your data contains a header row, set the parameter `label` to the ground truth label header (`Label` in the following example) to indicate the location of the label column. All other columns are designated as features.

The following is an example of a dataset that contains a header row.

```
Label,Rating,A12,A13,Comments
1,5,2.8,2.538,This is a good product
0,1,0.79,0.475,Bad shopping experience
...
```
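
For the earlier headerless dataset, the column names can be supplied through the `headers` parameter and the label located by its header, as in the following sketch (a Python dictionary standing in for the JSON analysis configuration; other parameters omitted).

```
# Sketch: headerless CSV dataset; "headers" names the columns and "label"
# points at the ground truth column by its header.
csv_dataset_config = {
    "dataset_type": "text/csv",
    "headers": ["Label", "Rating", "A12", "A13", "Comments"],
    "label": "Label",  # all remaining columns are treated as features
}
```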

### Tabular dataset prerequisites in JSON format
<a name="clarify-processing-job-data-format-tabular-prereq-json"></a>

JSON is a flexible format for representing structured data of any level of complexity. SageMaker Clarify support for JSON is not restricted to a specific structure, which allows for more flexible data formats than CSV or JSON Lines datasets. This guide shows you how to set an analysis configuration for tabular data in JSON format.

**Note**  
To ensure compatibility, all JSON data files provided to the SageMaker Clarify processing job must be encoded in UTF-8.

The following is example input data with records that contain a top-level key, a list of features, and a label.

```
[
    {"features":[1,5,2.8,2.538,"This is a good product"],"label":1},
    {"features":[0,1,0.79,0.475,"Bad shopping experience"],"label":0},
    ...
]
```

An example analysis configuration for the previous input example dataset should set the following parameters, as shown in the sketch after this list:
+ The `label` parameter should use the [JMESPath](https://jmespath.org/) expression `[*].label` to extract the ground truth label for each record in the dataset. The JMESPath expression should produce a list of labels where the ith label corresponds to the ith record.
+ The `features` parameter should use the JMESPath expression `[*].features` to extract an array of features for each record in the dataset. The JMESPath expression should produce a 2D array or matrix where the ith row contains the feature values corresponding to the ith record.
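
The following sketch shows how these two parameters might appear, again as an annotated Python dictionary with unrelated parameters omitted.

```
# Sketch: JMESPath expressions for the flat JSON dataset shown above.
json_dataset_config = {
    "dataset_type": "application/json",  # MIME type of the input dataset
    "label": "[*].label",                # one ground truth label per record
    "features": "[*].features",          # 2D array; row i holds record i's features
}
```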

The following is example input data where a top-level key nests the records; each record contains a list of features and a label.

```
{
    "data": [
        {"features":[1,5,2.8,2.538,"This is a good product"],"label":1}},
        {"features":[0,1,0.79,0.475,"Bad shopping experience"],"label":0}}
    ]
}
```

An example analysis configuration for the previous input example dataset should set the following parameters:
+ The `label` parameter uses the [JMESPath](https://jmespath.org/) expression `data[*].label` to extract the ground truth label for each record in the dataset. The JMESPath expression should produce a list of labels where the ith label is for the ith record.
+ The `features` parameter uses the JMESPath expression `data[*].features` to extract the array of features for each record in the dataset. The JMESPath expression should produce a 2D array or matrix where the ith row contains the feature values for the ith record.

### Tabular dataset prerequisites in JSON Lines format
<a name="clarify-processing-job-data-format-tabular-prereq-jsonlines"></a>

JSON Lines is a text format for representing structured data where each line is a valid JSON object. Currently SageMaker Clarify processing jobs only support SageMaker AI Dense Format JSON Lines. To conform to the required format, all of the features of a record should be listed in a single JSON array. For more information about JSON Lines, see [JSONLINES request format](cdf-inference.md#cm-jsonlines).

**Note**  
All JSON Lines data files provided to the SageMaker Clarify processing job must be encoded in UTF-8 to ensure compatibility.

The following is an example of how to set an analysis configuration for a record that contains a **top-level key** and a **list** of elements. 

```
{"features":[1,5,2.8,2.538,"This is a good product"],"label":1}
{"features":[0,1,0.79,0.475,"Bad shopping experience"],"label":0}
...
```

The analysis configuration for the previous dataset example should set the parameters as follows:
+ To indicate the location of the ground truth label, the parameter `label` should be set to the JMESPath expression `label`.
+ To indicate the location of the array of features, the parameter `features` should be set to the JMESPath expression `features`.

The following is an example of how to set an analysis configuration for a record that contains a **top-level key** and a **nested key** that contains a **list** of elements. 

```
{"data":{"features":[1,5,2.8,2.538,"This is a good product"],"label":1}}
{"data":{"features":[0,1,0.79,0.475,"Bad shopping experience"],"label":0}}
...
```

The analysis configuration for the previous dataset example should set the parameters as follows, as shown in the sketch after this list:
+ The parameter `label` should be set to the JMESPath expression `data.label` to indicate the location of the ground truth label.
+ The parameter `features` should be set to the JMESPath expression `data.features` to indicate the location of the array of features.
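
The sketch referenced above follows; unrelated parameters are omitted.

```
# Sketch: JMESPath expressions for the nested JSON Lines records shown above.
jsonlines_dataset_config = {
    "dataset_type": "application/jsonlines",
    "label": "data.label",        # ground truth label within each line
    "features": "data.features",  # feature array within each line
}
```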

### Tabular dataset prerequisites in Parquet format
<a name="clarify-processing-job-data-format-tabular-prereq-parquet"></a>

[Parquet](https://parquet.apache.org/) is a column-oriented binary data format. Currently, SageMaker Clarify processing jobs support loading Parquet data files only when the processing instance count is `1`.

Because SageMaker Clarify processing jobs don’t support endpoint request or endpoint response in Parquet format, you must specify the data format of the endpoint request by setting the analysis configuration parameter `content_type` to a supported format. For more information, see `content_type` in [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The Parquet data must have column names that are formatted as strings. Use the analysis configuration `label` parameter to set the label column name to indicate the location of the ground truth labels. All other columns are designated as features.
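
Taken together, a Parquet dataset might be configured along the following lines; the label column header and model name are illustrative assumptions.

```
# Sketch: Parquet input dataset. Requests can't be sent in Parquet, so the
# request format is set explicitly to a supported type.
parquet_analysis_config = {
    "dataset_type": "application/x-parquet",
    "label": "Label",              # assumption: your label column header
    "predictor": {
        "model_name": "my-model",  # illustrative name
        "content_type": "text/csv",
    },
}
```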

# Endpoint requests for tabular data
<a name="clarify-processing-job-data-format-tabular-request"></a>

To obtain model predictions for post-training bias analysis and feature importance analysis, SageMaker Clarify processing jobs serialize the tabular data into bytes and send these to an inference endpoint as a request payload. This tabular data is either sourced from the input dataset or generated synthetically by the explainer for SHAP or PDP analysis.

The data format of the request payload should be specified by the analysis configuration `content_type` parameter. If the parameter is not provided, the SageMaker Clarify processing job will use the value of the `dataset_type` parameter as the content type. For more information about `content_type` or `dataset_type`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The following sections show example endpoint requests in CSV and JSON Lines formats.

## Endpoint request in CSV format
<a name="clarify-processing-job-data-format-tabular-request-csv"></a>

The SageMaker Clarify processing job can serialize data to CSV format (MIME type: `text/csv`). The following table shows examples of the serialized request payloads.


| Endpoint request payload (string representation) | Comments | 
| --- | --- | 
|  '1,2,3,4'  |  Single record (four numerical features).  | 
|  '1,2,3,4\n5,6,7,8'  |  Two records, separated by the line break '\n'.  | 
|  '"This is a good product",5'  |  Single record (a text feature and a numerical feature).  | 
|  '"This is a good product",5\n"Bad shopping experience",1'  |  Two records.  | 

## Endpoint request in JSON Lines format
<a name="clarify-processing-job-data-format-tabular-request-jsonlines"></a>

The SageMaker Clarify processing job can serialize data to SageMaker AI JSON Lines dense format (MIME type: `application/jsonlines`). For more information about JSON Lines, see [JSONLINES request format](cdf-inference.md#cm-jsonlines).

To transform tabular data into JSON data, provide a template string to the analysis configuration `content_template` parameter. For more information about `content_template` see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md). The following table shows examples of serialized JSON Lines request payloads.


| Endpoint request payload (string representation) | Comments | 
| --- | --- | 
|  '\$1"data":\$1"features":[1,2,3,4]\$1\$1'  |  Single record. In this case, the template looks like `'{"data":{"features":$features}}' `and `$features` is replaced by the list of features `[1,2,3,4]`.  | 
|  '\$1"data":\$1"features":[1,2,3,4]\$1\$1\$1n\$1"data":\$1"features":[5,6,7,8]\$1\$1'  |  Two records.  | 
|  '\$1"features":["This is a good product",5]\$1'  |  Single record. In this case, the template looks like `'{"features":$features}'` and \$1features is replaced by the list of features `["This is a good product",5]`.  | 
|  '\$1"features":["This is a good product",5]\$1\$1n\$1"features":["Bad shopping experience",1]\$1'  |  Two records.  | 

## Endpoint request in JSON format
<a name="clarify-processing-job-data-format-tabular-request-json"></a>

A SageMaker Clarify processing job can serialize data to arbitrary JSON structures (MIME type: `application/json`). To do this, you must provide a template string to the analysis configuration `content_template` parameter. This is used by the SageMaker Clarify processing job to construct the outer JSON structure. You must also provide a template string for `record_template`, which is used to construct the JSON structure for each record. For more information about `content_template` and `record_template`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md). 

**Note**  
Because `content_template` and `record_template` are string parameters, any double quote characters (`"`) that are part of the JSON serialized structure should be noted as an escaped character in your configuration. For example, if you want to escape a double quote in Python, you could enter the following for `content_template`.  

```
"{\"data\":{\"features\":$record}}}"
```

The following table shows examples of serialized JSON request payloads and the corresponding `content_template` and `record_template` parameters that are required to construct them.


| Endpoint request payload (string representation) | Comments | `content_template` | `record_template` | 
| --- | --- | --- | --- | 
|  '{"data":{"features":[1,2,3,4]}}'  |  Single record at a time.  |  '{"data":{"features":$record}}'  |  "$features"  | 
|  '{"instances":[[0, 1], [3, 4]], "feature-names": ["A", "B"]}'  |  Multi-records with feature names.  |  '{"instances":$records, "feature-names":$feature_names}'  |  "$features"  | 
|  '[{"A": 0, "B": 1}, {"A": 3, "B": 4}]'  |  Multi-records and key-value pairs.  |  "$records"  |  "$features_kvp"  | 
|  '{"A": 0, "B": 1}'  |  Single record at a time and key-value pairs.  |  "$record"  |  "$features_kvp"  | 
|  '{"A": 0, "nested": {"B": 1}}'  |  Alternatively, use the fully verbose `record_template` for arbitrary structures.  |  "$record"  |  '{"A": ${A}, "nested": {"B": ${B}}}'  | 

# Endpoint response for tabular data
<a name="clarify-processing-job-data-format-tabular-response"></a>

After the SageMaker Clarify processing job receives an inference endpoint invocation's response, it deserializes the response payload and extracts predictions from it. Use the analysis configuration `accept_type` parameter to specify the data format of the response payload. If `accept_type` is not provided, the SageMaker Clarify processing job uses the value of the `content_type` parameter as the model output format. For more information about `accept_type`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The predictions could either consist of predicted labels for bias analysis, or probability values (scores) for feature importance analysis. In the `predictor` analysis configuration, the following three parameters extract the predictions.
+ The parameter `probability` is used to locate the probability values (scores) in the endpoint response.
+ The parameter `label` is used to locate the predicted labels in the endpoint response.
+ (Optional) The parameter `label_headers` provides the predicted labels for a multiclass model.

The following guidelines pertain to endpoint responses in CSV, JSON Lines, and JSON formats.

## Endpoint response in CSV format
<a name="clarify-processing-job-data-format-tabular-reponse-csv"></a>

If the response payload is in CSV format (MIME type: `text/csv`), the SageMaker Clarify processing job deserializes each row. It then extracts the predictions from the deserialized data using the column indexes provided in the analysis configuration. The rows in the response payload must match the records in the request payload. 

The following tables provide examples of response data in different formats and for different problem types. Your data can vary from these examples, as long as the predictions can be extracted according to the analysis configuration.

The following sections show example endpoint responses in CSV formats.

### Endpoint response is in CSV format and contains probability only
<a name="clarify-processing-job-data-format-tabular-reponse-csv-prob"></a>

The following table is an example endpoint response for regression and binary classification problems.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record.  |  '0.6'  | 
|  Two records (results in one line, divided by comma).  |  '0.6,0.3'  | 
|  Two records (results in two lines).  |  '0.6\n0.3'  | 

For the previous example, the endpoint outputs a single probability value (score) for the predicted label. To extract probabilities using the column index and use them for feature importance analysis, set the analysis configuration parameter `probability` to column index `0`. These probabilities can also be used for bias analysis if they're converted to binary values using the `probability_threshold` parameter. For more information about `probability_threshold`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
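
For example, the settings just described might look like the following sketch; the threshold value is an illustrative assumption.

```
# Sketch: a single score per record at column index 0; probability_threshold
# (illustrative value) binarizes the scores for bias analysis.
analysis_config = {
    "probability_threshold": 0.8,   # assumption: pick a threshold for your use case
    "predictor": {
        "accept_type": "text/csv",
        "probability": 0,           # score is in column index 0
    },
}
```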

The following table is an example endpoint response for a multiclass problem.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record of a multiclass model (three classes).  |  '0.1,0.6,0.3'  | 
|  Two records of a multiclass model (three classes).  |  '0.1,0.6,0.3\n0.2,0.5,0.3'  | 

For the previous example, the endpoint outputs a list of probabilities (scores). If no index is provided, all values are extracted and used for feature importance analysis. If the analysis configuration parameter `label_headers` is provided, the SageMaker Clarify processing job can select the label header with the maximum probability as the predicted label, which can be used for bias analysis. For more information about `label_headers`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

### Endpoint response is in CSV format and contains predicted label only
<a name="clarify-processing-job-data-format-tabular-reponse-csv-pred"></a>

The following table is an example endpoint response for regression and binary classification problems.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '1'  | 
|  Two records (results in one line, divided by comma)  |  '1,0'  | 
|  Two records (results in two lines)  |  '1\n0'  | 

For the previous example, the endpoint outputs the predicted label instead of probability. Set the `label` parameter of the `predictor` configuration to column index `0` so that the predicted labels can be extracted using the index and used for bias analysis.

### Endpoint response is in CSV format and contains predicted label and probability
<a name="clarify-processing-job-data-format-tabular-reponse-csv-pred-prob"></a>

The following table is an example endpoint response for regression and binary classification problems.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '1,0.6'  | 
|  Two records  |  '1,0.6\n0,0.3'  | 

For the previous example, the endpoint outputs the predicted label followed by its probability. Set the `label` parameter of the `predictor` configuration to column index `0`, and set `probability` to column index `1` to extract both parameter values.

### Endpoint response is in CSV format and contains predicted labels and probabilities (multiclass)
<a name="clarify-processing-job-data-format-tabular-reponse-csv-preds-probs"></a>

A multiclass model trained by Amazon SageMaker Autopilot can be configured to output the string representation of a list of predicted labels and probabilities. The following table shows an example endpoint response from a model that is configured to output `predicted_label`, `probability`, `labels`, and `probabilities`.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '"dog",0.6,"[\$1'cat\$1', \$1'dog\$1', \$1'fish\$1']","[0.1, 0.6, 0.3]"'  | 
|  Two records  |  '"dog",0.6,"[\$1'cat\$1', \$1'dog\$1', \$1'fish\$1']","[0.1, 0.6, 0.3]"\$1n""cat",0.7,[\$1'cat\$1', \$1'dog\$1', \$1'fish\$1']","[0.7, 0.2, 0.1]"'  | 

For the previous example, the SageMaker Clarify processing job can be configured in the following ways to extract the predictions.

For bias analysis, the previous example can be configured as **one** of the following.
+ Set the `label` parameter of the `predictor` configuration to `0` to extract the predicted label.
+ Set `label` to `2` to extract the predicted labels, and set `probability` to `3` to extract the corresponding probabilities, as shown in the sketch after this list. The SageMaker Clarify processing job can automatically determine the predicted label by identifying the label with the highest probability value. Referring to the previous example of a single record, the model predicts three labels, `cat`, `dog`, and `fish`, with corresponding probabilities of `0.1`, `0.6`, and `0.3`. Based on these probabilities, the predicted label is `dog`, because it has the highest probability value of `0.6`.
+ Set `probability` to `3` to extract the probabilities. If `label_headers` is provided, then the SageMaker Clarify processing job can automatically determine the predicted label by identifying the label header with the highest probability value.

For feature importance analysis, the previous example can be configured as follows.
+ Set `probability` to `3` to extract the probabilities of all the predicted labels. Then, feature attributions are computed for all the labels. If you don't specify `label_headers`, the predicted labels are used as label headers in the analysis report.
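
The following sketch illustrates the second bias-analysis option above; other parameters are omitted.

```
# Sketch: predicted labels in column index 2, probabilities in column index 3.
# Clarify selects the label with the highest probability as the predicted label.
predictor_config = {
    "accept_type": "text/csv",
    "label": 2,
    "probability": 3,
}
```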

## Endpoint response in JSON Lines format
<a name="clarify-processing-job-data-format-tabular-reponse-jsonlines"></a>

If the response payload is in JSON Lines format (MIME type: `application/jsonlines`), the SageMaker Clarify processing job deserializes each line as JSON. It then extracts predictions from the deserialized data using JMESPath expressions provided in the analysis configuration. The lines in the response payload must match the records in the request payload. The following tables show examples of response data in different formats. Your data can vary from these examples, as long as the predictions can be extracted according to the analysis configuration.

The following sections show example endpoint responses in JSON Lines formats.

### Endpoint response is in JSON Lines format and contains probability only
<a name="clarify-processing-job-data-format-tabular-reponse-jsonlines-prob"></a>

The following table is an example endpoint response that only outputs the probability value (score).


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"score":0.6\$1'  | 
|  Two records  |  '\$1"score":0.6\$1\$1n\$1"score":0.3\$1'  | 

For the previous example, set the analysis configuration parameter `probability` to the JMESPath expression `score` to extract its value.

### Endpoint response is in JSON Lines format and contains predicted label only
<a name="clarify-processing-job-data-format-tabular-reponse-jsonlines-pred"></a>

The following table is an example endpoint response that only outputs the predicted label. 


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"prediction":1\$1'  | 
|  Two records  |  '\$1"prediction":1\$1\$1n\$1"prediction":0\$1'  | 

For the previous example, set the `label` parameter of the predictor configuration to JMESPath expression `prediction`. Then, the SageMaker Clarify processing job can extract the predicted labels for bias analysis. For more information, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

### Endpoint response is in JSON Lines format and contains predicted label and probability
<a name="clarify-processing-job-data-format-tabular-reponse-jsonlines-pred-prob"></a>

The following table is an example endpoint response that outputs the predicted label and its score.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"prediction":1,"score":0.6\$1'  | 
|  Two records  |  '\$1"prediction":1,"score":0.6\$1\$1n\$1"prediction":0,"score":0.3\$1'  | 

For the previous example, set the `label` parameter of the `predictor` configuration to the JMESPath expression `prediction` to extract the predicted labels. Set `probability` to the JMESPath expression `score` to extract the probability. For more information, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
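
These settings might appear in the predictor configuration as in the following sketch.

```
# Sketch: JMESPath expressions for the JSON Lines response shown above.
predictor_config = {
    "accept_type": "application/jsonlines",
    "label": "prediction",   # extracts the predicted label from each line
    "probability": "score",  # extracts the probability from each line
}
```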

### Endpoint response is in JSON Lines format and contains predicted labels and probabilities (multiclass)
<a name="clarify-processing-job-data-format-tabular-reponse-jsonlines-preds-probs"></a>

The following table is an example endpoint response from a multiclass model that outputs the following:
+ A list of predicted labels and their probabilities.
+ The selected predicted label and its probability.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"predicted\$1label":"dog","probability":0.6,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]\$1'  | 
|  Two records  |  '\$1"predicted\$1label":"dog","probability":0.6,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]\$1\$1n\$1"predicted\$1label":"cat","probability":0.7,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.7,0.2,0.1]\$1'  | 

For the previous example, the SageMaker Clarify processing job can be configured in several ways to extract the predictions.

For bias analysis, the previous example can be configured as **one** of the following.
+ Set the `label` parameter of the `predictor` configuration to the JMESPath expression `predicted_label` to extract the predicted label.
+ Set `label` to the JMESPath expression `predicted_labels` to extract the predicted labels, and set `probability` to the JMESPath expression `probabilities` to extract their probabilities. The SageMaker Clarify processing job automatically determines the predicted label by identifying the label with the highest probability value.
+ Set `probability` to the JMESPath expression `probabilities` to extract the probabilities. If `label_headers` is provided, then the SageMaker Clarify processing job can automatically determine the predicted label by identifying the label with the highest probability value.

For feature importance analysis, do the following.
+ Set `probability` to the JMESPath expression `probabilities` to extract the probabilities of all the predicted labels. Then, feature attributions are computed for all the labels.

## Endpoint response in JSON format
<a name="clarify-processing-job-data-format-tabular-reponse-json"></a>

If the response payload is in JSON format (MIME type: `application/json`), the SageMaker Clarify processing job deserializes the entire payload as JSON. It then extracts predictions from the deserialized data using JMESPath expressions provided in the analysis configuration. The records in the response payload must match the records in the request payload. 

The following sections show example endpoint responses in JSON formats. The sections contain tables with examples of response data in different formats and for different problem types. Your data can vary from these examples, as long as the predictions can be extracted according to the analysis configuration.

### Endpoint response is in JSON format and contains probability only
<a name="clarify-processing-job-data-format-tabular-reponse-json-prob"></a>

The following table is an example response from an endpoint that only outputs the probability value (score).


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '[0.6]'  | 
|  Two records  |  '[0.6,0.3]'  | 

For the previous example, there is no line break in the response payload. Instead, a single JSON object contains a list of scores, one for each record in the request. Set the analysis configuration parameter `probability` to the JMESPath expression `[*]` to extract the values.

### Endpoint response is in JSON format and contains predicted label only
<a name="clarify-processing-job-data-format-tabular-reponse-json-pred"></a>

The following table is an example response from an endpoint that only outputs the predicted label.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"predicted\$1labels":[1]\$1'  | 
|  Two records  |  '\$1"predicted\$1labels":[1,0]\$1'  | 

Set the `label` parameter of the `predictor` configuration to the JMESPath expression `predicted_labels`, and then the SageMaker Clarify processing job can extract the predicted labels for bias analysis.

### Endpoint response is in JSON format and contains predicted label and probability
<a name="clarify-processing-job-data-format-tabular-reponse-json-pred-prob"></a>

The following table is an example response from an endpoint that outputs the predicted label and its score.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '\$1"predictions":[\$1"label":1,"score":0.6\$1'  | 
|  Two records  |  ‘\$1"predictions":[\$1"label":1,"score":0.6\$1,\$1"label":0,"score":0.3\$1]\$1'  | 

For the previous example, set the `label` parameter of the `predictor` configuration to the JMESPath expression `predictions[*].label` to extract the predicted labels. Set `probability` to the JMESPath expression `predictions[*].score` to extract the probabilities.
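
A sketch of these settings follows; other parameters are omitted.

```
# Sketch: JMESPath expressions for the JSON response shown above.
predictor_config = {
    "accept_type": "application/json",
    "label": "predictions[*].label",        # one predicted label per record
    "probability": "predictions[*].score",  # one probability per record
}
```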

### Endpoint response is in JSON format and contains predicted labels and probabilities (multiclass)
<a name="clarify-processing-job-data-format-tabular-reponse-json-preds-probs"></a>

The following table is an example response from a multiclass model that outputs the following:
+ A list of predicted labels and their probabilities.
+ The selected predicted label and its probability.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single record  |  '[\$1"predicted\$1label":"dog","probability":0.6,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]\$1]'  | 
|  Two records  |  '[\$1"predicted\$1label":"dog","probability":0.6,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]\$1,\$1"predicted\$1label":"cat","probability":0.7,"predicted\$1labels":["cat","dog","fish"],"probabilities":[0.7,0.2,0.1]\$1]'  | 

The SageMaker Clarify processing job can be configured in several ways to extract the predictions.

For bias analysis, the previous example can be configured as **one** of the following.
+ Set the `label` parameter of the `predictor` configuration to the JMESPath expression `[*].predicted_label` to extract the predicted label.
+ Set `label` to the JMESPath expression `[*].predicted_labels` to extract the predicted labels, and set `probability` to the JMESPath expression `[*].probabilities` to extract their probabilities. The SageMaker Clarify processing job can automatically determine the predicted label by identifying the label with the highest probability value.
+ Set `probability` to the JMESPath expression `[*].probabilities` to extract the probabilities. If `label_headers` is provided, then the SageMaker Clarify processing job can automatically determine the predicted label by identifying the label with the highest probability value.

For feature importance analysis, set `probability` to the JMESPath expression `[*].probabilities` to extract the probabilities of all the predicted labels. Then, feature attributions are computed for all the labels.

# Pre-check endpoint request and response for tabular data
<a name="clarify-processing-job-data-format-tabular-precheck"></a>

We recommend that you deploy your model to a SageMaker AI real-time inference endpoint, and send requests to the endpoint. Manually examine the requests and responses to make sure that both are compliant with the requirements in the [Endpoint requests for tabular data](clarify-processing-job-data-format-tabular-request.md) section and the [Endpoint response for tabular data](clarify-processing-job-data-format-tabular-response.md) section. If your model container supports batch requests, you can start with a single record request, and then try two or more records.

The following command shows how to send a request to an endpoint using the AWS CLI. The AWS CLI is pre-installed in SageMaker Studio and SageMaker Notebook instances. To install the AWS CLI, follow this [installation guide](https://aws.amazon.com/cli/).

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type $CONTENT_TYPE \
  --accept $ACCEPT_TYPE \
  --body $REQUEST_DATA \
  $CLI_BINARY_FORMAT \
  /dev/stderr 1>/dev/null
```

The parameters are defined as follows.
+ `$ENDPOINT_NAME` – The name of the endpoint.
+ `$CONTENT_TYPE` – The MIME type of the request (model container input).
+ `$ACCEPT_TYPE` – The MIME type of the response (model container output).
+ `$REQUEST_DATA` – The request payload string.
+ `$CLI_BINARY_FORMAT` – The format of the command line interface (CLI) parameter. For AWS CLI v1, this parameter should remain blank. For v2, this parameter should be set to `--cli-binary-format raw-in-base64-out`.

**Note**  
AWS CLI v2 passes binary parameters as base64-encoded strings [by default](https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration.html#cliv2-migration-binaryparam).

# AWS CLI v1 examples
<a name="clarify-processing-job-data-format-tabular-precheck-cli-v1-examples"></a>

The generic command in the preceding section works with both AWS CLI versions. The following request and response examples to and from the endpoint use AWS CLI v1, so they omit the `--cli-binary-format` option.

## Endpoint request and response in CSV format
<a name="clarify-processing-job-data-format-tabular-precheck-csv"></a>

In the following code example, the request consists of a single record and the response is its probability value.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-sagemaker-xgboost-model \
  --content-type text/csv \
  --accept text/csv \
  --body '1,2,3,4' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
0.6
```

In the following code example, the request consists of two records, and the response includes their probabilities, which are separated by a comma.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-sagemaker-xgboost-model \
  --content-type text/csv \
  --accept text/csv \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the `$'content'` expression in `--body` tells the shell to interpret `\n` in the content as a line break. The response output follows.

```
0.6,0.3
```

In the following code example, the request consists of two records, the response includes their probabilities, separated with a line break.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-1 \
  --content-type text/csv \
  --accept text/csv \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
0.6
0.3
```

In the following code example, the request consists of a single record, and the response is probability values from a multiclass model containing three classes.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-1 \
  --content-type text/csv \
  --accept text/csv \
  --body '1,2,3,4' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
0.1,0.6,0.3
```

In the following code example, the request consists of two records, and the response includes their probability values from a multiclass model containing three classes.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-1 \
  --content-type text/csv \
  --accept text/csv \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
0.1,0.6,0.3
0.2,0.5,0.3
```

In the following code example, the request consists of two records, and the response includes predicted label and probability.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-2 \
  --content-type text/csv \
  --accept text/csv \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
1,0.6
0,0.3
```

In the following code example, the request consists of two records and the response includes label headers and probabilities.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-3 \
  --content-type text/csv \
  --accept text/csv \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
"['cat','dog','fish']","[0.1,0.6,0.3]"
"['cat','dog','fish']","[0.2,0.5,0.3]"
```

## Endpoint request and response in JSON Lines format
<a name="clarify-processing-job-data-format-tabular-precheck-jsonlines"></a>

In the following code example, the request consists of a single record and the response is its probability value.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-jsonlines \
  --content-type application/jsonlines \
  --accept application/jsonlines \
  --body '{"features":["This is a good product",5]}' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"score":0.6}
```

In the following code example, the request contains two records, and the response includes predicted label and probability.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-jsonlines-2 \
  --content-type application/jsonlines \
  --accept application/jsonlines \
  --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"predicted_label":1,"probability":0.6}
{"predicted_label":0,"probability":0.3}
```

In the following code example, the request contains two records, and the response includes label headers and probabilities.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-jsonlines-3 \
  --content-type application/jsonlines \
  --accept application/jsonlines \
  --body $'{"data":{"features":[1,2,3,4]}}\n{"data":{"features":[5,6,7,8]}}' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}
{"predicted_labels":["cat","dog","fish"],"probabilities":[0.2,0.5,0.3]}
```

## Endpoint request and response in mixed formats
<a name="clarify-processing-job-data-format-tabular-precheck-diff"></a>

In the following code example, the request is in CSV format and the response is in JSON Lines format.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-in-jsonlines-out \
  --content-type text/csv \
  --accept application/jsonlines \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"probability":0.6}
{"probability":0.3}
```

In the following code example, the request is in JSON Lines format and the response is in CSV format.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-jsonlines-in-csv-out \
  --content-type application/jsonlines \
  --accept text/csv \
  --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
0.6
0.3
```

In the following code example, the request is in CSV format and the response is in JSON format.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-csv-in-json-out \
  --content-type text/csv \
  --accept application/json \
  --body $'1,2,3,4\n5,6,7,8' \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"predictions":[{"label":1,"score":0.6},{"label":0,"score":0.3}]}
```

# Image data requirements
<a name="clarify-processing-job-data-format-image"></a>

A SageMaker Clarify processing job provides support for explaining images. This topic provides the data format requirements for image data. For information about processing the image data, see [Analyze image data for computer vision explainability](clarify-processing-job-run.md#clarify-processing-job-run-cv).

An image dataset contains one or more image files. To identify an input dataset to the SageMaker Clarify processing job, set either a [ProcessingInput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html#sagemaker-CreateProcessingJob-request-ProcessingInputs) named `dataset` or the analysis configuration `dataset_uri` parameter to an Amazon S3 URI prefix of your image files.

The supported image file formats and file extensions are listed in the following table.


| Image format | File extension | 
| --- | --- | 
|  JPEG  |  jpg, jpeg  | 
|  PNG  |  png  | 

Set the analysis configuration `dataset_type` parameter to `application/x-image`. Because this type is not a specific image file format, the `content_type` parameter is used to decide the image file format and extension.

The SageMaker Clarify processing job loads each image file to a 3-dimensional [NumPy array](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) for further processing. The three dimensions include height, width, and RGB values of each pixel.

## Endpoint request format
<a name="clarify-processing-job-data-format-image-request"></a>

The SageMaker Clarify processing job converts the raw RGB data of an image into a compatible image format, such as JPEG. It does this before it sends the data to the endpoint for predictions. The supported image formats are as follows.


| Data Format | MIME type | File extension | 
| --- | --- | --- | 
|  JPEG  |  `image/jpeg`  |  jpg, jpeg  | 
|  PNG  |  `image/png`  |  png  | 
|  NPY  |  `application/x-npy`  |  All of the above  | 

Specify the data format of the request payload by using the analysis configuration parameter `content_type`. If the `content_type` is not provided, the data format defaults to `image/jpeg`.
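
A minimal sketch of these image settings follows; the model name is an illustrative assumption.

```
# Sketch: explain an image dataset, sending PNG request payloads to the model.
image_analysis_config = {
    "dataset_type": "application/x-image",
    "predictor": {
        "model_name": "my-image-model",  # illustrative name
        "content_type": "image/png",     # defaults to image/jpeg if omitted
    },
}
```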

## Endpoint response format
<a name="clarify-processing-job-data-format-image-response"></a>

Upon receiving the response of an inference endpoint invocation, the SageMaker Clarify processing job deserializes the response payload and then extracts the predictions from it.

### Image classification problem
<a name="clarify-processing-job-data-format-image-response-class"></a>

The data format of the response payload should be specified by the analysis configuration parameter `accept_type`. If `accept_type` is not provided, the data format defaults to `application/json`. The supported formats are the same as those described in the **Endpoint response for tabular data** section.

See [Inference with the Image Classification Algorithm](image-classification.md#IC-inference) for an example of a SageMaker AI built-in image classification algorithm that accepts a single image and then returns an array of probability values (scores), each for a class.

As shown in the following table, when the `accept_type` parameter is set to `application/jsonlines`, the response is a JSON object.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single image  |  '\$1"prediction":[0.1,0.6,0.3]\$1'  | 

In the previous example, set the `probability` parameter to the JMESPath expression `prediction` to extract the scores.

When the `accept_type` parameter is set to `application/json`, the response is a JSON array, as shown in the following table.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single image  |  '[0.1,0.6,0.3]'  | 

In the previous example, set `probability` to the JMESPath expression `[*]` to extract all the elements of the array; in this case, `[0.1,0.6,0.3]` is extracted. Alternatively, if you skip setting the `probability` parameter, all the elements of the array are still extracted, because the entire payload is deserialized as the predictions.

### Object detection problem
<a name="clarify-processing-job-data-format-object-response-class"></a>

The analysis configuration `accept_type` defaults to `application/json` and the only supported format is the Object Detection Inference Format. For more information about response formats, see [Response Formats](object-detection-in-formats.md#object-detection-recordio).

The following table is an example response from an endpoint that outputs an array. Each element of the array is an array of values containing the class index, the confidence score, and the bounding box coordinates of the detected object.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single image (one object)  |  '[[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244]]'  | 
|  Single image (two objects)  |  '[[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244],[0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571, 0.9712159633636475]]'  | 

The following table is an example response from an endpoint that outputs a JSON object with a key referring to the array. Set the analysis configuration parameter `probability` to the key `prediction` to extract the values.


| Endpoint request payload | Endpoint response payload (string representation) | 
| --- | --- | 
|  Single image (one object)  |  '\$1"prediction":[[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244]]\$1'  | 
|  Single image (two objects)  |  '\$1"prediction":[[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244],[0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571, 0.9712159633636475]]\$1'  | 
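
For the keyed response above, the predictor settings might look like the following sketch.

```
# Sketch: the detection arrays sit under the "prediction" key of the response.
predictor_config = {
    "accept_type": "application/json",
    "probability": "prediction",  # JMESPath key holding the detections
}
```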

## Pre-check endpoint request and response for image data
<a name="clarify-processing-job-data-format-object-precheck"></a>

We recommend that you deploy your model to a SageMaker AI real-time inference endpoint, and send requests to the endpoint. Manually examine the requests and responses to make sure that both are compliant with the requirements in the preceding **Endpoint request format** and **Endpoint response format** sections.

The following are two code examples showing how to send requests and examine the responses for both image classification and object detection problems.

### Image classification problem
<a name="clarify-processing-job-data-format-object-precheck-class"></a>

The following example code sends a PNG file to an endpoint, which classifies it.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-sagemaker-image-classification \
  --content-type "image/png" \
  --accept "application/json" \
  --body fileb://./test.png  \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
[0.1,0.6,0.3]
```

### Object detection problem
<a name="clarify-processing-job-data-format-object-precheck-object"></a>

The following example code sends a JPEG file to an endpoint, which detects the objects in it.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-sagemaker-object-detection \
  --content-type "image/jpg" \
  --accept "application/json" \
  --body fileb://./test.jpg  \
  /dev/stderr 1>/dev/null
```

From the previous code example, the response output follows.

```
{"prediction":[[4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244],[0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571, 0.9712159633636475],[4.0, 0.32643985450267792, 0.3677481412887573, 0.034883320331573486, 0.6318609714508057, 0.5967587828636169],[8.0, 0.22552496790885925, 0.6152569651603699, 0.5722782611846924, 0.882301390171051, 0.8985623121261597],[3.0, 0.42260299175977707, 0.019305512309074402, 0.08386176824569702, 0.39093565940856934, 0.9574796557426453]]}
```

# Time series data
<a name="clarify-processing-job-data-format-time-series"></a>

Time series data refers to data that can be loaded into a three-dimensional data frame. In the frame, each row represents a record for one item at a given timestamp, and each record has one or more related columns. The values within each data frame cell can be of numerical, categorical, or text data types.

## Time series dataset prerequisites
<a name="clarify-processing-job-data-format-time-series-prereq"></a>

Prior to analysis, complete the necessary preprocessing steps to prepare your data, such as data cleaning or feature engineering. You can provide one or multiple datasets. If you provide multiple datasets, use one of the following methods to supply them to the SageMaker Clarify processing job:
+ Use either a [ProcessingInput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html) named `dataset` or the analysis configuration `dataset_uri` to specify the main dataset. For more information about `dataset_uri`, see the parameters list in [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
+ Use the `baseline` parameter provided in the analysis configuration file. The baseline dataset is required for `static_covariates`, if present. For more information about the analysis configuration file, including examples, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The following table lists supported data formats, their file extensions, and MIME types.


| Data format | File extension | MIME type | 
| --- | --- | --- | 
|  `item_records`  |  json  |  `application/json`  | 
|  `timestamp_records`  |  json  |  `application/json`  | 
|  `columns`  |  json  |  `application/json`  | 

JSON is a flexible format that can represent any level of complexity in your structured data. As shown in the table, SageMaker Clarify supports formats `item_records`, `timestamp_records`, and `columns`.

## Time series dataset config examples
<a name="clarify-processing-job-data-format-time-series-ex"></a>

This section shows you how to set an analysis configuration using `time_series_data_config` for time series data in JSON format. Suppose you have a dataset with two items, each with timestamps (t), a target time series (x), two related time series (r), and two static covariates (u), as follows:

t1 = [0,1,2], t2 = [2,3]

x1 = [5,6,4], x2 = [0,4]

r11 = [0,1,0], r21 = [1,1]

r12 = [0,0,0], r22 = [1,0]

u11 = -1, u21 = 0

u12 = 1, u22 = 2

You can encode the dataset using `time_series_data_config` in three different ways, depending on `dataset_format`. The following sections describe each method.

### Time series data config when `dataset_format` is `columns`
<a name="clarify-processing-job-data-format-time-series-columns"></a>

The following example uses the `columns` value for `dataset_format`. The following JSON file represents the preceding dataset: `target_ts` holds the target values (x), `rts1` and `rts2` hold the related time series (r), and `scv1` and `scv2` hold the static covariates (u).

```
{
    "ids": [1, 1, 1, 2, 2],
    "timestamps": [0, 1, 2, 2, 3],
    "target_ts": [5, 6, 4, 0, 4],
    "rts1": [0, 1, 0, 1, 1],
    "rts2": [0, 0, 0, 1, 0],
    "scv1": [-1, -1, -1, 0, 0],
    "scv2": [1, 1, 1, 2, 2]
}
```

Note that the item IDs are repeated in the `ids` field. The corresponding `time_series_data_config` is as follows:

```
"time_series_data_config": {
    "item_id": "ids",
    "timestamp": "timestamps",
    "target_time_series": "target_ts",
    "related_time_series": ["rts1", "rts2"],
    "static_covariates": ["scv1", "scv2"],
    "dataset_format": "columns"
}
```

### Time series data config when `dataset_format` is `item_records`
<a name="clarify-processing-job-data-format-time-series-itemrec"></a>

The following example uses the `item_records` value for `dataset_format`. The following JSON file represents the dataset.

```
[
    {
        "id": 1,
        "scv1": -1,
        "scv2": 1,
        "timeseries": [
            {"timestamp": 0, "target_ts": 5, "rts1": 0, "rts2": 0},
            {"timestamp": 1, "target_ts": 6, "rts1": 1, "rts2": 0},
            {"timestamp": 2, "target_ts": 4, "rts1": 0, "rts2": 0}
        ]
    },
    {
        "id": 2,
        "scv1": 0,
        "scv2": 2,
        "timeseries": [
            {"timestamp": 2, "target_ts": 0, "rts1": 1, "rts2": 1},
            {"timestamp": 3, "target_ts": 4, "rts1": 1, "rts2": 0}
        ]
    }
]
```

Each item is represented as a separate entry in the JSON. The following snippet shows the corresponding `time_series_data_config` (which uses JMESPath). 

```
"time_series_data_config": {
    "item_id": "[*].id",
    "timestamp": "[*].timeseries[].timestamp",
    "target_time_series": "[*].timeseries[].target_ts",
    "related_time_series": ["[*].timeseries[].rts1", "[*].timeseries[].rts2"],
    "static_covariates": ["[*].scv1", "[*].scv2"],
    "dataset_format": "item_records"
}
```

### Time series data config when `dataset_format` is `timestamp_records`
<a name="clarify-processing-job-data-format-time-series-tsrec"></a>

The following example uses the `timestamp_records` value for `dataset_format`. The following JSON file represents the preceding dataset.

```
[
    {"id": 1, "timestamp": 0, "target_ts": 5, "rts1": 0, "rts2": 0, "svc1": -1, "svc2": 1},
    {"id": 1, "timestamp": 1, "target_ts": 6, "rts1": 1, "rts2": 0, "svc1": -1, "svc2": 1},
    {"id": 1, "timestamp": 2, "target_ts": 4, "rts1": 0, "rts2": 0, "svc1": -1, "svc2": 1},
    {"id": 2, "timestamp": 2, "target_ts": 0, "rts1": 1, "rts2": 1, "svc1": 0, "svc2": 2},
    {"id": 2, "timestamp": 3, "target_ts": 4, "rts1": 1, "rts2": 0, "svc1": 0, "svc2": 2},
]
```

Each entry of the JSON represents a single timestamp and corresponds to a single item. The corresponding `time_series_data_config` is as follows:

```
"time_series_data_config": {
    "item_id": "[*].id",
    "timestamp": "[*].timestamp",
    "target_time_series": "[*].target_ts",
    "related_time_series": ["[*].rts1", "[*].rts2"],
    "static_covariates": ["[*].scv1", "[*].scv2"],
    "dataset_format": "timestamp_records"
}
```

# Endpoint requests for time series data
<a name="clarify-processing-job-data-format-time-series-request-jsonlines"></a>

A SageMaker Clarify processing job serializes data into arbitrary JSON structures (MIME type: `application/json`). To do this, you must provide a template string to the analysis configuration `content_template` parameter. This is used by the SageMaker Clarify processing job to construct the JSON query provided to your model. `content_template` contains a record or multiple records from your dataset. You must also provide a template string for `record_template`, which is used to construct the JSON structure of each record. These records are then inserted into `content_template`. For more information about `content_template` or `record_template`, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

**Note**  
Because `content_template` and `record_template` are string parameters, any double quote characters (") that are part of the JSON serialized structure should be noted as an escaped character in your configuration. For example, if you want to escape a double quote in Python, you could enter the following value for `content_template`:  

```
"{\"instances\": $records}"
```

The following table shows examples of serialized JSON request payloads and the corresponding `content_template` and `record_template` parameters required to construct them.


| Use case | Endpoint request payload (string representation) | content\$1template | record\$1template | 
| --- | --- | --- | --- | 
|  Single record at a time  |  `{"target": [1, 2, 3],"start": "2024-01-01 01:00:00"}`  |  `'$record'`  |  `'{"start": $start_time, "target": $target_time_series}'`  | 
|  Single record with `$related_time_series` and `$static_covariates`  |  `{"target": [1, 2, 3],"start": "2024-01-01 01:00:00","dynamic_feat": [[1.0, 2.0, 3.0],[1.0, 2.0, 3.0],"cat": [0,1]}`  |  `'$record'`  |  `'{"start": $start_time, "target": $target_time_series, "dynamic_feat": $related_time_series, "cat": $static_covariates}'`  | 
|  Multi-records  |  `{"instances": [{"target": [1, 2, 3],"start": "2024-01-01 01:00:00"}, {"target": [1, 2, 3],"start": "2024-01-01 02:00:00"}]}`  |  `'{"instances": $records}'`  |  `'{"start": $start_time, "target": $target_time_series}'`  | 
|  Multi-records with `$related_time_series` and `$static_covariates`  |  `{"instances": [{"target": [1, 2, 3],"start": "2024-01-01 01:00:00","dynamic_feat": [[1.0, 2.0, 3.0],[1.0, 2.0, 3.0]],"cat": [0,1]}, {"target": [1, 2, 3],"start": "2024-01-01 02:00:00","dynamic_feat": [[1.0, 2.0, 3.0],[1.0, 2.0, 3.0]],"cat": [0,1]}]}`  |  `'{"instances": $records}'`  |  `'{"start": $start_time, "target": $target_time_series, "dynamic_feat": $related_time_series, "cat": $static_covariates}'`  | 
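
You can reproduce this two-step substitution locally to check a template pair before you run a job. The following sketch uses Python's `string.Template`, whose `$placeholder` syntax matches the template strings above; the sample values and the wrapping of `$records` in a JSON array are assumptions for illustration, not Clarify's internal implementation.

```
import json
from string import Template

# Hypothetical per-record values, already rendered as JSON fragments.
items = [
    {"start_time": '"2024-01-01 01:00:00"', "target_time_series": "[1, 2, 3]"},
    {"start_time": '"2024-01-01 02:00:00"', "target_time_series": "[1, 2, 3]"},
]

record_template = Template('{"start": $start_time, "target": $target_time_series}')
content_template = Template('{"instances": $records}')

# Render each record, then insert the joined records into the content template.
records = "[" + ", ".join(record_template.substitute(item) for item in items) + "]"
payload = content_template.substitute(records=records)

print(json.loads(payload))  # parses as valid JSON, matching the multi-record row above
```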

# Endpoint response for time series data
<a name="clarify-processing-job-data-format-time-series-response-json"></a>

The SageMaker Clarify processing job deserializes the entire payload as JSON. It then extracts predictions from the deserialized data using JMESPath expressions provided in the analysis configuration. The records in the response payload must match the records in the request payload.

The following table shows an example response from an endpoint that outputs only the mean prediction value. Provide the value of `forecast` used in the `predictor` field of the [analysis config](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-analysis.html#clarify-processing-job-configure-analysis-parameters) as a JMESPath expression that locates the prediction result for the processing job.


| Endpoint request payload | Endpoint response payload (string representation) | JMESPath expression for forecast in the analysis config | 
| --- | --- | --- | 
|  Single record example. Config should be `TimeSeriesModelConfig(forecast="prediction.mean")` to extract the prediction properly.  |  `'{"prediction": {"mean": [1, 2, 3, 4, 5]}}'`  |  `'prediction.mean'`  | 
|  Multiple records. An AWS DeepAR endpoint response.  |  `'{"predictions": [{"mean": [1, 2, 3, 4, 5]}, {"mean": [1, 2, 3, 4, 5]}]}'`  |  `'predictions[*].mean'`  | 
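
Because the expression syntax is standard JMESPath, you can sanity-check an expression against a sample response locally with the `jmespath` Python package before you configure the job. A minimal sketch using the multi-record response above:

```
import jmespath

# Example DeepAR-style batch response after JSON deserialization.
response = {"predictions": [{"mean": [1, 2, 3, 4, 5]}, {"mean": [1, 2, 3, 4, 5]}]}

# The same expression you would provide for forecast in the analysis config.
forecasts = jmespath.search("predictions[*].mean", response)
print(forecasts)  # [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
```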

# Pre-check endpoint request and response for time series data
<a name="clarify-processing-job-data-format-time-series-precheck"></a>

We recommend that you deploy your model to a SageMaker AI real-time inference endpoint and send requests to the endpoint. Manually examine the requests and responses to make sure that both comply with the requirements in the [Endpoint requests for time series data](clarify-processing-job-data-format-time-series-request-jsonlines.md) and [Endpoint response for time series data](clarify-processing-job-data-format-time-series-response-json.md) sections. If your model container supports batch requests, start with a single-record request and then try two or more records.

The following command demonstrates how to send a request to an endpoint and receive a response using the AWS CLI. The AWS CLI is pre-installed in Studio and SageMaker Notebook instances. To install the AWS CLI, follow the [installation guide](https://aws.amazon.com//cli/).

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type $CONTENT_TYPE \
  --accept $ACCEPT_TYPE \
  --body $REQUEST_DATA \
  $CLI_BINARY_FORMAT \
  /dev/stderr 1>/dev/null
```

The parameters are defined as follows:
+ $ENDPOINT_NAME — The name of the endpoint.
+ $CONTENT_TYPE — The MIME type of the request (model container input).
+ $ACCEPT_TYPE — The MIME type of the response (model container output).
+ $REQUEST_DATA — The request payload string.
+ $CLI_BINARY_FORMAT — The format of the command line interface (CLI) parameter. For AWS CLI v1, this parameter should remain blank. For v2, this parameter should be set to `--cli-binary-format raw-in-base64-out`.

**Note**  
AWS CLI v2 passes binary parameters as base64-encoded strings by default. The following request and response examples to and from the endpoint use AWS CLI v1. 

------
#### [ Example 1 ]

In the following code example, the request consists of a single record.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-json \
  --content-type application/json \
  --accept application/json \
  --body '{"target": [1, 2, 3, 4, 5],
    "start": "2024-01-01 01:00:00"}' \
/dev/stderr 1>/dev/null
```

The following snippet shows the corresponding response output.

```
{'predictions': {'mean': [1, 2, 3, 4, 5]}}
```

------
#### [ Example 2 ]

In the following code example, the request contains two records.

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name test-endpoint-json-2 \
  --content-type application/json \
  --accept application/json \
  --body $'{"instances": [{"target":[1, 2, 3],
    "start":"2024-01-01 01:00:00",
    "dynamic_feat":[[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]]}], {"target":[1, 2, 3],
    "start":"2024-01-02 01:00:00",
    "dynamic_feat":[[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]]}]}' \
dev/stderr 1>/dev/null
```

The response output is the following:

```
{'predictions': [{'mean': [1, 2, 3, 4, 5]}, {'mean': [1, 2, 3, 4, 5]}]}
```

------

# Run SageMaker Clarify Processing Jobs for Bias Analysis and Explainability
<a name="clarify-processing-job-run"></a>

To analyze your data and models for bias and explainability using SageMaker Clarify, you must configure a SageMaker Clarify processing job. This guide shows how to configure the job inputs, outputs, resources, and analysis configuration using the SageMaker Python SDK API `SageMakerClarifyProcessor`. 

The API acts as a high-level wrapper of the SageMaker AI `CreateProcessingJob` API. It hides many of the details that are involved in setting up a SageMaker Clarify processing job, such as retrieving the SageMaker Clarify container image URI and generating the analysis configuration file. The following steps show you how to configure, initialize, and launch a SageMaker Clarify processing job.

**Configure a SageMaker Clarify processing job using the API**

1. Define the configuration objects for each portion of the job configuration. These portions can include the following:
   + The input dataset and output location: [DataConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.DataConfig).
   + The model or endpoint to be analyzed: [ModelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelConfig).
   + Bias analysis parameters: [BiasConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.BiasConfig).
   + SHapley Additive exPlanations (SHAP) analysis parameters: [SHAPConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SHAPConfig).
   + Asymmetric Shapley value analysis parameters (for time series only): [AsymmetricShapleyValueConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.AsymmetricShapleyValueConfig).

   The configuration objects for a SageMaker Clarify processing job vary for different types of data formats and use cases. Configuration examples for tabular data in [CSV](#clarify-processing-job-run-tabular-csv) and [JSON Lines](#clarify-processing-job-run-tabular-jsonlines) format, natural language processing ([NLP](#clarify-processing-job-run-tabular-nlp)), [computer vision](#clarify-processing-job-run-cv) (CV), and time series (TS) problems are provided in the following sections. 

1. Create a `SageMakerClarifyProcessor` object and initialize it with parameters that specify the job resources, such as the number and type of compute instances to use.

   The following code example shows how to create a `SageMakerClarifyProcessor` object and instruct it to use one `ml.c4.xlarge` compute instance to do the analysis.

   ```
   from sagemaker import clarify
   
   clarify_processor = clarify.SageMakerClarifyProcessor(
       role=role,
       instance_count=1,
       instance_type='ml.c4.xlarge',
       sagemaker_session=session,
   )
   ```

1. Call the specific run method of the [SageMakerClarifyProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor.run) object with the configuration objects for your use case to launch the job. These run methods include the following:
   + `run_pre_training_bias`
   + `run_post_training_bias`
   + `run_bias`
   + `run_explainability`
   + `run_bias_and_explainability`

   The `SageMakerClarifyProcessor` object handles several tasks behind the scenes. These tasks include retrieving the SageMaker Clarify container image uniform resource identifier (URI), composing an analysis configuration file based on the provided configuration objects, uploading the file to an Amazon S3 bucket, and [configuring the SageMaker Clarify processing job](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-parameters.html).

   The following sections show how to compute **pre-training** and **post-training bias metrics**, **SHAP values**, and **partial dependence plots** (PDPs). The sections show feature importance for these data types:
   + Tabular datasets in CSV format or JSON Lines format
   + Natural language processing (NLP) datasets
   + Computer vision datasets

A guide to running parallel SageMaker Clarify processing jobs with distributed processing using **Spark** follows these sections.

## Analyze tabular data in CSV format
<a name="clarify-processing-job-run-tabular-csv"></a>

The following examples show how to configure bias analysis and explainability analysis for a tabular dataset in CSV format. In these examples, the incoming dataset has four feature columns and one binary label column, `Target`. The contents of the dataset are as follows. A label value of `1` indicates a positive outcome. 

```
Target,Age,Gender,Income,Occupation
0,25,0,2850,2
1,36,0,6585,0
1,22,1,1759,1
0,48,0,3446,1
...
```

This `DataConfig` object specifies the input dataset and where to store the output. The `s3_data_input_path` parameter can be either the URI of a dataset file or an Amazon S3 URI prefix. If you provide an S3 URI prefix, the SageMaker Clarify processing job recursively collects all Amazon S3 files located under the prefix. The value for `s3_output_path` should be an S3 URI prefix to hold the analysis results. SageMaker AI resolves `s3_output_path` at compile time, so it cannot take the value of a SageMaker AI Pipeline parameter, property, expression, or `ExecutionVariable`, which are resolved at runtime. The following code example shows how to specify a data configuration for the previous sample input dataset.

```
data_config = clarify.DataConfig(
    s3_data_input_path=dataset_s3_uri,
    dataset_type='text/csv',
    headers=['Target', 'Age', 'Gender', 'Income', 'Occupation'],
    label='Target',
    s3_output_path=clarify_job_output_s3_uri,
)
```

### How to compute all pre-training bias metrics for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-pretraining"></a>

The following code sample shows how to configure a `BiasConfig` object to measure bias of the previous sample input towards samples with a `Gender` value of `0`.

```
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name='Gender',
    facet_values_or_threshold=[0],
)
```

The following code example shows how to use a run statement to launch a SageMaker Clarify processing job that computes all [pre-training bias metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html) for an input dataset. 

```
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```

Alternatively, you can choose which metrics to compute by assigning a list of pre-training bias metrics to the `methods` parameter. For example, replacing `methods="all"` with `methods=["CI", "DPL"]` instructs the SageMaker Clarify Processor to compute only [Class Imbalance](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-bias-metric-class-imbalance.html) and [Difference in Proportions of Labels](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-true-label-imbalance.html), as shown in the following sketch.
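
The following sketch reuses the `data_config` and `bias_config` objects defined above to compute only those two metrics.

```
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],  # compute only Class Imbalance and DPL
)
```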

### How to compute all post-training bias metrics for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-posttraining"></a>

You can compute pre-training bias metrics prior to training. However, to compute [post-training bias metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html), you must have a trained model. The following example output is from a binary classification model that outputs data in CSV format. In this example output, each row contains two columns. The first column contains the predicted label, and the second column contains the probability value for that label.

```
0,0.028986845165491
1,0.825382471084594
...
```

In the following example configuration, the `ModelConfig` object instructs the job to deploy the SageMaker AI model to an ephemeral endpoint. The endpoint uses one `ml.m4.xlarge` inference instance. Because the `content_type` and `accept_type` parameters are not set, they automatically use the value of the `dataset_type` parameter, which is `text/csv`.

```
model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
)
```

The following configuration example uses a `ModelPredictedLabelConfig` object with a label index of `0`. This instructs the SageMaker Clarify processing job to locate the predicted label in the first column of the model output. The processing job uses zero-based indexing in this example.

```
predicted_label_config = clarify.ModelPredictedLabelConfig(
    label=0,
)
```

Combined with the previous configuration example, the following code example launches a SageMaker Clarify processing job to compute all the post-training bias metrics.

```
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    methods="all",
)
```

Similarly, you can choose which metrics to compute by assigning a list of post-training bias metrics to the `methods` parameter. For example, replace `methods="all"` with `methods=["DPPL", "DI"]` to compute only [Difference in Positive Proportions in Predicted Labels](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-dppl.html) and [Disparate Impact](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html), as in the following sketch.
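
The sketch below reuses the configuration objects defined above for the subset call.

```
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    methods=["DPPL", "DI"],  # compute only DPPL and Disparate Impact
)
```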

### How to compute all bias metrics for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-all"></a>

The following configuration example shows how to run all pre-training and post-training bias metrics in one SageMaker Clarify processing job.

```
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    pre_training_methods="all",
    post_training_methods="all",
)
```

For an example notebook with instructions on how to run a SageMaker Clarify processing job in SageMaker Studio Classic to detect bias, see [Fairness and Explainability with SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.ipynb).

### How to compute SHAP values for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-shap"></a>

SageMaker Clarify provides feature attributions using the [KernelSHAP algorithm](https://arxiv.org/abs/1705.07874). SHAP analysis requires the probability value or score instead of the predicted label, so this `ModelPredictedLabelConfig` object has a probability index of `1`. This instructs the SageMaker Clarify processing job to extract the probability score from the second column of the model output (using zero-based indexing).

```
probability_config = clarify.ModelPredictedLabelConfig(
    probability=1,
)
```

The `SHAPConfig` object provides SHAP analysis parameters. In this example, the SHAP `baseline` parameter is omitted and the value of the `num_clusters` parameter is `1`. This instructs the SageMaker Clarify Processor to compute one SHAP baseline sample based on clustering the input dataset. If you want to choose the baseline dataset, see [SHAP Baselines for Explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html).

```
shap_config = clarify.SHAPConfig(
    num_clusters=1,
)
```

The following code example launches a SageMaker Clarify processing job to compute SHAP values.

```
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=shap_config,
)
```

For an example notebook with instructions on how to run a SageMaker Clarify processing job in SageMaker Studio Classic to compute SHAP values, see [Fairness and Explainability with SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.ipynb).

### How to compute partial dependence plots (PDPs) for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-pdp"></a>

PDPs show the dependence of the predicted target response on one or more input features of interest while holding all other features constant. An upward sloping line or curve in the PDP indicates that the relationship between the target and the input feature(s) is positive, and the steepness indicates the strength of the relationship. A downward sloping line or curve indicates that as an input feature increases, the target variable decreases. Intuitively, you can interpret the partial dependence as the response of the target variable to each input feature of interest.

The following configuration example uses a `PDPConfig` object to instruct the SageMaker Clarify processing job to compute the importance of the `Income` feature.

```
pdp_config = clarify.PDPConfig(
    features=["Income"],
    grid_resolution=10,
)
```

In the previous example, the `grid_resolution` parameter divides the range of the `Income` feature values into `10` buckets. The SageMaker Clarify processing job will generate PDPs for `Income` split into `10` segments on the x-axis. The y-axis will show the marginal impact of `Income` on the target variable.
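
To get an intuition for what `grid_resolution=10` implies, the following sketch computes ten evenly spaced grid points across an assumed `Income` range. This is one simple interpretation of the bucketing for illustration, not Clarify's internal implementation.

```
import numpy as np

income = np.array([2850, 6585, 1759, 3446])  # hypothetical Income values
grid = np.linspace(income.min(), income.max(), num=10)  # 10 points across the range
print(grid)  # candidate x-axis positions at which partial dependence is evaluated
```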

The following code example launches a SageMaker Clarify processing job to compute PDPs.

```
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=pdp_config,
)
```

For an example notebook with instructions on how to run a SageMaker Clarify processing job in SageMaker Studio Classic to compute PDPs, see [Explainability with SageMaker Clarify - Partial Dependence Plots (PDP)](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/fairness_and_explainability/explainability_with_pdp.ipynb).

### How to compute both SHAP values and PDPs for a CSV dataset
<a name="clarify-processing-job-run-tabular-csv-shap-pdp"></a>

You can compute both SHAP values and PDPs in a single SageMaker Clarify processing job. In the following configuration example, the `top_k_features` parameter of a new `PDPConfig` object is set to `2`. This instructs the SageMaker Clarify processing job to compute PDPs for the `2` features that have the largest global SHAP values. 

```
shap_pdp_config = clarify.PDPConfig(
    top_k_features=2,
    grid_resolution=10,
)
```

The following code example launches a SageMaker Clarify processing job to compute both SHAP values and PDPs.

```
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=[shap_config, shap_pdp_config],
)
```

## Analyze tabular data in JSON Lines format
<a name="clarify-processing-job-run-tabular-jsonlines"></a>

The following examples show how to configure bias analysis and explainability analysis for a tabular dataset in SageMaker AI JSON Lines dense format. See [JSONLINES request format](cdf-inference.md#cm-jsonlines) for more information. In these examples, the incoming dataset contains the same data as in the previous section, but in JSON Lines format. Each line is a valid JSON object. The key `Features` points to an array of feature values, and the key `Label` points to the ground truth label.

```
{"Features":[25,0,2850,2],"Label":0}
{"Features":[36,0,6585,0],"Label":1}
{"Features":[22,1,1759,1],"Label":1}
{"Features":[48,0,3446,1],"Label":0}
...
```

In the following configuration example, the `DataConfig` object specifies the input dataset and where to store the output. 

```
data_config = clarify.DataConfig(
    s3_data_input_path=jsonl_dataset_s3_uri,
    dataset_type='application/jsonlines',
    headers=['Age', 'Gender', 'Income', 'Occupation', 'Target'],
    label='Label',
    features='Features',
    s3_output_path=clarify_job_output_s3_uri,
)
```

In the previous configuration example, the `features` parameter is set to the [JMESPath](https://jmespath.org/) expression `Features` so that the SageMaker Clarify processing job can extract the array of features from each record. The `label` parameter is set to the JMESPath expression `Label` so that the SageMaker Clarify processing job can extract the ground truth label from each record. The `s3_data_input_path` parameter can be either the URI of a dataset file or an Amazon S3 URI prefix. If you provide an S3 URI prefix, the SageMaker Clarify processing job recursively collects all S3 files located under the prefix. The value for `s3_output_path` should be an S3 URI prefix to hold the analysis results. SageMaker AI resolves `s3_output_path` at compile time, so it cannot take the value of a SageMaker AI Pipeline parameter, property, expression, or `ExecutionVariable`, which are resolved at runtime.

You must have a trained model to compute post-training bias metrics or feature importance. The following example is from a binary classification model that outputs JSON Lines data in the example's format. Each row of the model output is a valid JSON object. The key `predicted_label` points to the predicted label, and the key `probability` points to the probability value.

```
{"predicted_label":0,"probability":0.028986845165491}
{"predicted_label":1,"probability":0.825382471084594}
...
```

In the following configuration example, a `ModelConfig` object instructs the SageMaker Clarify processing job to deploy the SageMaker AI model to an ephemeral endpoint. The endpoint uses one `ml.m4.xlarge` inference instance.

```
model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
    content_template='{"Features":$features}',
)
```

In the previous configuration example, the `content_type` and `accept_type` parameters are not set. Therefore, they automatically use the value of the `dataset_type` parameter of the `DataConfig` object, which is `application/jsonlines`. The SageMaker Clarify processing job uses the `content_template` parameter to compose the model input by replacing the `$features` placeholder with the array of features.
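
The following sketch illustrates that substitution with Python's `string.Template`; the exact serialization logic is internal to the SageMaker Clarify processing job, and the sample record is taken from the dataset above.

```
from string import Template

content_template = Template('{"Features":$features}')
features = [25, 0, 2850, 2]  # one record's feature array from the dataset above

model_input = content_template.substitute(features=features)
print(model_input)  # {"Features":[25, 0, 2850, 2]}
```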

The following example configuration shows how to set the label parameter of the `ModelPredictedLabelConfig` object to the JMESPath expression `predicted_label`. This will extract the predicted label from the model output.

```
predicted_label_config = clarify.ModelPredictedLabelConfig(
    label='predicted_label',
)
```

The following example configuration shows how to set the `probability` parameter of the `ModelPredictedLabelConfig` object to the JMESPath expression `probability`. This will extract the score from the model output.

```
probability_config = clarify.ModelPredictedLabelConfig(
    probability='probability',
)
```

To compute bias metrics and feature importance for datasets in JSON Lines format, use the same run statements and configuration objects as in the previous section for CSV datasets. You can run a SageMaker Clarify processing job in SageMaker Studio Classic to detect bias and compute feature importance. For instructions and an example notebook, see [Fairness and Explainability with SageMaker Clarify (JSON Lines Format)](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_jsonlines_format.ipynb).
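
For example, a sketch of the combined bias run using the JSON Lines configuration objects defined above, with `bias_config` reused from the CSV section:

```
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    pre_training_methods="all",
    post_training_methods="all",
)
```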

## Analyze tabular data for NLP explainability
<a name="clarify-processing-job-run-tabular-nlp"></a>

SageMaker Clarify supports explanations for natural language processing (NLP) models. These explanations help you understand which sections of text are the most important for your model predictions. You can explain either the model prediction for a single instance of the input dataset or model predictions from the baseline dataset. To understand and visualize a model's behavior, you can specify multiple levels of granularity. To do this, define the length of the text segment, such as tokens, sentences, or paragraphs.

SageMaker Clarify NLP explainability is compatible with both classification and regression models. You can also use SageMaker Clarify to explain your model's behavior on multi-modal datasets that contain text, categorical, or numerical features. NLP explainability for multi-modal datasets can help you understand how important each feature is to the model's output. SageMaker Clarify supports 62 languages and can handle text that includes multiple languages.

The following example shows an analysis configuration file for computing feature importance for NLP. In this example, the incoming dataset is a tabular dataset in CSV format, with one binary label column and two feature columns.

```
0,2,"Flavor needs work"
1,3,"They taste good"
1,5,"The best"
0,1,"Taste is awful"
...
```

The following configuration example shows how to specify an input dataset in CSV format and output data path using the `DataConfig` object.

```
nlp_data_config = clarify.DataConfig(
    s3_data_input_path=nlp_dataset_s3_uri,
    dataset_type='text/csv',
    headers=['Target', 'Rating', 'Comments'],
    label='Target',
    s3_output_path=clarify_job_output_s3_uri,
)
```

In the previous configuration example, the `s3_data_input_path` parameter can either be a URI of a dataset file or an Amazon S3 URI prefix. If you provide a S3 URI prefix, the SageMaker Clarify processing job recursively collects all S3 files located under the prefix. The value for `s3_output_path` should be an S3 URI prefix to hold the analysis results. SageMaker AI uses the `s3_output_path` while compiling, and cannot take a value of a SageMaker AI Pipeline parameter, property, expression, or `ExecutionVariable`, which are used during runtime.

The following example output was created from a binary classification model trained on the previous input dataset. The classification model accepts CSV data, and it outputs a single score between `0` and `1`.

```
0.491656005382537
0.569582343101501
...
```

The following example shows how to configure the `ModelConfig` object to deploy a SageMaker AI model. In this example, an ephemeral endpoint deploys the model. This endpoint uses one `ml.g4dn.xlarge` inference instance equipped with a GPU for accelerated inference.

```
nlp_model_config = clarify.ModelConfig(
    model_name=your_nlp_model_name,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
)
```

The following example shows how to configure the `ModelPredictedLabelConfig` object to locate the probability (score) in the first column with an index of `0`.

```
probability_config = clarify.ModelPredictedLabelConfig(
    probability=0,
)
```

The following example SHAP configuration shows how to run a token-wise explainability analysis using a model and an input dataset in the English language.

```
text_config = clarify.TextConfig(
    language='english',
    granularity='token',
)
nlp_shap_config = clarify.SHAPConfig(
    baseline=[[4, '[MASK]']],
    num_samples=100,
    text_config=text_config,
)
```

In the previous example, the `TextConfig` object activates the NLP explainability analysis. The `granularity` parameter indicates that the analysis should parse tokens. In English, each token is a word. For other languages, see the [spaCy documentation for tokenization](https://spacy.io/usage/linguistic-features#tokenization), which SageMaker Clarify uses for NLP processing. The previous example also shows how to use an average `Rating` of `4` to set an in-place SHAP baseline instance. A special mask token `[MASK]` is used to replace a token (word) in `Comments`.

In the previous example, if the instance is `2,"Flavor needs work"`, the baseline is an average `Rating` of `4` combined with the mask token, as follows.

```
4, '[MASK]'
```

In the previous example, the SageMaker Clarify explainer iterates through each token and replaces it with the mask, as follows.

```
2,"[MASK] needs work"

4,"Flavor [MASK] work"

4,"Flavor needs [MASK]"
```

Then, the SageMaker Clarify explainer sends each line to your model for predictions, so that the explainer learns the predictions with and without the masked words. The SageMaker Clarify explainer then uses this information to compute the contribution of each token.
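
The following sketch mimics that masking loop for the sample comment. Real tokenization uses spaCy, and the actual SHAP coalition sampling is more involved than masking one token at a time.

```
comment = "Flavor needs work"
tokens = comment.split()  # stand-in for spaCy tokenization

# Mask one token at a time to produce the perturbed instances sent to the model.
masked_instances = [
    " ".join("[MASK]" if j == i else tok for j, tok in enumerate(tokens))
    for i in range(len(tokens))
]
print(masked_instances)
# ['[MASK] needs work', 'Flavor [MASK] work', 'Flavor needs [MASK]']
```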

The following code example launches a SageMaker Clarify processing job to compute SHAP values.

```
clarify_processor.run_explainability(
    data_config=nlp_data_config,
    model_config=nlp_model_config,
    model_scores=probability_config,
    explainability_config=nlp_shap_config,
)
```

For an example notebook with instructions on how to run a SageMaker Clarify processing job in SageMaker Studio Classic for NLP explainability analysis, see [Explaining Text Sentiment Analysis Using SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/text_explainability/text_explainability.ipynb).

## Analyze image data for computer vision explainability
<a name="clarify-processing-job-run-cv"></a>

SageMaker Clarify generates heat maps that provide insights into how your computer vision models classify and detect objects in your images.

In the following configuration example, the input dataset consists of JPEG images.

```
cv_data_config = clarify.DataConfig(
    s3_data_input_path=cv_dataset_s3_uri,
    dataset_type="application/x-image",
    s3_output_path=clarify_job_output_s3_uri,
)
```

In the previous configuration example, the `DataConfig` object sets `s3_data_input_path` to an Amazon S3 URI prefix, and the SageMaker Clarify processing job recursively collects all image files located under that prefix. The `s3_data_input_path` parameter can also be the URI of a single dataset file. The value for `s3_output_path` should be an S3 URI prefix to hold the analysis results. SageMaker AI resolves `s3_output_path` at compile time, so it cannot take the value of a SageMaker AI Pipeline parameter, property, expression, or `ExecutionVariable`, which are resolved at runtime.

### How to explain an image classification model
<a name="clarify-processing-job-run-tabular-cv-image-classification"></a>

The SageMaker Clarify processing job explains images using the KernelSHAP algorithm, which treats the image as a collection of super pixels. Given a dataset consisting of images, the processing job outputs a dataset of images where each image shows the heat map of the relevant super pixels.

The following configuration example shows how to configure an explainability analysis using a SageMaker image classification model. See [Image Classification - MXNet](image-classification.md) for more information.

```
ic_model_config = clarify.ModelConfig(
    model_name=your_cv_ic_model,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    content_type="image/jpeg",
    accept_type="application/json",
)
```

In the previous configuration example, a model named `your_cv_ic_model` has been trained to classify animals in input JPEG images. The `ModelConfig` object in the previous example instructs the SageMaker Clarify processing job to deploy the SageMaker AI model to an ephemeral endpoint. For accelerated inference, the endpoint uses one `ml.p2.xlarge` inference instance equipped with a GPU.

After a JPEG image is sent to an endpoint, the endpoint classifies it and returns a list of scores. Each score is for a category. The `ModelPredictedLabelConfig` object provides the name of each category, as follows.

```
ic_prediction_config = clarify.ModelPredictedLabelConfig(
    label_headers=['bird', 'cat', 'dog'],
)
```

An example output for the previous input of `['bird','cat','dog']` could be `0.3,0.6,0.1`, where `0.3` represents the confidence score for classifying an image as a bird.

The following example SHAP configuration shows how to generate explanations for an image classification problem. It uses an `ImageConfig` object to activate the analysis.

```
ic_image_config = clarify.ImageConfig(
    model_type="IMAGE_CLASSIFICATION",
    num_segments=20,
    segment_compactness=5,
)

ic_shap_config = clarify.SHAPConfig(
    num_samples=100,
    image_config=ic_image_config,
)
```

SageMaker Clarify extracts features using the [Simple Linear Iterative Clustering (SLIC)](https://scikit-image.org/docs/dev/api/skimage.segmentation.html#skimage.segmentation.slic) method from the scikit-image library for image segmentation. In the previous configuration example, the `model_type` parameter indicates the type of image classification problem. The `num_segments` parameter specifies the approximate number of segments to label in the input image. The number of segments is then passed to the SLIC `n_segments` parameter.

Each segment of the image is considered a super pixel, and local SHAP values are computed for each segment. The `segment_compactness` parameter determines the shape and size of the image segments that are generated by the scikit-image SLIC method. The sizes and shapes of the image segments are then passed to the SLIC `compactness` parameter.
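
To preview how these two parameters segment an image, you can call SLIC directly from scikit-image, the same method that SageMaker Clarify uses. In this sketch, the file name is a hypothetical placeholder.

```
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

image = imread("your_image.jpg")  # hypothetical input JPEG

# Mirror the ImageConfig settings: num_segments maps to n_segments,
# segment_compactness maps to compactness.
segments = slic(image, n_segments=20, compactness=5)
print(np.unique(segments).size)  # approximate number of super pixels produced
```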

The following code example launches a SageMaker Clarify processing job to generate heat maps for your images. Positive heat map values show that the feature increased the confidence score of detecting the object. Negative values indicate that the feature decreased the confidence score.

```
clarify_processor.run_explainability(
    data_config=cv_data_config,
    model_config=ic_model_config,
    model_scores=ic_prediction_config,
    explainability_config=ic_shap_config,
)
```

For a sample notebook that uses SageMaker Clarify to classify images and explain its classification, see [Explaining Image Classification with SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/computer_vision/image_classification/explainability_image_classification.ipynb).

### How to explain an object detection model
<a name="clarify-processing-job-run-tabular-cv-object-detection"></a>

A SageMaker Clarify processing job can detect and classify objects in an image and then provide an explanation for the detected object. The process for explanation is as follows.

1. Image objects are first categorized into one of the classes in a specified collection. For example, if an object detection model can recognize cat, dog, and fish, then these three classes are in a collection. This collection is specified by the `label_headers` parameter as follows.

   ```
   clarify.ModelPredictedLabelConfig(
       label_headers=object_categories,
   )
   ```

1. The SageMaker Clarify processing job produces a confidence score for each object. A high confidence score indicates that the object belongs to one of the classes in the specified collection. The SageMaker Clarify processing job also produces the coordinates of a bounding box that delimits the object. For more information about confidence scores and bounding boxes, see [Response Formats](object-detection-in-formats.md#object-detection-recordio).

1. SageMaker Clarify then provides an explanation for the detection of an object in the image scene. It uses the methods described in the **How to explain an image classification model** section.

In the following configuration example, a SageMaker AI object detection model `your_cv_od_model` is trained on JPEG images to identify the animals in them.

```
od_model_config = clarify.ModelConfig(
    model_name=your_cv_od_model,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    content_type="image/jpeg",
    accept_type="application/json",
)
```

The `ModelConfig` object in the previous configuration example instructs the SageMaker Clarify processing job to deploy the SageMaker AI model to an ephemeral endpoint. For accelerated inference, this endpoint uses one `ml.p2.xlarge` inference instance equipped with a GPU.

In the following example configuration, the `ModelPredictedLabelConfig` object provides the name of each category for classification.

```
od_prediction_config = clarify.ModelPredictedLabelConfig(
    label_headers=['bird', 'cat', 'dog'],
)
```

The following example SHAP configuration shows how to generate explanations for an object detection problem.

```
od_image_config = clarify.ImageConfig(
    model_type="OBJECT_DETECTION",
    num_segments=20,
    segment_compactness=5,
    max_objects=5,
    iou_threshold=0.5,
    context=1.0,
)
od_shap_config = clarify.SHAPConfig(
    num_samples=100,
    image_config=od_image_config,
)
```

In the previous example configuration, the `ImageConfig` object activates the analysis. The `model_type` parameter indicates that the type of problem is object detection. For a detailed description of the other parameters, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).

The following code example launches a SageMaker Clarify processing job to generate heat maps for your images. Positive heat map values show that the feature increased the confidence score of detecting the object. Negative values indicate that the feature decreased the confidence score.

```
clarify_processor.run_explainability(
    data_config=cv_data_config,
    model_config=od_model_config,
    model_scores=od_prediction_config,
    explainability_config=od_shap_config,
)
```

For a sample notebook that uses SageMaker Clarify to detect objects in an image and explain its predictions, see [Explaining object detection models with Amazon SageMaker AI Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/computer_vision/object_detection/object_detection_clarify.ipynb).

## Analyze explanations for time series forecasting models
<a name="clarify-processing-job-run-ts"></a>

The following examples show how to configure data in SageMaker AI JSON dense format to explain a time series forecasting model. For more information about JSON formatting, see [JSON request format](cdf-inference.md#cm-json).

```
[
    {
        "item_id": "item1",
        "timestamp": "2019-09-11",
        "target_value": 47650.3,
        "dynamic_feature_1": 0.4576,
        "dynamic_feature_2": 0.2164,
        "dynamic_feature_3": 0.1906,
        "static_feature_1": 3,
        "static_feature_2": 4
    },
    {
        "item_id": "item1",
        "timestamp": "2019-09-12",
        "target_value": 47380.3,
        "dynamic_feature_1": 0.4839,
        "dynamic_feature_2": 0.2274,
        "dynamic_feature_3": 0.1889,
        "static_feature_1": 3,
        "static_feature_2": 4
    },
    {
        "item_id": "item2",
        "timestamp": "2020-04-23",
        "target_value": 35601.4,
        "dynamic_feature_1": 0.5264,
        "dynamic_feature_2": 0.3838,
        "dynamic_feature_3": 0.4604,
        "static_feature_1": 1,
        "static_feature_2": 2
    }
]
```

### Data config
<a name="clarify-processing-job-run-ts-dataconfig"></a>

Use `TimeSeriesDataConfig` to communicate to your explainability job how to parse data correctly from the passed input dataset, as shown in the following example configuration:

```
time_series_data_config = clarify.TimeSeriesDataConfig(
    target_time_series='[].target_value',
    item_id='[].item_id',
    timestamp='[].timestamp',
    related_time_series=['[].dynamic_feature_1', '[].dynamic_feature_2', '[].dynamic_feature_3'],
    static_covariates=['[].static_feature_1', '[].static_feature_2'],
    dataset_format='timestamp_records',
)
```

### Asymmetric Shapley value config
<a name="clarify-processing-job-run-ts-asymm"></a>

Use `AsymmetricShapleyValueConfig` to define arguments for time series forecasting model explanation analysis such as baseline, direction, granularity, and number of samples. Baseline values are set for all three types of data: related time series, static covariates, and target time series. The `AsymmetricShapleyValueConfig` config informs the SageMaker Clarify processor how to compute feature attributions for one item at a time. The following configuration shows an example definition of `AsymmetricShapleyValueConfig`. 

```
asymmetric_shapley_value_config = AsymmetricShapleyValueConfig(
    direction="chronological",
    granularity="fine-grained",
    num_samples=10,
    baseline={
        "related_time_series": "zero", 
        "static_covariates": {
            "item1": [0, 0], "item2": [0, 0]
        }, 
        "target_time_series": "zero"
    },
)
```

The values you provide to `AsymmetricShapleyValueConfig` are passed to the analysis config as an entry in `methods` with key `asymmetric_shapley_value`.
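
For reference, such an entry might look like the following sketch in the generated analysis configuration. The exact rendered field names are assumptions based on the configuration above; see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md) for the authoritative schema.

```
"methods": {
    "asymmetric_shapley_value": {
        "direction": "chronological",
        "granularity": "fine-grained",
        "num_samples": 10,
        "baseline": {
            "related_time_series": "zero",
            "static_covariates": {"item1": [0, 0], "item2": [0, 0]},
            "target_time_series": "zero"
        }
    }
}
```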

### Model config
<a name="clarify-processing-job-run-ts-model"></a>

You can control the structure of the payload sent from the SageMaker Clarify processor. In the following code sample, a `ModelConfig` configuration object directs a time series forecasting explainability job to aggregate records into the `content_template` `'{"instances": $records}'`, where the structure of each record is defined by the `record_template` `'{"start": $start_time, "target": $target_time_series, "dynamic_feat": $related_time_series, "cat": $static_covariates}'`. Note that `$start_time`, `$target_time_series`, `$related_time_series`, and `$static_covariates` are internal tokens used to map dataset values to endpoint request values.

```
model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
    record_template='{"start": $start_time, "target": $target_time_series, "dynamic_feat": $related_time_series, "cat": $static_covariates}',
    content_template='{"instances": $records}',,
    time_series_model_config=TimeSeriesModelConfig(
        forecast={'forecast': 'predictions[*].mean[:2]'}
    )
)
```

Similarly, the `forecast` attribute in `TimeSeriesModelConfig`, passed to the analysis config with the key `time_series_predictor_config`, is used to extract the model forecast from the endpoint response. For instance, an example endpoint batch response could be the following:

```
{
    "predictions": [
        {"mean": [13.4, 3.6, 1.0]}, 
        {"mean": [23.0, 4.7, 3.0]}, 
        {"mean": [3.4, 5.6, 2.0]}
    ]
}
```

If the JMESPath expression provided for `forecast` is `'predictions[*].mean[:2]'`, the forecast value is parsed as follows:

```
[[13.4, 3.6], [23.0, 4.7], [3.4, 5.6]]
```

## How to run parallel SageMaker Clarify processing jobs
<a name="clarify-processing-job-run-spark"></a>

When working with large datasets, you can use [Apache Spark](https://spark.apache.org/) to increase the speed of your SageMaker Clarify processing jobs. Spark is a unified analytics engine for large-scale data processing. When you request more than one instance per SageMaker Clarify processor, SageMaker Clarify uses the distributed computing capabilities from Spark.

The following configuration example shows how to use `SageMakerClarifyProcessor` to create a SageMaker Clarify processor with `5` compute instances. Any job associated with this `SageMakerClarifyProcessor` runs using Spark distributed processing.

```
from sagemaker import clarify

spark_clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=5,
    instance_type='ml.c5.xlarge',
)
```

If you set the `save_local_shap_values` parameter of [SHAPConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SHAPConfig) to `True`, the SageMaker Clarify processing job saves the local SHAP value as multiple part files in the job output location. 

To associate the local SHAP values with the input dataset instances, use the `joinsource` parameter of `DataConfig`. If you add more compute instances, we recommend that you also increase the `instance_count` of [ModelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelConfig) for the ephemeral endpoint. This prevents the Spark workers' concurrent inference requests from overwhelming the endpoint. Specifically, we recommend that you use a one-to-one ratio of endpoint instances to processing instances, as in the following sketch.
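
The sketch below pairs the five-instance processor above with a five-instance ephemeral endpoint, reusing `your_model` from the earlier examples.

```
spark_model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=5,  # one endpoint instance per Spark processing instance
)
```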

# Analysis Results
<a name="clarify-processing-job-analysis-results"></a>

After a SageMaker Clarify processing job is finished, you can download the output files to inspect them, or you can visualize the results in SageMaker Studio Classic. The following topics describe the analysis results that SageMaker Clarify generates, including the schema and the reports generated by bias analysis, SHAP analysis, computer vision explainability analysis, and partial dependence plots (PDPs) analysis. If the analysis configuration contains parameters for computing multiple analyses, then the results are aggregated into one analysis file and one report file.

The SageMaker Clarify processing job output directory contains the following files:
+ `analysis.json` – A file that contains bias metrics and feature importance in JSON format.
+ `report.ipynb` – A static notebook that contains code to help you visualize bias metrics and feature importance.
+ `explanations_shap` – A directory that is created and contains automatically generated files based on your specific analysis configuration. For example, if you activate the `save_local_shap_values` parameter, then per-instance local SHAP values are saved to the `explanations_shap` directory. As another example, if your analysis configuration does not contain a value for the SHAP baseline parameter, the SageMaker Clarify explainability job computes a baseline by clustering the input dataset. It then saves the generated baseline to the directory.

For more detailed information, see the following sections.

**Topics**
+ [Bias analysis](#clarify-processing-job-analysis-results-bias)
+ [SHAP analysis](#clarify-processing-job-analysis-results-shap)
+ [Computer vision (CV) explainability analysis](#clarify-processing-job-analysis-results-cv)
+ [Partial dependence plots (PDPs) analysis](#clarify-processing-job-analysis-results-pdp)
+ [Asymmetric Shapley values](#clarify-processing-job-analysis-results-asymmshap)

## Bias analysis
<a name="clarify-processing-job-analysis-results-bias"></a>

Amazon SageMaker Clarify uses the terminology documented in [Amazon SageMaker Clarify Terms for Bias and Fairness](clarify-detect-data-bias.md#clarify-bias-and-fairness-terms) to discuss bias and fairness.

### Schema for the analysis file
<a name="clarify-processing-job-analysis-results-bias-schema"></a>

The analysis file is in JSON format and is organized into two sections: pre-training bias metrics and post-training bias metrics. The parameters for pre-training and post-training bias metrics are as follows.
+ **pre_training_bias_metrics** – Parameters for pre-training bias metrics. For more information, see [Pre-training Bias Metrics](clarify-measure-data-bias.md) and [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
  + **label** – The ground truth label name defined by the `label` parameter of the analysis configuration.
  + **label_value_or_threshold** – A string containing the label values or interval defined by the `label_values_or_threshold` parameter of the analysis configuration. For example, if the value `1` is provided for a binary classification problem, then the string is `1`. If multiple values `[1,2]` are provided for a multi-class problem, then the string is `1,2`. If a threshold `40` is provided for a regression problem, then the string is an interval like `(40, 68]`, in which `68` is the maximum value of the label in the input dataset.
  + **facets** – The section contains several key-value pairs, where the key corresponds to the facet name defined by the `name_or_index` parameter of the facet configuration, and the value is an array of facet objects. Each facet object has the following members:
    + **value_or_threshold** – A string containing the facet values or interval defined by the `value_or_threshold` parameter of the facet configuration.
    + **metrics** – The section contains an array of bias metric elements, and each bias metric element has the following attributes:
      + **name** – The short name of the bias metric. For example, `CI`. 
      + **description** – The full name of the bias metric. For example, `Class Imbalance (CI)`.
      + **value** – The bias metric value, or the JSON null value if the bias metric is not computed for a particular reason. The values ±∞ are represented as the strings `∞` and `-∞`, respectively.
      + **error** – An optional error message that explains why the bias metric was not computed.
+ **post_training_bias_metrics** – The section contains the post-training bias metrics and follows a similar layout and structure to the pre-training section. For more information, see [Post-training Data and Model Bias Metrics](clarify-measure-post-training-bias.md).

The following is an example of an analysis configuration that will calculate both pre-training and post-training bias metrics.

```
{
    "version": "1.0",
    "pre_training_bias_metrics": {
        "label": "Target",
        "label_value_or_threshold": "1",
        "facets": {
            "Gender": [{
                "value_or_threshold": "0",
                "metrics": [
                    {
                        "name": "CDDL",
                        "description": "Conditional Demographic Disparity in Labels (CDDL)",
                        "value": -0.06
                    },
                    {
                        "name": "CI",
                        "description": "Class Imbalance (CI)",
                        "value": 0.6
                    },
                    ...
                ]
            }]
        }
    },
    "post_training_bias_metrics": {
        "label": "Target",
        "label_value_or_threshold": "1",
        "facets": {
            "Gender": [{
                "value_or_threshold": "0",
                "metrics": [
                    {
                        "name": "AD",
                        "description": "Accuracy Difference (AD)",
                        "value": -0.13
                    },
                    {
                        "name": "CDDPL",
                        "description": "Conditional Demographic Disparity in Predicted Labels (CDDPL)",
                        "value": 0.04
                    },
                    ...
                ]
            }]
        }
    }
}
```

### Bias analysis report
<a name="clarify-processing-job-analysis-results-bias-report"></a>

The bias analysis report includes several tables and diagrams that contain detailed explanations and descriptions. These include, but are not limited to, the distribution of label values, the distribution of facet values, high-level model performance diagram, a table of bias metrics, and their descriptions. For more information about bias metrics and how to interpret them, see the [Learn How Amazon SageMaker Clarify Helps Detect Bias](https://aws.amazon.com/blogs/machine-learning/learn-how-amazon-sagemaker-clarify-helps-detect-bias/).

## SHAP analysis
<a name="clarify-processing-job-analysis-results-shap"></a>

SageMaker Clarify processing jobs use the Kernel SHAP algorithm to compute feature attributions. The SageMaker Clarify processing job produces both local and global SHAP values. These help to determine the contribution of each feature towards model predictions. Local SHAP values represent the feature importance for each individual instance, while global SHAP values aggregate the local SHAP values across all instances in the dataset. For more information about SHAP values and how to interpret them, see [Feature Attributions that Use Shapley Values](clarify-shapley-values.md).

### Schema for the SHAP analysis file
<a name="clarify-processing-job-analysis-results-shap-schema"></a>

Global SHAP analysis results are stored in the explanations section of the analysis file, under the `kernel_shap` method. The different parameters of the SHAP analysis file are as follows:
+ **explanations** – The section of the analysis file that contains the feature importance analysis results.
  + **kernel_shap** – The section of the analysis file that contains the global SHAP analysis result.
    + **global_shap_values** – A section of the analysis file that contains several key-value pairs. Each key in the key-value pair represents a feature name from the input dataset. Each value in the key-value pair corresponds to the feature's global SHAP value. The global SHAP value is obtained by aggregating the per-instance SHAP values of the feature using the `agg_method` configuration. If the `use_logit` configuration is activated, then the value is calculated using the logistic regression coefficients, which can be interpreted as log-odds ratios.
    + **expected_value** – The mean prediction of the baseline dataset. If the `use_logit` configuration is activated, then the value is calculated using the logistic regression coefficients.
    + **global_top_shap_text** – Used for NLP explainability analysis. A section of the analysis file that includes a set of key-value pairs. SageMaker Clarify processing jobs aggregate the SHAP values of each token and then select the top tokens based on their global SHAP values. The `max_top_tokens` configuration defines the number of tokens to be selected. 

      Each of the selected top tokens has a key-value pair. The key in the key-value pair corresponds to a top token's text feature name, and the value is the token's global SHAP value. For an example of a `global_top_shap_text` key-value pair, see the following output.

The following example shows output from the SHAP analysis of a tabular dataset.

```
{
    "version": "1.0",
    "explanations": {
        "kernel_shap": {
            "Target": {
                 "global_shap_values": {
                    "Age": 0.022486410860333206,
                    "Gender": 0.007381025261958729,
                    "Income": 0.006843906804137847,
                    "Occupation": 0.006843906804137847,
                    ...
                },
                "expected_value": 0.508233428001
            }
        }
    }
}
```

The following example shows output from the SHAP analysis of a text dataset. The output corresponding to the column `Comments` is an example of output that is generated after analysis of a text feature.

```
{
    "version": "1.0",
    "explanations": {
        "kernel_shap": {
            "Target": {
               "global_shap_values": {
                    "Rating": 0.022486410860333206,
                    "Comments": 0.058612104851485144,
                    ...
                },
                "expected_value": 0.46700941970297033,
                "global_top_shap_text": {
                    "charming": 0.04127962903247833,
                    "brilliant": 0.02450240786522321,
                    "enjoyable": 0.024093569652715457,
                    ...
                }
            }
        }
    }
}
```

### Schema for the generated baseline file
<a name="clarify-processing-job-analysis-results-baseline-schema"></a>

When a SHAP baseline configuration is not provided, the SageMaker Clarify processing job generates a baseline dataset. SageMaker Clarify uses a distance-based clustering algorithm to generate a baseline dataset from clusters created from the input dataset. The resulting baseline dataset is saved in a CSV file, located at `explanations_shap/baseline.csv`. This output file contains a header row and several instances based on the `num_clusters` parameter that is specified in the analysis configuration. The baseline dataset only consists of feature columns. The following example shows a baseline created by clustering the input dataset.

```
Age,Gender,Income,Occupation
35,0,2883,1
40,1,6178,2
42,0,4621,0
```
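
To have the processing job generate a baseline like this, omit the SHAP baseline from your configuration. The following is a minimal sketch using the SageMaker Python SDK's `SHAPConfig`; the parameter values shown are illustrative, not recommendations:

```
from sagemaker import clarify

# Omitting the baseline makes SageMaker Clarify generate one by clustering
# the input dataset; num_clusters controls how many baseline instances are
# created. The values below are illustrative.
shap_config = clarify.SHAPConfig(
    num_samples=100,
    agg_method="mean_abs",
    num_clusters=3,
    save_local_shap_values=True,
)
```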

### Schema for local SHAP values from tabular dataset explainability analysis
<a name="clarify-processing-job-analysis-results-tabular-schema"></a>

For tabular datasets, if a single compute instance is used, the SageMaker Clarify processing job saves the local SHAP values to a CSV file named `explanations_shap/out.csv`. If you use multiple compute instances, local SHAP values are saved to several CSV files in the `explanations_shap` directory.

An output file containing local SHAP values has one row of local SHAP values per instance, with the columns defined by the headers. The headers follow the naming convention `Feature_Label`, where the feature name is appended by an underscore, followed by the name of your target variable.

For multi-class problems, the feature names in the header vary first, then the labels. For example, given two features `F1` and `F2`, and two classes `L1` and `L2`, the headers are `F1_L1`, `F2_L1`, `F1_L2`, and `F2_L2`. If the analysis configuration contains a value for the `joinsource_name_or_index` parameter, then the key column used in the join is appended to the end of the header name. This allows mapping of the local SHAP values to instances of the input dataset. An example of an output file containing SHAP values follows.

```
Age_Target,Gender_Target,Income_Target,Occupation_Target
0.003937908,0.001388849,0.00242389,0.00274234
-0.0052784,0.017144491,0.004480645,-0.017144491
...
```
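
Because each header identifies a feature-label pair, you can also aggregate the local values yourself. The following sketch assumes a local copy of `explanations_shap/out.csv` and reproduces a `mean_abs`-style aggregation with pandas:

```
import pandas as pd

# Local SHAP values written by a single-instance job; each column is a
# Feature_Label pair. Averaging absolute values per column mirrors the
# "mean_abs" aggregation used for global SHAP values.
local_shap = pd.read_csv("explanations_shap/out.csv")
print(local_shap.abs().mean().sort_values(ascending=False))
```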

### Schema for local SHAP values from NLP explainability analysis
<a name="clarify-processing-job-analysis-results-nlp-schema"></a>

For NLP explainability analysis, if a single compute instance is used, the SageMaker Clarify processing job saves local SHAP values to a JSON Lines file named `explanations_shap/out.jsonl`. If you use multiple compute instances, the local SHAP values are saved to several JSON Lines files in the `explanations_shap` directory.

Each file containing local SHAP values has several data lines, and each line is a valid JSON object. The JSON object has the following attributes:
+ **explanations** – The section of the analysis file that contains an array of Kernel SHAP explanations for a single instance. Each element in the array has the following members:
  + **feature\_name** – The header name of the features provided by the headers configuration.
  + **data\_type** – The feature type inferred by the SageMaker Clarify processing job. Valid values include `numerical`, `categorical`, and `free_text` (for text features).
  + **attributions** – A feature-specific array of attribution objects. A text feature can have multiple attribution objects, each for a unit defined by the `granularity` configuration. The attribution object has the following members:
    + **attribution** – A class-specific array of attribution values.
    + **description** – (For text features) The description of the text units.
      + **partial\_text** – The portion of the text explained by the SageMaker Clarify processing job.
      + **start\_idx** – A zero-based index that identifies the array location indicating the beginning of the partial text fragment.

The following is an example of a single line from a local SHAP values file, beautified to enhance its readability.

```
{
    "explanations": [
        {
            "feature_name": "Rating",
            "data_type": "categorical",
            "attributions": [
                {
                    "attribution": [0.00342270632248735]
                }
            ]
        },
        {
            "feature_name": "Comments",
            "data_type": "free_text",
            "attributions": [
                {
                    "attribution": [0.005260534499999983],
                    "description": {
                        "partial_text": "It's",
                        "start_idx": 0
                    }
                },
                {
                    "attribution": [0.00424190349999996],
                    "description": {
                        "partial_text": "a",
                        "start_idx": 5
                    }
                },
                {
                    "attribution": [0.010247314500000014],
                    "description": {
                        "partial_text": "good",
                        "start_idx": 6
                    }
                },
                {
                    "attribution": [0.006148907500000005],
                    "description": {
                        "partial_text": "product",
                        "start_idx": 10
                    }
                }
            ]
        }
    ]
}
```
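
To work with these records programmatically, you can read the file line by line with standard JSON tooling. The following is a minimal sketch that assumes a local copy named `out.jsonl` and prints each text unit together with its attribution:

```
import json

# "out.jsonl" is an assumed local copy of explanations_shap/out.jsonl.
with open("out.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for explanation in record["explanations"]:
            # Only text features carry per-unit descriptions.
            if explanation["data_type"] != "free_text":
                continue
            for attr in explanation["attributions"]:
                text = attr["description"]["partial_text"]
                print(f"{text!r}: {attr['attribution'][0]:+.6f}")
```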

### SHAP analysis report
<a name="clarify-processing-job-analysis-results-shap-report"></a>

The SHAP analysis report provides a bar chart of a maximum of `10` top global SHAP values. The following chart example shows the SHAP values for the top `4` features.

![\[Horizontal bar chart of global SHAP values calculated for target variable of the top four features.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/shap-chart.png)


## Computer vision (CV) explainability analysis
<a name="clarify-processing-job-analysis-results-cv"></a>

SageMaker Clarify computer vision explainability takes a dataset consisting of images and treats each image as a collection of super pixels. After analysis, the SageMaker Clarify processing job outputs a dataset of images where each image shows the heat map of the super pixels.

The following example shows an input speed limit sign on the left and a heat map that shows the magnitude of the SHAP values on the right. These SHAP values were calculated by an image recognition ResNet-18 model trained to recognize [German traffic signs](https://benchmark.ini.rub.de/gtsrb_news.html). The German Traffic Sign Recognition Benchmark (GTSRB) dataset is provided in the paper [Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition](https://www.sciencedirect.com/science/article/abs/pii/S0893608012000457?via%3Dihub). In the example output, large positive values indicate that the super pixel has a strong positive correlation with the model prediction. Large negative values indicate that the super pixel has a strong negative correlation with the model prediction. The larger the absolute value of the SHAP value shown in the heat map, the stronger the relationship between the super pixel and the model prediction.

![\[Input image of speed limit sign and resulting heat map of SHAP values from a Resnet-18 model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/shap_speed-limit-70.png)


For more information, see the sample notebooks [Explaining Image Classification with SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/computer_vision/image_classification/explainability_image_classification.ipynb) and [Explaining object detection models with Amazon SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/computer_vision/object_detection/object_detection_clarify.ipynb).

## Partial dependence plots (PDPs) analysis
<a name="clarify-processing-job-analysis-results-pdp"></a>

Partial dependence plots show the dependence of the predicted target response on a set of input features of interest. These are marginalized over the values of all other input features, referred to as the complement features. Intuitively, you can interpret the partial dependence as the expected target response as a function of each input feature of interest.

### Schema for the analysis file
<a name="clarify-processing-job-analysis-results-pdp-schema"></a>

The PDP values are stored in the `explanations` section of the analysis file under the `pdp` method. The parameters for `explanations` are as follows:
+ **explanations** – The section of the analysis file that contains feature importance analysis results.
  + **pdp** – The section of the analysis file that contains an array of PDP explanations, one per analyzed feature. Each element of the array has the following members:
    + **feature\_name** – The header name of the features provided by the `headers` configuration.
    + **data\_type** – The feature type inferred by the SageMaker Clarify processing job. Valid values for `data_type` include `numerical` and `categorical`.
    + **feature\_values** – Contains the values present in the feature. If the `data_type` inferred by SageMaker Clarify is categorical, `feature_values` contains all of the unique values that the feature could be. If the `data_type` inferred by SageMaker Clarify is numerical, `feature_values` contains a list of the central values of the generated buckets. The `grid_resolution` parameter determines the number of buckets used to group the feature column values.
    + **data\_distribution** – An array of percentages, where each value is the percentage of instances that a bucket contains. The `grid_resolution` parameter determines the number of buckets. The feature column values are grouped into these buckets.
    + **model\_predictions** – An array of model predictions, where each element of the array is an array of predictions that corresponds to one class in the model’s output.
    + **label\_headers** – The label headers provided by the `label_headers` configuration.
    + **error** – An error message generated if the PDP values are not computed for a particular reason. This error message replaces the content contained in the `feature_values`, `data_distribution`, and `model_predictions` fields.

The following is example output from an analysis file containing a PDP analysis result.

```
{
    "version": "1.0",
    "explanations": {
        "pdp": [
            {
                "feature_name": "Income",
                "data_type": "numerical",
                "feature_values": [1046.9, 2454.7, 3862.5, 5270.2, 6678.0, 8085.9, 9493.6, 10901.5, 12309.3, 13717.1],
                "data_distribution": [0.32, 0.27, 0.17, 0.1, 0.045, 0.05, 0.01, 0.015, 0.01, 0.01],
                "model_predictions": [[0.69, 0.82, 0.82, 0.77, 0.77, 0.46, 0.46, 0.45, 0.41, 0.41]],
                "label_headers": ["Target"]
            },
            ...
        ]
    }
}
```
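
Because `feature_values` and `model_predictions` are parallel arrays, you can also plot the curves yourself. The following is a minimal sketch that assumes a local copy of the analysis file named `analysis.json` and that `label_headers` is present, as in the preceding example; categorical features may need different plot handling:

```
import json
import matplotlib.pyplot as plt

with open("analysis.json") as f:
    analysis = json.load(f)

# Plot one partial dependence curve per label header for each feature;
# skip entries where the job reported an error instead of values.
for pdp in analysis["explanations"]["pdp"]:
    if "error" in pdp:
        continue
    for label, preds in zip(pdp["label_headers"], pdp["model_predictions"]):
        plt.plot(pdp["feature_values"], preds, marker="o", label=label)
    plt.xlabel(pdp["feature_name"])
    plt.ylabel("Model prediction")
    plt.legend()
    plt.show()
```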

### PDP analysis report
<a name="clarify-processing-job-analysis-results-pdp-report"></a>

You can generate an analysis report containing a PDP chart for each feature. The PDP chart plots `feature_values` along the x-axis, and it plots `model_predictions` along the y-axis. For multi-class models, `model_predictions` is an array, and each element of this array corresponds to one of the model prediction classes.

The following is an example of a PDP chart for the feature `Age`. In the example output, the PDP shows the feature values grouped into buckets, where the number of buckets is determined by `grid_resolution`. The buckets of feature values are plotted against the model predictions. In this example, the higher feature values all yield the same model prediction values.

![\[Line chart showing how model predictions vary against feature_values for 10 unique grid points.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/pdp-chart.png)


## Asymmetric Shapley values
<a name="clarify-processing-job-analysis-results-asymmshap"></a>

SageMaker Clarify processing jobs use the asymmetric Shapley value algorithm to compute time series forecasting model explanation attributions. This algorithm determines the contribution of input features at each time step toward the forecasted predictions.

### Schema for the asymmetric Shapley values analysis file
<a name="clarify-processing-job-analysis-results-shap-schema-assym"></a>

Asymmetric Shapley value results are stored in an Amazon S3 bucket. You can find the location of this bucket in the section *explanations* of the analysis file. This section contains the feature importance analysis results. The following parameters are included in the asymmetric Shapley value analysis file.
+ **asymmetric\_shapley\_value** — The section of the analysis file that contains metadata about the explanation job results, including the following:
  + **explanation\_results\_path** — The Amazon S3 location with the explanation results
  + **direction** — The user-provided configuration for the config value of `direction`
  + **granularity** — The user-provided configuration for the config value of `granularity`

The following snippet shows the previously mentioned parameters in an example analysis file:

```
{
    "version": "1.0",
    "explanations": {
        "asymmetric_shapley_value": {
            "explanation_results_path": EXPLANATION_RESULTS_S3_URI,
           "direction": "chronological",
           "granularity": "timewise",
        }
    }
}
```

The following sections describe how the explanation results structure depends on the value of `granularity` in the config.

#### Timewise granularity
<a name="clarify-processing-job-analysis-results-shap-schema-timewise"></a>

When the granularity is `timewise`, the output is represented in the following structure. The `scores` value represents the attribution for each timestamp. The `offset` value represents the prediction of the model on the baseline data and describes the behavior of the model when it does not receive data.

The following snippet shows example output for a model that makes predictions for two time steps. Therefore, all attributions are lists of two elements, where the first entry refers to the first predicted time step.

```
{
    "item_id": "item1",
    "offset": [1.0, 1.2],
    "explanations": [
        {"timestamp": "2019-09-11 00:00:00", "scores": [0.11, 0.1]},
        {"timestamp": "2019-09-12 00:00:00", "scores": [0.34, 0.2]},
        {"timestamp": "2019-09-13 00:00:00", "scores": [0.45, 0.3]},
    ]
}
{
    "item_id": "item2",
    "offset": [1.0, 1.2],
    "explanations": [
        {"timestamp": "2019-09-11 00:00:00", "scores": [0.51, 0.35]},
        {"timestamp": "2019-09-12 00:00:00", "scores": [0.14, 0.22]},
        {"timestamp": "2019-09-13 00:00:00", "scores": [0.46, 0.31]},
    ]
}
```

#### Fine-grained granularity
<a name="clarify-processing-job-analysis-results-shap-schema-fine"></a>

The following example demonstrates attribution results when granularity is `fine_grained`. The `offset` value has the same meaning as described in the previous section. The attributions are computed for each input feature at each timestamp for a target time series and related time series, if available, and for each static covariate, if available.

```
{
    "item_id": "item1",
    "offset": [1.0, 1.2],
    "explanations": [
        {"feature_name": "tts_feature_name_1", "timestamp": "2019-09-11 00:00:00", "scores": [0.11, 0.11]},
        {"feature_name": "tts_feature_name_1", "timestamp": "2019-09-12 00:00:00", "scores": [0.34, 0.43]},
        {"feature_name": "tts_feature_name_2", "timestamp": "2019-09-11 00:00:00", "scores": [0.15, 0.51]},
        {"feature_name": "tts_feature_name_2", "timestamp": "2019-09-12 00:00:00", "scores": [0.81, 0.18]},
        {"feature_name": "rts_feature_name_1", "timestamp": "2019-09-11 00:00:00", "scores": [0.01, 0.10]},
        {"feature_name": "rts_feature_name_1", "timestamp": "2019-09-12 00:00:00", "scores": [0.14, 0.41]},
        {"feature_name": "rts_feature_name_1", "timestamp": "2019-09-13 00:00:00", "scores": [0.95, 0.59]},
        {"feature_name": "rts_feature_name_1", "timestamp": "2019-09-14 00:00:00", "scores": [0.95, 0.59]},
        {"feature_name": "rts_feature_name_2", "timestamp": "2019-09-11 00:00:00", "scores": [0.65, 0.56]},
        {"feature_name": "rts_feature_name_2", "timestamp": "2019-09-12 00:00:00", "scores": [0.43, 0.34]},
        {"feature_name": "rts_feature_name_2", "timestamp": "2019-09-13 00:00:00", "scores": [0.16, 0.61]},
        {"feature_name": "rts_feature_name_2", "timestamp": "2019-09-14 00:00:00", "scores": [0.95, 0.59]},
        {"feature_name": "static_covariate_1", "scores": [0.6, 0.1]},
        {"feature_name": "static_covariate_2", "scores": [0.1, 0.3]},
    ]
}
```

For both `timewise` and `fine_grained` use cases, the results are stored in JSON Lines (.jsonl) format.
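
The following is a minimal sketch for reading timewise results; it assumes the file at `explanation_results_path` has been downloaded locally as `results.jsonl`, and it sums the per-timestamp scores for each forecasted time step:

```
import json

# "results.jsonl" is an assumed local copy of the explanation results.
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # One total per forecasted time step, matching the offset length.
        totals = [0.0] * len(record["offset"])
        for explanation in record["explanations"]:
            for i, score in enumerate(explanation["scores"]):
                totals[i] += score
        print(record["item_id"], totals)
```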

# Troubleshoot SageMaker Clarify Processing Jobs
<a name="clarify-processing-job-run-troubleshooting"></a>

 If you encounter failures with SageMaker Clarify processing jobs, consult the following scenarios to help identify the issue.

**Note**  
The failure reason and exit message are intended to contain descriptive messages and exceptions, if encountered, during the run. A common reason for errors is that parameters are either missing or not valid. If you encounter unclear, confusing, or misleading messages or are unable to find a solution, submit feedback.

**Topics**
+ [Processing job fails to finish](#clarify-troubleshooting-job-fails)
+ [Processing job is taking too long to run](#clarify-troubleshooting-job-long)
+ [Processing job finishes without results and you get a CloudWatch warning message](#clarify-troubleshooting-no-results-and-warning)
+ [Error message for invalid analysis configuration](#clarify-troubleshooting-invalid-analysis-configuration)
+ [Bias metric computation fails for several or all metrics](#clarify-troubleshooting-bias-metric-computation-fails)
+ [Mismatch between analysis config and dataset/model input/output](#clarify-troubleshooting-mismatch-analysis-config-and-data-model)
+ [Model returns 500 Internal Server Error or container falls back to per-record predictions due to model error](#clarify-troubleshooting-500-internal-server-error)
+ [Execution role is invalid](#clarify-troubleshooting-execution-role-invalid)
+ [Failed to download data](#clarify-troubleshooting-data-download)
+ [Could not connect to SageMaker AI](#clarify-troubleshooting-connection)

## Processing job fails to finish
<a name="clarify-troubleshooting-job-fails"></a>

If the processing job fails to finish, you can try the following:
+ Inspect the job logs directly in the notebook where you ran the job. The job logs are located in the output of the notebook cell where you initiated the run.
+ Inspect the job logs in CloudWatch.
+ Add the following line in your notebook to describe the last processing job and look for the failure reason and exit message:
  + `clarify_processor.jobs[-1].describe()`
+ Run the following AWS CLI command to describe the processing job and look for the failure reason and exit message:
  + `aws sagemaker describe-processing-job --processing-job-name <processing-job-id>`
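
As an alternative to the AWS CLI, you can retrieve the same fields with the AWS SDK for Python (Boto3). The job name `my-clarify-job` below is a placeholder:

```
import boto3

sagemaker_client = boto3.client("sagemaker")
# "my-clarify-job" is a placeholder; use your processing job's name.
job = sagemaker_client.describe_processing_job(ProcessingJobName="my-clarify-job")
print(job["ProcessingJobStatus"])
print(job.get("FailureReason"))
print(job.get("ExitMessage"))
```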

## Processing job is taking too long to run
<a name="clarify-troubleshooting-job-long"></a>

If your processing job is taking too long to run, use the following guidance to find the root cause.

Check to see if your resource configuration is sufficient to handle your computing load. To speed up your job, try the following:
+ Use a larger instance type. SageMaker Clarify queries the model repeatedly, and a larger instance can significantly reduce your computation time. For a list of available instances, their memory sizes, bandwidth, and other performance details, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).
+ Add more instances. SageMaker Clarify can use multiple instances to explain multiple input data points in parallel. To enable parallel computing, set your `instance_count` to more than `1` when you call `SageMakerClarifyProcessor`. For more information, see [How to run parallel SageMaker Clarify processing jobs](clarify-processing-job-run.md#clarify-processing-job-run-spark). If you increase your instance count, monitor the performance of your endpoint to check that it can handle the increased load. For more information, see [Capture data from real-time endpoint](model-monitor-data-capture-endpoint.md). 
+ If you're computing SHapley Additive exPlanations (SHAP) values, reduce the `num_samples` parameter in your analysis configuration file. The number of samples directly affects the following:
  + The size of the synthetic datasets that are sent to your endpoint
  + Job runtime

  Reducing the number of samples can also lead to reduced accuracy in estimating SHAP values. For more information, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
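
The following is a minimal sketch of these two adjustments using the SageMaker Python SDK; `role` and `session` are placeholders for your execution role and SageMaker AI session, and the values shown are illustrative:

```
from sagemaker import clarify

# role and session are placeholders for your execution role and SageMaker
# AI session. instance_count greater than 1 enables parallel processing;
# a smaller num_samples shrinks the synthetic dataset sent to the model.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
shap_config = clarify.SHAPConfig(num_samples=50, agg_method="mean_abs")
```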

## Processing job finishes without results and you get a CloudWatch warning message
<a name="clarify-troubleshooting-no-results-and-warning"></a>

If the processing job finishes but no results are found, the CloudWatch logs produce a warning message that says `Signal 15 received, cleaning up`. This warning indicates that the job was stopped, either because a customer request called the `StopProcessingJob` API or because the job ran out of the allotted time for its completion. In the latter case, check the maximum runtime in the job configuration (`max_runtime_in_seconds`) and increase it as needed.

## Error message for invalid analysis configuration
<a name="clarify-troubleshooting-invalid-analysis-configuration"></a>
+ If you get the error message `Unable to load analysis configuration as JSON`, the analysis configuration input file for the processing job does not contain a valid JSON object. Check the validity of the JSON object using a JSON linter.
+ If you get the error message `Analysis configuration schema validation error`, the analysis configuration input file for the processing job contains unknown fields or invalid types for some field values. Review the configuration parameters in the file and cross-check them with the parameters listed in the analysis configuration file. For more information, see [Analysis Configuration Files](clarify-processing-job-configure-analysis.md).
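
Before you start a job, you can run a quick local check of the configuration file. The sketch below assumes the file is named `analysis_config.json`:

```
import json

# json.load raises a descriptive error if the file is not valid JSON.
with open("analysis_config.json") as f:
    config = json.load(f)
print("Valid JSON with top-level keys:", sorted(config))
```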

## Bias metric computation fails for several or all metrics
<a name="clarify-troubleshooting-bias-metric-computation-fails"></a>

If you receive one of the following error messages: `No Label values are present in the predicted Label Column`, `Positive Predicted Index Series contains all False values`, or `Predicted Label Column series data type is not the same as Label Column series`, try the following:
+ Check that the correct dataset is being used.
+ Check whether the dataset size is too small, for example, whether it contains only a few rows. This may cause the model outputs to have the same value or the data type to be inferred incorrectly.
+ Check if the label or facet is treated as continuous or categorical. SageMaker Clarify uses heuristics to determine the [data type](https://github.com/aws/amazon-sagemaker-clarify/blob/master/src/smclarify/bias/metrics/common.py#L114). For post-training bias metrics, the data type returned by the model may not match what is in the dataset, or SageMaker Clarify may not be able to transform it correctly. 
  + In the bias report, you should see a single value for categorical columns or an interval for continuous columns.
  + For example, if a column has values 0.0 and 1.0 as floats, it is treated as continuous even if it has only a few unique values.

## Mismatch between analysis config and dataset/model input/output
<a name="clarify-troubleshooting-mismatch-analysis-config-and-data-model"></a>
+ Check that the baseline format in the analysis config is the same as the dataset format.
+ If you receive the error message `Could not convert string to float`, check that the format is correctly specified. It could also indicate that the model predictions have a different format than the label column, or that the configuration for the label or probabilities is incorrect.
+ If you receive one of the error messages `Unable to locate the facet`, `Headers must contain label`, `Headers in config do not match with the number of columns in the dataset`, or `Feature names not found`, check that the headers match the columns.
+ If you receive the error message `Data must contain features`, check the content template for JSON Lines and compare it with the dataset sample, if available.

## Model returns 500 Internal Server Error or container falls back to per-record predictions due to model error
<a name="clarify-troubleshooting-500-internal-server-error"></a>

If you receive the error message `Fallback to per-record prediction because of model error`, this could indicate that the model cannot handle the batch size, is being throttled, or does not accept the input passed by the container due to serialization problems. Review the CloudWatch logs for the SageMaker AI endpoint and look for error messages or tracebacks. For model throttling cases, it may help to use a different instance type or to increase the number of instances for the endpoint.

## Execution role is invalid
<a name="clarify-troubleshooting-execution-role-invalid"></a>

This indicates that the role provided is incorrect or missing required permissions. Check the role and its permissions that were used to configure the processing job and verify the permission and trust policy for the role.

## Failed to download data
<a name="clarify-troubleshooting-data-download"></a>

This indicates that job inputs could not be downloaded for the job to start. Check the bucket name and permissions for the dataset and the configuration inputs.

## Could not connect to SageMaker AI
<a name="clarify-troubleshooting-connection"></a>

This indicates that the job could not reach SageMaker AI service endpoints. Check the network configuration settings for the processing job and verify virtual private cloud (VPC) configuration.

## Sample notebooks
<a name="clarify-fairness-and-explainability-sample-notebooks"></a>

The following sections contain notebooks to help you get started using SageMaker Clarify, including using it for special tasks, inside a distributed job, and for computer vision.

### Getting started
<a name="clarify-fairness-and-explainability-sample-notebooks-getting-started"></a>

The following sample notebooks show how to use SageMaker Clarify to get started with explainability and model bias tasks. These tasks include creating a processing job, training a machine learning (ML) model, and monitoring model predictions:
+ [Explainability and bias detection with Amazon SageMaker Clarify](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.html) – Use SageMaker Clarify to create a processing job to detect bias and explain model predictions.
+ [Monitoring bias drift and feature attribution drift with Amazon SageMaker Clarify](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/fairness_and_explainability/SageMaker-Model-Monitor-Fairness-and-Explainability.html) – Use Amazon SageMaker Model Monitor to monitor bias drift and feature attribution drift over time.
+ How to [read a dataset in JSON Lines format](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_jsonlines_format.html) into a SageMaker Clarify processing job.
+ [Mitigate Bias, train another unbiased model, and put it in the model registry](https://github.com/aws/amazon-sagemaker-examples/blob/master/end_to_end/fraud_detection/3-mitigate-bias-train-model2-registry-e2e.ipynb) – Use [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/pdf/1106.1813.pdf) and SageMaker Clarify to mitigate bias, train another model, then put the new model into the model registry. This sample notebook also shows how to place the new model artifacts, including data, code and model metadata, into the model registry. This notebook is part of a series that shows how to integrate SageMaker Clarify into a SageMaker AI pipeline that is described in the [Architect and build the full machine learning lifecycle with AWS](https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/) blog post.

### Special cases
<a name="clarify-post-training-bias-model-explainability-sample-notebooks"></a>

The following notebooks show you how to use SageMaker Clarify for special cases, including inside your own container and for natural language processing tasks:
+ [Fairness and Explainability with SageMaker Clarify (Bring Your Own Container)](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_byoc.ipynb) – Build your own model and container that can integrate with SageMaker Clarify to measure bias and generate an explainability analysis report. This sample notebook also introduces key terms and shows you how to access the report through SageMaker Studio Classic.
+ [Fairness and Explainability with SageMaker Clarify Spark Distributed Processing](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_spark.ipynb) – Use distributed processing to run a SageMaker Clarify job that measures the pre-training bias of a dataset and the post-training bias of a model. This sample notebook also shows you how to obtain an explanation for the importance of the input features on the model output, and access the explainability analysis report through SageMaker Studio Classic.
+ [Explainability with SageMaker Clarify - Partial Dependence Plots (PDP)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/explainability_with_pdp.html) – Use SageMaker Clarify to generate PDPs and access a model explainability report.
+  [Explaining text sentiment analysis using SageMaker Clarify Natural language processing (NLP) explainability](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/text_explainability/text_explainability.html) – Use SageMaker Clarify for text sentiment analysis.
+ Use computer vision (CV) explainability for [image classification](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/computer_vision/image_classification/explainability_image_classification.html) and [object detection](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/computer_vision/object_detection/object_detection_clarify.html).

These notebooks have been verified to run in Amazon SageMaker Studio Classic. If you need instructions on how to open a notebook in Studio Classic, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md). If you're prompted to choose a kernel, choose **Python 3 (Data Science)**.

# Pre-training Data Bias
<a name="clarify-detect-data-bias"></a>

Algorithmic bias, discrimination, fairness, and related topics have been studied across disciplines such as law, policy, and computer science. A computer system might be considered biased if it discriminates against certain individuals or groups of individuals. The machine learning models powering these applications learn from data, and this data could reflect disparities or other inherent biases. For example, the training data may not have sufficient representation of various demographic groups or may contain biased labels. Machine learning models trained on datasets that exhibit these biases could end up learning them, and then reproducing or even exacerbating those biases in their predictions. The field of machine learning provides an opportunity to address biases by detecting and measuring them at each stage of the ML lifecycle. You can use Amazon SageMaker Clarify to determine whether data used for training models encodes any bias.

Bias can be measured before training and after training, and monitored against baselines after deploying models to endpoints for inference. Pre-training bias metrics are designed to detect and measure bias in the raw data before it is used to train a model. The metrics used are model-agnostic because they do not depend on any model outputs. However, there are different concepts of fairness that require distinct measures of bias. Amazon SageMaker Clarify provides bias metrics to quantify various fairness criteria.

For additional information about bias metrics, see [Learn How Amazon SageMaker Clarify Helps Detect Bias](https://aws.amazon.com/blogs/machine-learning/learn-how-amazon-sagemaker-clarify-helps-detect-bias) and [Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf).

## Amazon SageMaker Clarify Terms for Bias and Fairness
<a name="clarify-bias-and-fairness-terms"></a>

SageMaker Clarify uses the following terminology to discuss bias and fairness.

**Feature**  
An individual measurable property or characteristic of a phenomenon being observed, contained in a column for tabular data.

**Label**  
The feature that is the target for training a machine learning model. Referred to as the *observed label* or *observed outcome*.

**Predicted label**  
The label as predicted by the model. Also referred to as the *predicted outcome*.

**Sample**  
An observed entity described by feature values and label value, contained in a row for tabular data.

**Dataset**  
A collection of samples.

**Bias**  
An imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people.

**Bias metric**  
A function that returns numerical values indicating the level of a potential bias.

**Bias report**  
A collection of bias metrics for a given dataset, or a combination of a dataset and a model.

**Positive label values**  
Label values that are favorable to a demographic group observed in a sample. In other words, they designate a sample as having a *positive result*.

**Negative label values**  
Label values that are unfavorable to a demographic group observed in a sample. In other words, they designate a sample as having a *negative result*.

**Group variable**  
Categorical column of the dataset that is used to form subgroups for the measurement of Conditional Demographic Disparity (CDD). Required only for this metric, which conditions on subgroups to account for Simpson’s paradox.

**Facet**  
A column or feature that contains the attributes with respect to which bias is measured.

**Facet value**  
The feature values of attributes that bias might favor or disfavor.

**Predicted probability**  
The probability, as predicted by the model, of a sample having a positive or negative outcome.

## Sample Notebooks
<a name="clarify-data-bias-sample-notebooks"></a>

Amazon SageMaker Clarify provides the following sample notebook for bias detection:
+ [Explainability and bias detection with Amazon SageMaker Clarify](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.html) – Use SageMaker Clarify to create a processing job for detecting bias and explaining model predictions with feature attributions.

This notebook has been verified to run in Amazon SageMaker Studio only. If you need instructions on how to open a notebook in Amazon SageMaker Studio, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md). If you're prompted to choose a kernel, choose **Python 3 (Data Science)**. 

**Topics**
+ [Amazon SageMaker Clarify Terms for Bias and Fairness](#clarify-bias-and-fairness-terms)
+ [Sample Notebooks](#clarify-data-bias-sample-notebooks)
+ [Pre-training Bias Metrics](clarify-measure-data-bias.md)
+ [Generate Reports for Bias in Pre-training Data in SageMaker Studio](clarify-data-bias-reports-ui.md)

# Pre-training Bias Metrics
<a name="clarify-measure-data-bias"></a>

Measuring bias in ML models is a first step to mitigating bias. Each measure of bias corresponds to a different notion of fairness. Even considering simple concepts of fairness leads to many different measures applicable in various contexts. For example, consider fairness with respect to age, and, for simplicity, that the middle-aged group and all other age groups are the two relevant demographics, referred to as *facets*. In the case of an ML model for lending, we may want small business loans to be issued to equal numbers of both demographics. Or, when processing job applicants, we may want to see equal numbers of members of each demographic hired. However, this approach may assume that equal numbers of both age groups apply to these jobs, so we may want to condition on the number that apply. Further, we may want to consider not whether equal numbers apply, but whether we have equal numbers of qualified applicants. Or, we may consider fairness to be an equal acceptance rate of qualified applicants across both age demographics, or an equal rejection rate of applicants, or both. You might use datasets with different proportions of data on the attributes of interest. This imbalance can confound the bias measure you choose. The models might be more accurate in classifying one facet than the other. Thus, you need to choose bias metrics that are conceptually appropriate for the application and the situation.

We use the following notation to discuss the bias metrics. The conceptual model described here is for binary classification, where events are labeled as having only two possible outcomes in their sample space, referred to as positive (with value 1) and negative (with value 0). This framework is usually extensible to multicategory classification in a straightforward way, or to cases involving continuous valued outcomes when needed. In the binary classification case, positive and negative labels are assigned to outcomes recorded in a raw dataset for a favored facet *a* and for a disfavored facet *d*. These labels y are referred to as *observed labels* to distinguish them from the *predicted labels* y' that are assigned by a machine learning model during the training or inference stages of the ML lifecycle. These labels are used to define probability distributions Pa(y) and Pd(y) for their respective facet outcomes.
+ labels: 
  + y represents the n observed labels for event outcomes in a training dataset.
  + y' represents the predicted labels for the n observed labels in the dataset by a trained model.
+ outcomes:
  + A positive outcome (with value 1) for a sample, such as an application acceptance.
    + n(1) is the number of observed labels for positive outcomes (acceptances).
    + n'(1) is the number of predicted labels for positive outcomes (acceptances).
  + A negative outcome (with value 0) for a sample, such as an application rejection.
    + n(0) is the number of observed labels for negative outcomes (rejections).
    + n'(0) is the number of predicted labels for negative outcomes (rejections).
+ facet values:
  + facet *a* – The feature value that defines a demographic that bias favors.
    + na is the number of observed labels for the favored facet value: na = na(1) + na(0), the sum of the positive and negative observed labels for facet value *a*.
    + n'a is the number of predicted labels for the favored facet value: n'a = n'a(1) + n'a(0), the sum of the positive and negative predicted outcome labels for facet value *a*. Note that n'a = na.
  + facet *d* – The feature value that defines a demographic that bias disfavors.
    + nd is the number of observed labels for the disfavored facet value: nd = nd(1) + nd(0), the sum of the positive and negative observed labels for facet value *d*.
    + n'd is the number of predicted labels for the disfavored facet value: n'd = n'd(1) + n'd(0), the sum of the positive and negative predicted labels for facet value *d*. Note that n'd = nd.
+ probability distributions for the labeled facet data outcomes:
  + Pa(y) is the probability distribution of the observed labels for facet *a*. For binary labeled data, this distribution is given by the ratio of the number of samples in facet *a* labeled with positive outcomes to the total number, Pa(y1) = na(1)/na, and the ratio of the number of samples with negative outcomes to the total number, Pa(y0) = na(0)/na.
  + Pd(y) is the probability distribution of the observed labels for facet *d*. For binary labeled data, this distribution is given by the ratio of the number of samples in facet *d* labeled with positive outcomes to the total number, Pd(y1) = nd(1)/nd, and the ratio of the number of samples with negative outcomes to the total number, Pd(y0) = nd(0)/nd.

Models trained on data biased by demographic disparities might learn and even exacerbate them. To identify bias in the data before expending resources to train models on it, SageMaker Clarify provides data bias metrics that you can compute on raw datasets before training. All of the pretraining metrics are model-agnostic because they do not depend on model outputs and so are valid for any model. The first bias metric examines facet imbalance, but not outcomes. It determines the extent to which the amount of training data is representative across different facets, as desired for the application. The remaining bias metrics compare the distribution of outcome labels in various ways for facets *a* and *d* in the data. The metrics that range over negative values can detect negative bias. The following table contains a cheat sheet for quick guidance and links to the pretraining bias metrics.

Pre-training Bias Metrics


| Bias metric | Description | Example question | Interpreting metric values | 
| --- | --- | --- | --- | 
| [Class Imbalance (CI)](clarify-bias-metric-class-imbalance.md) | Measures the imbalance in the number of members between different facet values. |  Could there be age-based biases due to not having enough data for the demographic outside a middle-aged facet?   |  Normalized range: [-1, +1] Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Difference in Proportions of Labels (DPL)](clarify-data-bias-metric-true-label-imbalance.md) | Measures the imbalance of positive outcomes between different facet values. | Could there be age-based biases in ML predictions due to biased labeling of facet values in the data? |  Range for normalized binary & multicategory facet labels: [-1, +1] Range for continuous labels: (-∞, +∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Kullback-Leibler Divergence (KL)](clarify-data-bias-metric-kl-divergence.md) | Measures how much the outcome distributions of different facets diverge from each other entropically.  | How different are the distributions for loan application outcomes for different demographic groups? |  Range for binary, multicategory, continuous: [0, +∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Jensen-Shannon Divergence (JS)](clarify-data-bias-metric-jensen-shannon-divergence.md)  | Measures how much the outcome distributions of different facets diverge from each other entropically.  | How different are the distributions for loan application outcomes for different demographic groups? |  Range for binary, multicategory, continuous: [0, +∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Lp-norm (LP)](clarify-data-bias-metric-lp-norm.md)  | Measures a p-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset. | How different are the distributions for loan application outcomes for different demographics? |  Range for binary, multicategory, continuous: [0, +∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Total Variation Distance (TVD)](clarify-data-bias-metric-total-variation-distance.md)  | Measures half of the L1-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset. | How different are the distributions for loan application outcomes for different demographics? |  Range for binary, multicategory, and continuous outcomes: [0, +∞) [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 
| [Kolmogorov-Smirnov (KS)](clarify-data-bias-metric-kolmogorov-smirnov.md)  | Measures maximum divergence between outcomes in distributions for different facets in a dataset. | Which college application outcomes manifest the greatest disparities by demographic group? | Range of KS values for binary, multicategory, and continuous outcomes: [0, +1] [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html) | 
| [Conditional Demographic Disparity (CDD)](clarify-data-bias-metric-cddl.md)  | Measures the disparity of outcomes between different facets as a whole, but also by subgroups. | Do some groups have a larger proportion of rejections for college admission outcomes than their proportion of acceptances? |  Range of CDD: [-1, +1] [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)  | 

For additional information about bias metrics, see [Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf).

**Topics**
+ [Class Imbalance (CI)](clarify-bias-metric-class-imbalance.md)
+ [Difference in Proportions of Labels (DPL)](clarify-data-bias-metric-true-label-imbalance.md)
+ [Kullback-Leibler Divergence (KL)](clarify-data-bias-metric-kl-divergence.md)
+ [Jensen-Shannon Divergence (JS)](clarify-data-bias-metric-jensen-shannon-divergence.md)
+ [Lp-norm (LP)](clarify-data-bias-metric-lp-norm.md)
+ [Total Variation Distance (TVD)](clarify-data-bias-metric-total-variation-distance.md)
+ [Kolmogorov-Smirnov (KS)](clarify-data-bias-metric-kolmogorov-smirnov.md)
+ [Conditional Demographic Disparity (CDD)](clarify-data-bias-metric-cddl.md)

# Class Imbalance (CI)
<a name="clarify-bias-metric-class-imbalance"></a>

Class imbalance (CI) bias occurs when a facet value *d* has fewer training samples when compared with another facet *a* in the dataset. Models preferentially fit the larger facets at the expense of the smaller facets, which can result in a higher training error for facet *d*. Models are also at higher risk of overfitting smaller datasets, which can cause a larger test error for facet *d*. For example, if a machine learning model is trained primarily on data from middle-aged individuals (facet *a*), it might be less accurate when making predictions involving younger and older people (facet *d*).

The formula for the (normalized) facet imbalance measure:

        CI = (na - nd)/(na + nd)

Where na is the number of members of facet *a* and nd the number for facet *d*. Its values range over the interval [-1, 1].
+ Positive CI values indicate that facet *a* has more training samples in the dataset, and a value of 1 indicates that the data only contains members of facet *a*.
+ Values of CI near zero indicate a more equal distribution of members between facets, and a value of zero indicates a perfectly equal partition between facets and represents a balanced distribution of samples in the training data.
+ Negative CI values indicate that facet *d* has more training samples in the dataset, and a value of -1 indicates that the data only contains members of facet *d*.
+ CI values near either of the extreme values of -1 or 1 are very imbalanced and are at a substantial risk of producing biased predictions.

If a significant facet imbalance is found to exist among the facets, you might want to rebalance the sample before proceeding to train models on it.
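
To make the formula concrete, the following is a minimal sketch that computes CI for a hypothetical facet column with pandas; the data is illustrative:

```
import pandas as pd

# CI = (na - nd)/(na + nd) for a single facet column.
def class_imbalance(facet: pd.Series, favored_value) -> float:
    na = (facet == favored_value).sum()
    nd = (facet != favored_value).sum()
    return (na - nd) / (na + nd)

ages = pd.Series(["middle", "middle", "middle", "young", "old"])
print(class_imbalance(ages, "middle"))  # 0.2: facet a is over-represented
```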

# Difference in Proportions of Labels (DPL)
<a name="clarify-data-bias-metric-true-label-imbalance"></a>

The difference in proportions of labels (DPL) compares the proportion of observed outcomes with positive labels for facet *d* with the proportion of observed outcomes with positive labels of facet *a* in a training dataset. For example, you could use it to compare the proportion of middle-aged individuals (facet *a*) and other age groups (facet *d*) approved for financial loans. Machine learning models try to mimic the training data decisions as closely as possible. So a machine learning model trained on a dataset with a high DPL is likely to reflect the same imbalance in its future predictions.

The formula for the difference in proportions of labels is as follows:

        DPL = (qa - qd)

Where:
+ qa = na(1)/na is the proportion of facet *a* members who have an observed label value of 1. For example, the proportion of a middle-aged demographic who get approved for loans. Here na(1) represents the number of members of facet *a* who get a positive outcome and na is the number of members of facet *a*.
+ qd = nd(1)/nd is the proportion of facet *d* members who have an observed label value of 1. For example, the proportion of people outside the middle-aged demographic who get approved for loans. Here nd(1) represents the number of members of facet *d* who get a positive outcome and nd is the number of members of facet *d*.

If DPL is close enough to 0, then we say that *demographic parity* has been achieved.

For binary and multicategory facet labels, the DPL values range over the interval [-1, 1]. For continuous labels, we set a threshold to collapse the labels to binary.
+ Positive DPL values indicate that facet *a* has a higher proportion of positive outcomes when compared with facet *d*.
+ Values of DPL near zero indicate a more equal proportion of positive outcomes between facets, and a value of zero indicates perfect demographic parity.
+ Negative DPL values indicate that facet *d* has a higher proportion of positive outcomes when compared with facet *a*.

Whether or not a high magnitude of DPL is problematic varies from one situation to another. In a problematic case, a high-magnitude DPL might be a signal of underlying issues in the data. For example, a dataset with high DPL might reflect historical biases or prejudices against age-based demographic groups that would be undesirable for a model to learn.
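
As a worked illustration, the following sketch computes DPL for a hypothetical binary-labeled dataset with pandas; the data is illustrative:

```
import pandas as pd

# DPL = qa - qd, the gap in positive-label proportions between facets.
def dpl(labels: pd.Series, facet: pd.Series, favored_value) -> float:
    qa = labels[facet == favored_value].mean()
    qd = labels[facet != favored_value].mean()
    return qa - qd

labels = pd.Series([1, 1, 0, 1, 0, 0])
facet = pd.Series(["a", "a", "a", "d", "d", "d"])
print(dpl(labels, facet, "a"))  # 2/3 - 1/3 ≈ 0.33
```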

# Kullback-Leibler Divergence (KL)
<a name="clarify-data-bias-metric-kl-divergence"></a>

The Kullback-Leibler divergence (KL) measures how much the observed label distribution of facet *a*, Pa(y), diverges from the distribution of facet *d*, Pd(y). It is also known as the relative entropy of Pa(y) with respect to Pd(y) and quantifies the amount of information lost when moving from Pa(y) to Pd(y).

The formula for the Kullback-Leibler divergence is as follows: 

        KL(Pa || Pd) = ∑y Pa(y)*log[Pa(y)/Pd(y)]

It is the expectation of the logarithmic difference between the probabilities Pa(y) and Pd(y), where the expectation is weighted by the probabilities Pa(y). This is not a true distance between the distributions as it is asymmetric and does not satisfy the triangle inequality. The implementation uses natural logarithms, giving KL in units of nats. Using different logarithmic bases gives proportional results but in different units. For example, using base 2 gives KL in units of bits.

For example, assume that a group of applicants for loans have a 30% approval rate (facet *d*) and that the approval rate for other applicants (facet *a*) is 80%. The Kullback-Leibler formula gives you the label distribution divergence of facet *a* from facet *d* as follows:

        KL = 0.8*ln(0.8/0.3) + 0.2*ln(0.2/0.7) = 0.53

There are two terms in the formula here because labels are binary in this example. This measure can be applied to multiple labels in addition to binary ones. For example, in a college admissions scenario, assume an applicant may be assigned one of three category labels: yi = {y0, y1, y2} = {rejected, waitlisted, accepted}.

The range of values for the KL metric for binary, multicategory, and continuous outcomes is [0, +∞).
+ Values near zero mean the outcomes are similarly distributed for the different facets.
+ Positive values mean the label distributions diverge; the more positive the value, the larger the divergence.
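
The following sketch reproduces the worked loan-approval example with NumPy; the distributions are taken from the example above:

```
import numpy as np

# KL(Pa || Pd) = sum over y of Pa(y)*ln(Pa(y)/Pd(y)), in nats.
def kl_divergence(pa, pd_):
    pa, pd_ = np.asarray(pa), np.asarray(pd_)
    return float(np.sum(pa * np.log(pa / pd_)))

pa = [0.8, 0.2]   # facet a: 80% approved, 20% rejected
pd_ = [0.3, 0.7]  # facet d: 30% approved, 70% rejected
print(kl_divergence(pa, pd_))  # ~0.53, matching the worked example
```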

# Jensen-Shannon Divergence (JS)
<a name="clarify-data-bias-metric-jensen-shannon-divergence"></a>

The Jensen-Shannon divergence (JS) measures how much the label distributions of different facets diverge from each other entropically. It is based on the Kullback-Leibler divergence, but it is symmetric. 

The formula for the Jensen-Shannon divergence is as follows:

        JS = ½*[KL(Pa || P) + KL(Pd || P)]

Where P = ½(Pa + Pd), the average label distribution across facets *a* and *d*.

The range of JS values for binary, multicategory, and continuous outcomes is [0, ln(2)).
+ Values near zero mean the labels are similarly distributed.
+ Positive values mean the label distributions diverge; the more positive the value, the larger the divergence.

This metric indicates whether there is a big divergence in one of the labels across facets. 
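
The following sketch computes JS for the same hypothetical loan-approval distributions used in the KL example:

```
import numpy as np

# JS = ½*[KL(Pa || P) + KL(Pd || P)] with P = ½(Pa + Pd).
def js_divergence(pa, pd_):
    pa, pd_ = np.asarray(pa), np.asarray(pd_)
    p = (pa + pd_) / 2
    kl = lambda x, y: np.sum(x * np.log(x / y))
    return float((kl(pa, p) + kl(pd_, p)) / 2)

print(js_divergence([0.8, 0.2], [0.3, 0.7]))  # bounded above by ln(2) ≈ 0.693
```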

# Lp-norm (LP)
<a name="clarify-data-bias-metric-lp-norm"></a>

The Lp-norm (LP) measures the p-norm distance between the facet distributions of the observed labels in a training dataset. This metric is non-negative and so cannot detect reverse bias. 

The formula for the Lp-norm is as follows: 

        Lp(Pa, Pd) = (∑y |Pa(y) - Pd(y)|^p)^(1/p)

Where the p-norm distance between the points x and y is defined as follows:

        Lp(x, y) = (|x1-y1|^p + |x2-y2|^p + … + |xn-yn|^p)^(1/p)

The 2-norm is the Euclidean norm. Assume you have an outcome distribution with three categories, for example, yi = {y0, y1, y2} = {accepted, waitlisted, rejected} in a college admissions multicategory scenario. You take the sum of the squares of the differences between the outcome counts for facets *a* and *d*. The resulting Euclidean distance is calculated as follows:

        L2(Pa, Pd) = [(na(0) - nd(0))^2 + (na(1) - nd(1))^2 + (na(2) - nd(2))^2]^(1/2)

Where: 
+ na(i) is the number of the ith category outcomes in facet *a*: for example, na(0) is the number of facet *a* acceptances.
+ nd(i) is the number of the ith category outcomes in facet *d*: for example, nd(2) is the number of facet *d* rejections.

The range of LP values for binary, multicategory, and continuous outcomes is [0, √2), where:
+ Values near zero mean the labels are similarly distributed.
+ Positive values mean the label distributions diverge; the more positive the value, the larger the divergence.
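
The following sketch computes the 2-norm distance for hypothetical three-category outcome distributions with NumPy; the proportions are illustrative:

```
import numpy as np

# Lp(Pa, Pd) = (sum over y of |Pa(y) - Pd(y)|^p)^(1/p); p=2 is Euclidean.
def lp_norm(pa, pd_, p=2):
    pa, pd_ = np.asarray(pa), np.asarray(pd_)
    return float(np.sum(np.abs(pa - pd_) ** p) ** (1 / p))

pa = [0.2, 0.4, 0.4]   # facet a: accepted, waitlisted, rejected
pd_ = [0.7, 0.1, 0.2]  # facet d
print(lp_norm(pa, pd_))
```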

# Total Variation Distance (TVD)
<a name="clarify-data-bias-metric-total-variation-distance"></a>

The total variation distance data bias metric (TVD) is half the L1-norm. The TVD is the largest possible difference between the probability distributions for label outcomes of facets *a* and *d*. The L1-norm is the Hamming distance, a metric used to compare two binary data strings by determining the minimum number of substitutions required to change one string into the other. If the strings were copies of each other, the distance determines the number of errors that occurred when copying. In the bias detection context, TVD quantifies how many outcomes in facet *a* would have to be changed to match the outcomes in facet *d*.

The formula for the total variation distance is as follows: 

        TVD = ½*L1(Pa, Pd)

For example, assume you have an outcome distribution with three categories, yi = {y0, y1, y2} = {accepted, waitlisted, rejected}, in a college admissions multicategory scenario. You take the differences between the counts of facets *a* and *d* for each outcome to calculate TVD. The result is as follows:

        L1(Pa, Pd) = |na(0) - nd(0)| + |na(1) - nd(1)| + |na(2) - nd(2)|

Where: 
+ na(i) is the number of the ith category outcomes in facet *a*: for example, na(0) is the number of facet *a* acceptances.
+ nd(i) is the number of the ith category outcomes in facet *d*: for example, nd(2) is the number of facet *d* rejections.

The range of TVD values for binary, multicategory, and continuous outcomes is [0, 1), where:
+ Values near zero mean the labels are similarly distributed.
+ Positive values mean the label distributions diverge; the more positive the value, the larger the divergence.
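
The following sketch computes TVD for the same hypothetical three-category distributions used in the LP example:

```
import numpy as np

# TVD = ½*L1(Pa, Pd), half the sum of absolute differences.
def total_variation_distance(pa, pd_):
    pa, pd_ = np.asarray(pa), np.asarray(pd_)
    return float(np.sum(np.abs(pa - pd_)) / 2)

print(total_variation_distance([0.2, 0.4, 0.4], [0.7, 0.1, 0.2]))  # 0.5
```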

# Kolmogorov-Smirnov (KS)
<a name="clarify-data-bias-metric-kolmogorov-smirnov"></a>

The Kolmogorov-Smirnov bias metric (KS) is equal to the maximum divergence between labels in the distributions for facets *a* and *d* of a dataset. The two-sample KS test implemented by SageMaker Clarify complements the other measures of label imbalance by finding the most imbalanced label. 

The formula for the Kolmogorov-Smirnov metric is as follows: 

        KS = max(|Pa(y) - Pd(y)|)

For example, assume a group of applicants (facet *a*) to college are rejected, waitlisted, or accepted at 40%, 40%, 20% respectively and that these rates for other applicants (facet *d*) are 20%, 10%, 70%. Then the Kolmogorov-Smirnov bias metric value is as follows:

        KS = max(|0.4-0.2|, |0.4-0.1|, |0.2-0.7|) = 0.5

This tells us the maximum divergence between facet distributions is 0.5 and occurs in the acceptance rates. There are three terms in the equation because the labels are multiclass with a cardinality of three.

The range of KS values for binary, multicategory, and continuous outcomes is [0, +1], where:
+ Values near zero indicate the labels were evenly distributed between facets in all outcome categories. For example, both facets applying for a loan got 50% of the acceptances and 50% of the rejections.
+ Values near one indicate the labels for one outcome were all in one facet. For example, facet *a* got 100% of the acceptances and facet *d* got none.
+ Intermediate values indicate relative degrees of maximum label imbalance.
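
The following Python sketch reproduces the worked example above:

```python
def ks_metric(p_a, p_d):
    """KS = max(|Pa(y) - Pd(y)|) over all label values y."""
    return max(abs(pa - pd) for pa, pd in zip(p_a, p_d))

# Rates from the example above: (rejected, waitlisted, accepted).
print(ks_metric([0.4, 0.4, 0.2], [0.2, 0.1, 0.7]))  # 0.5
```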

# Conditional Demographic Disparity (CDD)
<a name="clarify-data-bias-metric-cddl"></a>

The demographic disparity metric (DD) determines whether a facet has a larger proportion of the rejected outcomes in the dataset than of the accepted outcomes. In the binary case where there are two facets, men and women for example, that constitute the dataset, the disfavored one is labelled facet *d* and the favored one is labelled facet *a*. For example, in the case of college admissions, if women applicants comprised 46% of the rejected applicants and only 32% of the accepted applicants, we say that there is *demographic disparity* because the rate at which women were rejected exceeds the rate at which they are accepted. Women applicants are labelled facet *d* in this case. If men applicants comprised 54% of the rejected applicants and 68% of the accepted applicants, then there is not a demographic disparity for this facet, as the rate of rejection is less than the rate of acceptance. Men applicants are labelled facet *a* in this case. 

The formula for the demographic disparity for the less favored facet *d* is as follows: 

        DDd = nd(0)/n(0) - nd(1)/n(1) = PdR(y0) - PdA(y1) 

Where: 
+ n(0) = na(0) + nd(0) is the total number of rejected outcomes in the dataset for the favored facet *a* and disadvantaged facet *d*.
+ n(1) = na(1) + nd(1) is the total number of accepted outcomes in the dataset for the favored facet *a* and disadvantaged facet *d*.
+ PdR(y0) is the proportion of rejected outcomes (with value 0) in facet *d*.
+ PdA(y1) is the proportion of accepted outcomes (value 1) in facet *d*.

For the college admission example, the demographic disparity for women is DDd = 0.46 - 0.32 = 0.14. For men, DDa = 0.54 - 0.68 = -0.14.

A conditional demographic disparity (CDD) metric that conditions DD on attributes that define strata of subgroups in the dataset is needed to rule out Simpson's paradox. The regrouping can provide insights into the cause of apparent demographic disparities for less favored facets. The classic example arose in the Berkeley admissions case, where men were accepted at a higher rate overall than women. The statistics for this case were used in the example calculations of DD. However, when departmental subgroups were examined, women were shown to have higher admission rates than men when conditioned by department. The explanation was that women had applied to departments with lower acceptance rates than men had. Examining the subgrouped acceptance rates revealed that women were actually accepted at a higher rate than men in the departments with lower acceptance rates.

The CDD metric gives a single measure for all of the disparities found in the subgroups defined by an attribute of a dataset by averaging them. It is defined as the weighted average of demographic disparities (DDi) for each of the subgroups, with each subgroup disparity weighted in proportion to the number of observations it contains. The formula for the conditional demographic disparity is as follows:

        CDD = (1/n) * ∑i ni * DDi

Where: 
+ ∑i ni = n is the total number of observations and ni is the number of observations for each subgroup.
+ DDi = ni(0)/n(0) - ni(1)/n(1) = PiR(y0) - PiA(y1) is the demographic disparity for the ith subgroup.

The demographic disparity for a subgroup (DDi) is the difference between the proportion of rejected outcomes and the proportion of accepted outcomes for that subgroup.

The range of DD values for binary outcomes for the full dataset DDd or for its conditionalized subgroups DDi is [-1, +1]. 
+ +1: when there are no rejections in facet *a* or subgroup and no acceptances in facet *d* or subgroup.
+ Positive values indicate there is a demographic disparity, as facet *d* or subgroup has a greater proportion of the rejected outcomes in the dataset than of the accepted outcomes. The higher the value, the less favored the facet and the greater the disparity.
+ Negative values indicate there is not a demographic disparity, as facet *d* or subgroup has a larger proportion of the accepted outcomes in the dataset than of the rejected outcomes. The lower the value, the more favored the facet.
+ -1: when there are no rejections in facet *d* or subgroup and no acceptances in facet *a* or subgroup.

If you don't condition on anything then CDD is zero if and only if DPL is zero.

This metric is useful for exploring the concepts of direct and indirect discrimination and of objective justification in EU and UK non-discrimination law and jurisprudence. For additional information, see [Why Fairness Cannot Be Automated](https://arxiv.org/abs/2005.05906). This paper also contains the relevant data and analysis of the Berkeley admissions case that shows how conditionalizing on departmental admission rate subgroups illustrates Simpson's paradox.
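
The following Python sketch computes DD for the college admissions figures above, then CDD over hypothetical department subgroups; the subgroup counts are illustrative only.

```python
def demographic_disparity(d_rejected, all_rejected, d_accepted, all_accepted):
    """DD = PdR(y0) - PdA(y1) for the full dataset or one subgroup."""
    return d_rejected / all_rejected - d_accepted / all_accepted

def cdd(subgroups):
    """CDD: the size-weighted average of the per-subgroup disparities DDi."""
    n = sum(g["n"] for g in subgroups)
    return sum(
        g["n"] * demographic_disparity(g["d_rej"], g["rej"], g["d_acc"], g["acc"])
        for g in subgroups
    ) / n

# Full-dataset DD for the college admissions example: 0.46 - 0.32 = 0.14.
print(demographic_disparity(46, 100, 32, 100))

# CDD over hypothetical department subgroups (all counts illustrative).
departments = [
    {"n": 100, "d_rej": 30, "rej": 60, "d_acc": 25, "acc": 40},
    {"n": 80, "d_rej": 40, "rej": 50, "d_acc": 20, "acc": 30},
]
print(cdd(departments))
```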

# Generate Reports for Bias in Pre-training Data in SageMaker Studio
<a name="clarify-data-bias-reports-ui"></a>

SageMaker Clarify is integrated with Amazon SageMaker Data Wrangler, which can help you identify bias during data preparation without having to write your own code. Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data with Amazon SageMaker Studio. For an overview of the Data Wrangler data prep workflow, see [Prepare ML Data with Amazon SageMaker Data Wrangler](data-wrangler.md).

You specify attributes of interest, such as gender or age, and SageMaker Clarify runs a set of algorithms to detect the presence of bias in those attributes. After the algorithm runs, SageMaker Clarify provides a visual report with a description of the sources and severity of possible bias so that you can plan steps to mitigate. For example, in a financial dataset that contains few examples of business loans to one age group as compared to others, SageMaker AI flags the imbalance so that you can avoid a model that disfavors that age group.

**To analyze and report on data bias**

To get started with Data Wrangler, see [Get Started with Data Wrangler](data-wrangler-getting-started.md).

1. In Amazon SageMaker Studio Classic, from the **Home** (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)) menu in the left panel, navigate to the **Data** node, then choose **Data Wrangler**. This opens the **Data Wrangler landing page** in Studio Classic. 

1. Choose the **+ Import data** button to create a new flow. 

1. In your flow page, from the **Import** tab, choose Amazon S3, navigate to your Amazon S3 bucket, find your dataset, then choose **Import**. 

1. After you have imported your data, on the flow graph in the **Data flow** tab, choose the **+** sign to the right of the **Data types** node. 

1. Choose **Add analysis**. 

1. On the **Create Analysis** page, choose **Bias Report** for the **Analysis type**. 

1. Configure the bias report by providing a report **Name**, the column to predict and whether it is a value or threshold, the column to analyze for bias (the facet) and whether it is a value or threshold. 

1. Continue configuring the bias report by choosing the bias metrics.  
![\[Choose the bias metric.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify-data-wrangler-configure-bias-metrics.png)

1. Choose **Check for bias** to generate and view the bias report. Scroll down to view all of the reports.   
![\[Generate and view the bias report.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify-data-wrangler-create-bias-report.png)

1. Choose the caret to the right of each bias metric description to see documentation that can help you interpret the significance of the metric values. 

1. To view a table summary of the bias metric values, choose the **Table** toggle. To save the report, choose **Save** in the lower-right corner of the page. You can see the report on the flow graph in the **Data flow** tab. Double-click on the report to open it. 
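
If you prefer to run a similar pre-training bias analysis programmatically, you can configure a SageMaker Clarify processing job with the SageMaker Python SDK. The following is a minimal sketch; the IAM role ARN, S3 paths, and the `Target` label and `Gender` facet column names are placeholders that you would replace with your own values.

```python
from sagemaker import Session, clarify

session = Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://amzn-s3-demo-bucket/train/train.csv",  # placeholder
    s3_output_path="s3://amzn-s3-demo-bucket/clarify-output",       # placeholder
    label="Target",          # placeholder label column
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # positive outcome value
    facet_name="Gender",             # placeholder facet column
    facet_values_or_threshold=[0],   # disfavored facet value
)

# Runs a processing job that computes the pre-training bias metrics.
processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```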

# Post-training Data and Model Bias
<a name="clarify-detect-post-training-bias"></a>

Post-training bias analysis can help reveal biases that might have emanated from biases in the data, or from biases introduced by the classification and prediction algorithms. These analyses take into consideration the data, including the labels, and the predictions of a model. You assess performance by analyzing predicted labels or by comparing the predictions with the observed target values in the data with respect to groups with different attributes. There are different notions of fairness, each requiring different bias metrics to measure.

There are legal concepts of fairness that might not be easy to capture because they are hard to detect. For example, the US concept of disparate impact occurs when a group, referred to as a less favored facet *d*, experiences an adverse effect even when the approach taken appears to be fair. This type of bias might not be due to a machine learning model, but might still be detectable by post-training bias analysis.

Amazon SageMaker Clarify tries to ensure a consistent use of terminology. For a list of terms and their definitions, see [Amazon SageMaker Clarify Terms for Bias and Fairness](clarify-detect-data-bias.md#clarify-bias-and-fairness-terms).

For additional information about post-training bias metrics, see [Learn How Amazon SageMaker Clarify Helps Detect Bias](https://aws.amazon.com/blogs/machine-learning/learn-how-amazon-sagemaker-clarify-helps-detect-bias/) and [Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf).

# Post-training Data and Model Bias Metrics
<a name="clarify-measure-post-training-bias"></a>

Amazon SageMaker Clarify provides the following post-training data and model bias metrics to help quantify various conceptions of fairness. These concepts cannot all be satisfied simultaneously, and the selection depends on the specifics of the cases involving potential bias being analyzed. Most of these metrics are a combination of the numbers taken from the binary classification confusion matrices for the different demographic groups. Because fairness and bias can be defined by a wide range of metrics, human judgment is required to understand and choose which metrics are relevant to the individual use case, and customers should consult with appropriate stakeholders to determine the appropriate measure of fairness for their application.

We use the following notation to discuss the bias metrics. The conceptual model described here is for binary classification, where events are labeled as having only two possible outcomes in their sample space, referred to as positive (with value 1) and negative (with value 0). This framework is usually extensible to multicategory classification in a straightforward way or to cases involving continuous valued outcomes when needed. In the binary classification case, positive and negative labels are assigned to outcomes recorded in a raw dataset for a favored facet *a* and for a disfavored facet *d*. These labels y are referred to as *observed labels* to distinguish them from the *predicted labels* y' that are assigned by a machine learning model during the training or inference stages of the ML lifecycle. These labels are used to define probability distributions Pa(y) and Pd(y) for their respective facet outcomes. 
+ labels: 
  + y represents the n observed labels for event outcomes in a training dataset.
  + y' represents the predicted labels for the n observed labels in the dataset by a trained model.
+ outcomes:
  + A positive outcome (with value 1) for a sample, such as an application acceptance.
    + n(1) is the number of observed labels for positive outcomes (acceptances).
    + n'(1) is the number of predicted labels for positive outcomes (acceptances).
  + A negative outcome (with value 0) for a sample, such as an application rejection.
    + n(0) is the number of observed labels for negative outcomes (rejections).
    + n'(0) is the number of predicted labels for negative outcomes (rejections).
+ facet values:
  + facet *a* – The feature value that defines a demographic that bias favors.
    + na is the number of observed labels for the favored facet value: na = na(1) + na(0), the sum of the positive and negative observed labels for facet value *a*.
    + n'a is the number of predicted labels for the favored facet value: n'a = n'a(1) + n'a(0), the sum of the positive and negative predicted outcome labels for facet value *a*. Note that n'a = na.
  + facet *d* – The feature value that defines a demographic that bias disfavors.
    + nd is the number of observed labels for the disfavored facet value: nd = nd(1) + nd(0), the sum of the positive and negative observed labels for facet value *d*. 
    + n'd is the number of predicted labels for the disfavored facet value: n'd = n'd(1) + n'd(0), the sum of the positive and negative predicted labels for facet value *d*. Note that n'd = nd.
+ probability distributions for the labeled facet data outcomes:
  + Pa(y) is the probability distribution of the observed labels for facet *a*. For binary labeled data, this distribution is given by the ratio of the number of samples in facet *a* labeled with positive outcomes to the total number, Pa(y1) = na(1)/na, and the ratio of the number of samples with negative outcomes to the total number, Pa(y0) = na(0)/na. 
  + Pd(y) is the probability distribution of the observed labels for facet *d*. For binary labeled data, this distribution is given by the ratio of the number of samples in facet *d* labeled with positive outcomes to the total number, Pd(y1) = nd(1)/nd, and the ratio of the number of samples with negative outcomes to the total number, Pd(y0) = nd(0)/nd. 
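
As a concrete illustration of this notation, the following Python sketch builds the observed label distributions from hypothetical binary labels for each facet.

```python
from collections import Counter

# Hypothetical observed binary labels (1 = accepted, 0 = rejected) per facet.
observed = {"a": [1, 1, 1, 0, 0], "d": [1, 0, 0, 0, 0]}

for facet, y in observed.items():
    counts = Counter(y)  # counts[1] = n(1), counts[0] = n(0) for this facet
    n = len(y)
    print(facet, {"P(y1)": counts[1] / n, "P(y0)": counts[0] / n})
```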

The following table contains a cheat sheet for quick guidance and links to the post-training bias metrics.

Post-training bias metrics


| Post-training bias metric | Description | Example question | Interpreting metric values | 
| --- | --- | --- | --- | 
| [Difference in Positive Proportions in Predicted Labels (DPPL)](clarify-post-training-bias-metric-dppl.md) | Measures the difference in the proportion of positive predictions between the favored facet a and the disfavored facet d. |  Has there been an imbalance across demographic groups in the predicted positive outcomes that might indicate bias?  |  Range for normalized binary & multicategory facet labels: `[-1,+1]` Range for continuous labels: (-∞, +∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Disparate Impact (DI)](clarify-post-training-bias-metric-di.md) | Measures the ratio of proportions of the predicted labels for the favored facet a and the disfavored facet d. | Has there been an imbalance across demographic groups in the predicted positive outcomes that might indicate bias? |  Range for normalized binary, multicategory facet, and continuous labels: [0,∞) Interpretation: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Conditional Demographic Disparity in Predicted Labels (CDDPL)](clarify-post-training-bias-metric-cddpl.md)  | Measures the disparity of predicted labels between the facets as a whole, but also by subgroups. | Do some demographic groups have a larger proportion of rejections for loan application outcomes than their proportion of acceptances? |  The range of CDDPL values for binary, multicategory, and continuous outcomes: `[-1, +1]` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Counterfactual Fliptest (FT)](clarify-post-training-bias-metric-ft.md)  | Examines each member of facet d and assesses whether similar members of facet a have different model predictions. | Is one group of a specific-age demographic matched closely on all features with a different age group, yet paid more on average? | The range for binary and multicategory facet labels is [-1, +1]. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Accuracy Difference (AD)](clarify-post-training-bias-metric-ad.md)  | Measures the difference between the prediction accuracy for the favored and disfavored facets.  | Does the model predict labels as accurately for applications across all demographic groups? | The range for binary and multicategory facet labels is [-1, +1]. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Recall Difference (RD)](clarify-post-training-bias-metric-rd.md)  | Compares the recall of the model for the favored and disfavored facets.  | Is there an age-based bias in lending due to a model having higher recall for one age group as compared to another? |  Range for binary and multicategory classification: `[-1, +1]`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Difference in Conditional Acceptance (DCAcc)](clarify-post-training-bias-metric-dcacc.md)  | Compares the observed labels to the labels predicted by a model. Assesses whether this is the same across facets for predicted positive outcomes (acceptances).  | When comparing one age group to another, are loans accepted more frequently or less often than predicted (based on qualifications)? |  The range for binary, multicategory facet, and continuous labels: (-∞, +∞). [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Difference in Acceptance Rates (DAR)](clarify-post-training-bias-metric-dar.md)  | Measures the difference in the ratios of the observed positive outcomes (TP) to the predicted positives (TP + FP) between the favored and disfavored facets. | Does the model have equal precision when predicting loan acceptances for qualified applicants across all age groups? | The range for binary, multicategory facet, and continuous labels is [-1, +1]. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Specificity difference (SD)](clarify-post-training-bias-metric-sd.md)  | Compares the specificity of the model between favored and disfavored facets.  | Is there an age-based bias in lending because the model predicts a higher specificity for one age group as compared to another? |  Range for binary and multicategory classification: `[-1, +1]`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html)  | 
| [Difference in Conditional Rejection (DCR)](clarify-post-training-bias-metric-dcr.md)  | Compares the observed labels to the labels predicted by a model and assesses whether this is the same across facets for negative outcomes (rejections). | Are there more or fewer rejections for loan applications than predicted for one age group as compared to another, based on qualifications? | The range for binary, multicategory facet, and continuous labels: (-∞, +∞). [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Difference in Rejection Rates (DRR)](clarify-post-training-bias-metric-drr.md)  | Measures the difference in the ratios of the observed negative outcomes (TN) to the predicted negatives (TN + FN) between the disfavored and favored facets. | Does the model have equal precision when predicting loan rejections for unqualified applicants across all age groups? | The range for binary, multicategory facet, and continuous labels is [-1, +1]. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Treatment Equality (TE)](clarify-post-training-bias-metric-te.md)  | Measures the difference in the ratio of false positives to false negatives between the favored and disfavored facets. | In loan applications, is the relative ratio of false positives to false negatives the same across all age demographics?  | The range for binary and multicategory facet labels: (-∞, +∞). [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 
| [Generalized entropy (GE)](clarify-post-training-bias-metric-ge.md)  | Measures the inequality in benefits b assigned to each input by the model predictions. | Of two candidate models for loan application classification, does one lead to a more uneven distribution of desired outcomes than the other? | The range for binary and multicategory labels: (0, 0.5). GE is undefined when the model predicts only false negatives. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) | 

For additional information about post-training bias metrics, see [A Family of Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf).

**Topics**
+ [Difference in Positive Proportions in Predicted Labels (DPPL)](clarify-post-training-bias-metric-dppl.md)
+ [Disparate Impact (DI)](clarify-post-training-bias-metric-di.md)
+ [Difference in Conditional Acceptance (DCAcc)](clarify-post-training-bias-metric-dcacc.md)
+ [Difference in Conditional Rejection (DCR)](clarify-post-training-bias-metric-dcr.md)
+ [Specificity difference (SD)](clarify-post-training-bias-metric-sd.md)
+ [Recall Difference (RD)](clarify-post-training-bias-metric-rd.md)
+ [Difference in Acceptance Rates (DAR)](clarify-post-training-bias-metric-dar.md)
+ [Difference in Rejection Rates (DRR)](clarify-post-training-bias-metric-drr.md)
+ [Accuracy Difference (AD)](clarify-post-training-bias-metric-ad.md)
+ [Treatment Equality (TE)](clarify-post-training-bias-metric-te.md)
+ [Conditional Demographic Disparity in Predicted Labels (CDDPL)](clarify-post-training-bias-metric-cddpl.md)
+ [Counterfactual Fliptest (FT)](clarify-post-training-bias-metric-ft.md)
+ [Generalized entropy (GE)](clarify-post-training-bias-metric-ge.md)

# Difference in Positive Proportions in Predicted Labels (DPPL)
<a name="clarify-post-training-bias-metric-dppl"></a>

The difference in positive proportions in predicted labels (DPPL) metric determines whether the model predicts outcomes differently for each facet. It is defined as the difference between the proportion of positive predictions (y’ = 1) for facet *a* and the proportion of positive predictions (y’ = 1) for facet *d*. For example, if the model predictions grant loans to 60% of a middle-aged group (facet *a*) and to 50% of other age groups (facet *d*), it might be biased against facet *d*. In this example, you must determine whether the 10% difference is material to a case for bias. 

A comparison of difference in proportions of labels (DPL), a measure of pre-training bias, with DPPL, a measure of post-training bias, assesses whether bias in positive proportions that are initially present in the dataset changes after training. If DPPL is larger than DPL, then bias in positive proportions increased after training. If DPPL is smaller than DPL, the model did not increase bias in positive proportions after training. Comparing DPL against DPPL does not guarantee that the model reduces bias along all dimensions. For example, the model may still be biased when considering other metrics such as [Counterfactual Fliptest (FT)](clarify-post-training-bias-metric-ft.md) or [Accuracy Difference (AD)](clarify-post-training-bias-metric-ad.md). For more information about bias detection, see the blog post [Learn how Amazon SageMaker Clarify helps detect bias](https://aws.amazon.com/blogs/machine-learning/learn-how-amazon-sagemaker-clarify-helps-detect-bias/). See [Difference in Proportions of Labels (DPL)](clarify-data-bias-metric-true-label-imbalance.md) for more information about DPL.

The formula for the DPPL is:

        DPPL = q'a - q'd

Where:
+ q'a = n'a(1)/na is the predicted proportion of facet *a* who get a positive outcome of value 1. In our example, the proportion of the middle-aged facet predicted to be granted a loan. Here n'a(1) represents the number of members of facet *a* who get a positive predicted outcome of value 1 and na is the number of members of facet *a*. 
+ q'd = n'd(1)/nd is the predicted proportion of facet *d* who get a positive outcome of value 1. In our example, the proportion of the facet of older and younger people predicted to be granted a loan. Here n'd(1) represents the number of members of facet *d* who get a positive predicted outcome and nd is the number of members of facet *d*. 

If DPPL is close enough to 0, it means that post-training *demographic parity* has been achieved.

For binary and multicategory facet labels, the normalized DPPL values range over the interval [-1, 1]. For continuous labels, the values vary over the interval (-∞, +∞). 
+ Positive DPPL values indicate that facet *a* has a higher proportion of predicted positive outcomes when compared with facet *d*. 

  This is referred to as *positive bias*.
+ Values of DPPL near zero indicate a more equal proportion of predicted positive outcomes between facets *a* and *d* and a value of zero indicates perfect demographic parity. 
+ Negative DPPL values indicate that facet *d* has a higher proportion of predicted positive outcomes when compared with facet *a*. This is referred to as *negative bias*.
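
A minimal Python sketch of the DPPL calculation using the loan example above:

```python
def dppl(n1_pred_a, n_a, n1_pred_d, n_d):
    """DPPL = q'a - q'd, the gap in predicted positive proportions."""
    return n1_pred_a / n_a - n1_pred_d / n_d

# Loan example above: 60% of facet a vs. 50% of facet d predicted positive.
print(dppl(60, 100, 50, 100))  # 0.1
```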

# Disparate Impact (DI)
<a name="clarify-post-training-bias-metric-di"></a>

The comparison of positive proportions in predicted labels can be assessed in the form of a ratio instead of as a difference, as it is with the [Difference in Positive Proportions in Predicted Labels (DPPL)](clarify-post-training-bias-metric-dppl.md). The disparate impact (DI) metric is defined as the ratio of the proportion of positive predictions (y’ = 1) for facet *d* over the proportion of positive predictions (y’ = 1) for facet *a*. For example, if the model predictions grant loans to 60% of a middle-aged group (facet *a*) and to 50% of other age groups (facet *d*), then DI = 0.5/0.6 ≈ 0.83, which indicates a positive bias and an adverse impact on the other age groups represented by facet *d*.

The formula for the ratio of proportions of the predicted labels:

        DI = q'd/q'a

Where:
+ q'a = n'a(1)/na is the predicted proportion of facet *a* who get a positive outcome of value 1. In our example, the proportion of the middle-aged facet predicted to be granted a loan. Here n'a(1) represents the number of members of facet *a* who get a positive predicted outcome and na is the number of members of facet *a*. 
+ q'd = n'd(1)/nd is the predicted proportion of facet *d* who get a positive outcome of value 1. In our example, the proportion of the facet of older and younger people predicted to be granted a loan. Here n'd(1) represents the number of members of facet *d* who get a positive predicted outcome and nd is the number of members of facet *d*. 

For binary, multicategory facet, and continuous labels, the DI values range over the interval [0, ∞).
+ Values less than 1 indicate that facet *a* has a higher proportion of predicted positive outcomes than facet *d*. This is referred to as *positive bias*.
+ A value of 1 indicates demographic parity. 
+ Values greater than 1 indicate that facet *d* has a higher proportion of predicted positive outcomes than facet *a*. This is referred to as *negative bias*.
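
A minimal Python sketch of the DI calculation using the same loan example:

```python
def disparate_impact(n1_pred_a, n_a, n1_pred_d, n_d):
    """DI = q'd / q'a, the ratio of predicted positive proportions."""
    return (n1_pred_d / n_d) / (n1_pred_a / n_a)

# Loan example above: 60% of facet a vs. 50% of facet d predicted positive.
print(disparate_impact(60, 100, 50, 100))  # ~0.83
```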

# Difference in Conditional Acceptance (DCAcc)
<a name="clarify-post-training-bias-metric-dcacc"></a>

This metric compares the observed labels to the labels predicted by the model and assesses whether this is the same across facets for predicted positive outcomes. This metric comes close to mimicking human bias in that it quantifies how many more positive outcomes a model predicted (labels y’) for a certain facet as compared to what was observed in the training dataset (labels y). For example, if there were more acceptances (a positive outcome) observed in the training dataset for loan applications for a middle-aged group (facet *a*) than predicted by the model based on qualifications as compared to the facet containing other age groups (facet *d*), this might indicate potential bias in the way loans were approved favoring the middle-aged group. 

The formula for the difference in conditional acceptance:

        DCAcc = ca - cd

Where:
+ ca = na(1)/n'a(1) is the ratio of the observed number of positive outcomes of value 1 (acceptances) for facet *a* to the predicted number of positive outcomes (acceptances) for facet *a*. 
+ cd = nd(1)/n'd(1) is the ratio of the observed number of positive outcomes of value 1 (acceptances) for facet *d* to the predicted number of positive outcomes (acceptances) for facet *d*. 

The DCAcc metric can capture both positive and negative biases that reveal preferential treatment based on qualifications. Consider the following instances of age-based bias on loan acceptances.

**Example 1: Positive bias** 

Suppose we have a dataset of 100 middle-aged people (facet *a*) and 50 people from other age groups (facet *d*) who applied for loans, where the model recommended that 60 from facet *a* and 30 from facet *d* be given loans. So the predicted proportions are unbiased with respect to the DPPL metric, but the observed labels show that 70 from facet *a* and 20 from facet *d* were granted loans. In other words, the model granted loans to 17% fewer from the middle-aged facet than the observed labels in the training data suggested (70/60 = 1.17), and granted loans to 33% more from other age groups than the observed labels suggested (20/30 = 0.67). The calculation of the DCAcc value gives the following:

        DCAcc = 70/60 - 20/30 = 1/2

The positive value indicates that there is a potential bias against the middle-aged facet *a* with a lower acceptance rate as compared with the other facet *d* than the observed data (taken as unbiased) indicate is the case.

**Example 2: Negative bias** 

Suppose we have a dataset of 100 middle-aged people (facet *a*) and 50 people from other age groups (facet *d*) who applied for loans, where the model recommended that 60 from facet *a* and 30 from facet *d* be given loans. So the predicted proportions are unbiased with respect to the DPPL metric, but the observed labels show that 50 from facet *a* and 40 from facet *d* were granted loans. In other words, the model granted loans to 17% more from the middle-aged facet than the observed labels in the training data suggested (50/60 = 0.83), and granted loans to 33% fewer from other age groups than the observed labels suggested (40/30 = 1.33). The calculation of the DCAcc value gives the following:

        DCAcc = 50/60 - 40/30 = -1/2

The negative value indicates that there is a potential bias against facet *d* with a lower acceptance rate as compared with the middle-aged facet *a* than the observed data (taken as unbiased) indicate is the case.

Note that you can use DCAcc to help you detect potential (unintentional) biases by humans overseeing the model predictions in a human-in-the-loop setting. Assume, for example, that the predictions y' by the model were unbiased, but the eventual decision is made by a human (possibly with access to additional features) who can alter the model predictions to generate a new and final version of y'. The additional processing by the human may unintentionally deny loans to a disproportionate number from one facet. DCAcc can help detect such potential biases.

The range of values for differences in conditional acceptance for binary, multicategory facet, and continuous labels is (-∞, +∞).
+ Positive values occur when the ratio of the observed number of acceptances compared to predicted acceptances for facet *a* is higher than the same ratio for facet *d*. These values indicate a possible bias against the qualified applicants from facet *a*. The larger the difference of the ratios, the more extreme the apparent bias.
+ Values near zero occur when the ratio of the observed number of acceptances compared to predicted acceptances for facet *a* is similar to the ratio for facet *d*. These values indicate that predicted acceptance rates are consistent with the observed values in the labeled data and that qualified applicants from both facets are being accepted in a similar way. 
+ Negative values occur when the ratio of the observed number of acceptances compared to predicted acceptances for facet *a* is less than that ratio for facet *d*. These values indicate a possible bias against the qualified applicants from facet *d*. The more negative the difference in the ratios, the more extreme the apparent bias.
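
The following Python sketch reproduces both worked examples:

```python
def dcacc(obs_acc_a, pred_acc_a, obs_acc_d, pred_acc_d):
    """DCAcc = ca - cd, ratios of observed to predicted acceptances."""
    return obs_acc_a / pred_acc_a - obs_acc_d / pred_acc_d

print(dcacc(70, 60, 20, 30))  # Example 1: +0.5, potential bias against facet a
print(dcacc(50, 60, 40, 30))  # Example 2: -0.5, potential bias against facet d
```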

# Difference in Conditional Rejection (DCR)
<a name="clarify-post-training-bias-metric-dcr"></a>

This metric compares the observed labels to the labels predicted by the model and assesses whether this is the same across facets for negative outcomes (rejections). This metric comes close to mimicking human bias, in that it quantifies how many more negative outcomes a model granted (predicted labels y’) to a certain facet as compared to what was suggested by the labels in the training dataset (observed labels y). For example, if there were more observed rejections (a negative outcome) for loan applications for a middle-aged group (facet *a*) than predicted by the model based on qualifications as compared to the facet containing other age groups (facet *d*), this might indicate potential bias in the way loans were rejected that favored the middle-aged group over other groups.

The formula for the difference in conditional rejection:

        DCR = rd - ra

Where:
+ rd = nd(0)/n'd(0) is the ratio of the observed number of negative outcomes of value 0 (rejections) for facet *d* to the predicted number of negative outcomes (rejections) for facet *d*. 
+ ra = na(0)/n'a(0) is the ratio of the observed number of negative outcomes of value 0 (rejections) for facet *a* to the predicted number of negative outcomes (rejections) for facet *a*. 

The DCR metric can capture both positive and negative biases that reveal preferential treatment based on qualifications. Consider the following instances of age-based bias on loan rejections.

**Example 1: Positive bias** 

Suppose we have a dataset of 100 middle-aged people (facet *a*) and 50 people from other age groups (facet *d*) who applied for loans, where the model recommended that 60 from facet *a* and 30 from facet *d* be rejected for loans. So the predicted proportions are unbiased by the DPPL metric, but the observed labels show that 50 from facet *a* and 40 from facet *d* were rejected. In other words, the model rejected 17% more loans from the middle-aged facet than the observed labels in the training data suggested (50/60 = 0.83), and rejected 33% fewer loans from other age groups than the observed labels suggested (40/30 = 1.33). The DCR value quantifies this difference in the ratio of observed to predicted rejection rates between the facets. The positive value indicates that there is a potential bias against the middle-aged facet *a*, which has a higher rejection rate as compared with facet *d* than the observed data (taken as unbiased) indicate is the case.

        DCR = 40/30 - 50/60 = 1/2

**Example 2: Negative bias** 

Suppose we have a dataset of 100 middle-aged people (facet *a*) and 50 people from other age groups (facet *d*) who applied for loans, where the model recommended that 60 from facet *a* and 30 from facet *d* be rejected for loans. So the predicted proportions are unbiased by the DPPL metric, but the observed labels show that 70 from facet *a* and 20 from facet *d* were rejected. In other words, the model rejected 17% fewer loans from the middle-aged facet than the observed labels in the training data suggested (70/60 = 1.17), and rejected 33% more loans from other age groups than the observed labels suggested (20/30 = 0.67). The negative value indicates that there is a potential bias against facet *d*, which has a higher rejection rate as compared with the middle-aged facet *a* than the observed data (taken as unbiased) indicate is the case.

        DCR = 20/30 - 70/60 = -1/2

The range of values for differences in conditional rejection for binary, multicategory facet, and continuous labels is (-∞, +∞).
+ Positive values occur when the ratio of the observed number of rejections compared to predicted rejections for facet *d* is greater than that ratio for facet *a*. These values indicate a possible bias against the qualified applicants from facet *a*. The larger the value of the DCR metric, the more extreme the apparent bias.
+ Values near zero occur when the ratio of the observed number of rejections compared to predicted rejections for facet *a* is similar to the ratio for facet *d*. These values indicate that predicted rejection rates are consistent with the observed values in the labeled data and that qualified applicants from both facets are being rejected in a similar way. 
+ Negative values occur when the ratio of the observed number of rejections compared to predicted rejections for facet *d* is less than that ratio for facet *a*. These values indicate a possible bias against the qualified applicants from facet *d*. The larger the magnitude of the negative DCR value, the more extreme the apparent bias.
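
The following Python sketch reproduces both worked examples:

```python
def dcr(obs_rej_d, pred_rej_d, obs_rej_a, pred_rej_a):
    """DCR = rd - ra, ratios of observed to predicted rejections."""
    return obs_rej_d / pred_rej_d - obs_rej_a / pred_rej_a

print(dcr(40, 30, 50, 60))  # Example 1: +0.5, potential bias against facet a
print(dcr(20, 30, 70, 60))  # Example 2: -0.5, potential bias against facet d
```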


# Specificity difference (SD)
<a name="clarify-post-training-bias-metric-sd"></a>

The specificity difference (SD) is the difference in specificity between the favored facet *a* and disfavored facet *d*. Specificity measures how often the model correctly predicts a negative outcome (y'=0). Any difference in these specificities is a potential form of bias. 

Specificity is perfect for a facet if all of the y=0 cases are correctly predicted for that facet. Specificity is greater when the model minimizes false positives, known as a Type I error. For example, the difference between a low specificity for lending to facet *a*, and high specificity for lending to facet *d*, is a measure of bias against facet *d*.

The following formula is for the difference in the specificity for facets *a* and *d*.

        SD = TNd/(TNd + FPd) - TNa/(TNa + FPa) = TNRd - TNRa

The variables used to calculate SD are defined as follows:
+ TNd are the true negatives predicted for facet *d*.
+ FPd are the false positives predicted for facet *d*.
+ TNa are the true negatives predicted for facet *a*.
+ FPa are the false positives predicted for facet *a*.
+ TNRa = TNa/(TNa + FPa) is the true negative rate, also known as the specificity, for facet *a*.
+ TNRd = TNd/(TNd + FPd) is the true negative rate, also known as the specificity, for facet *d*.

For example, consider the following confusion matrices for facets *a* and *d*.

Confusion matrix for the favored facet `a`


| Class a predictions | Actual outcome 0 | Actual outcome 1 | Total  | 
| --- | --- | --- | --- | 
| 0 | 20 | 5 | 25 | 
| 1 | 10 | 65 | 75 | 
| Total | 30 | 70 | 100 | 

Confusion matrix for the disfavored facet `d`


| Class d predictions | Actual outcome 0 | Actual outcome 1 | Total  | 
| --- | --- | --- | --- | 
| 0 | 18 | 7 | 25 | 
| 1 | 5 | 20 | 25 | 
| Total | 23 | 27 | 50 | 

The value of the specificity difference is `SD = 18/(18+5) - 20/(20+10) = 0.7826 - 0.6667 = 0.1159`, which indicates a bias against facet *d*.

The range of values for the specificity difference between facets *a* and *d* for binary and multicategory classification is `[-1, +1]`. This metric is not available for the case of continuous labels. Here is what different values of SD imply:
+ Positive values are obtained when there is higher specificity for facet *d* than for facet *a*. This suggests that the model finds fewer false positives for facet *d* than for facet *a*. A positive value indicates bias against facet *d*. 
+ Values near zero indicate that the specificity for the facets being compared is similar. This suggests that the model finds a similar number of false positives in both of these facets and is not biased.
+ Negative values are obtained when there is higher specificity for facet *a* than for facet *d*. This suggests that the model finds fewer false positives for facet *a* than for facet *d*. A negative value indicates bias against facet *a*. 
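
A minimal Python sketch of the SD calculation using the counts from the example confusion matrices:

```python
def specificity_difference(tn_d, fp_d, tn_a, fp_a):
    """SD = TNRd - TNRa, the gap in true negative rates."""
    return tn_d / (tn_d + fp_d) - tn_a / (tn_a + fp_a)

# Counts from the example confusion matrices above.
print(specificity_difference(tn_d=18, fp_d=5, tn_a=20, fp_a=10))  # ~0.116
```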

# Recall Difference (RD)
<a name="clarify-post-training-bias-metric-rd"></a>

The recall difference (RD) metric is the difference in recall of the model between the favored facet *a* and disfavored facet *d*. Any difference in these recalls is a potential form of bias. Recall is the true positive rate (TPR), which measures how often the model correctly predicts the cases that should receive a positive outcome. Recall is perfect for a facet if all of the y=1 cases are correctly predicted as y’=1 for that facet. Recall is greater when the model minimizes false negatives, known as Type II errors. For example, how many of the people in two different groups (facets *a* and *d*) who should qualify for loans are detected correctly by the model? If the recall rate is high for lending to facet *a*, but low for lending to facet *d*, the difference provides a measure of this bias against the group belonging to facet *d*. 

The formula for difference in the recall rates for facets *a* and *d*:

        RD = TPa/(TPa + FNa) - TPd/(TPd + FNd) = TPRa - TPRd 

Where:
+ TPa are the true positives predicted for facet *a*.
+ FNa are the false negatives predicted for facet *a*.
+ TPd are the true positives predicted for facet *d*.
+ FNd are the false negatives predicted for facet *d*.
+ TPRa = TPa/(TPa + FNa) is the recall for facet *a*, or its true positive rate.
+ TPRd = TPd/(TPd + FNd) is the recall for facet *d*, or its true positive rate.

For example, consider the following confusion matrices for facets *a* and *d*.

Confusion matrix for the favored facet `a`


| Class a predictions | Actual outcome 0 | Actual outcome 1 | Total  | 
| --- | --- | --- | --- | 
| 0 | 20 | 5 | 25 | 
| 1 | 10 | 65 | 75 | 
| Total | 30 | 70 | 100 | 

Confusion matrix for the disfavored facet `d`


| Class d predictions | Actual outcome 0 | Actual outcome 1 | Total  | 
| --- | --- | --- | --- | 
| 0 | 18 | 7 | 25 | 
| 1 | 5 | 20 | 25 | 
| Total | 23 | 27 | 50 | 

The value of the recall difference is RD = 65/70 - 20/27 = 0.93 - 0.74 = 0.19, which indicates a bias against facet *d*.

The range of values for the recall difference between facets *a* and *d* for binary and multicategory classification is [-1, +1]. This metric is not available for the case of continuous labels.
+ Positive values are obtained when there is higher recall for facet *a* than for facet *d*. This suggests that the model finds more of the true positives for facet *a* than for facet *d*, which is a form of bias. 
+ Values near zero indicate that the recall for facets being compared is similar. This suggests that the model finds about the same number of true positives in both of these facets and is not biased.
+ Negative values are obtained when there is higher recall for facet *d* than for facet *a*. This suggests that the model finds more of the true positives for facet *d* than for facet *a*, which is a form of bias. 
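
A minimal Python sketch of the RD calculation using the counts from the example confusion matrices:

```python
def recall_difference(tp_a, fn_a, tp_d, fn_d):
    """RD = TPRa - TPRd, the gap in true positive rates (recall)."""
    return tp_a / (tp_a + fn_a) - tp_d / (tp_d + fn_d)

# Counts from the example confusion matrices above.
print(recall_difference(tp_a=65, fn_a=5, tp_d=20, fn_d=7))  # ~0.19
```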

# Difference in Acceptance Rates (DAR)
<a name="clarify-post-training-bias-metric-dar"></a>

The difference in acceptance rates (DAR) metric is the difference in the ratios of the observed positive outcomes (TP) to the predicted positives (TP + FP) between facets *a* and *d*. This metric measures the difference in the precision of the model for predicting acceptances from these two facets. Precision measures the fraction of candidates predicted to be qualified by the model who actually are qualified. If the model precision for predicting qualified applicants diverges between the facets, this is a bias and its magnitude is measured by the DAR.

The formula for difference in acceptance rates between facets *a* and *d*:

        DAR = TPa/(TPa + FPa) - TPd/(TPd + FPd) 

Where:
+ TPa are the true positives predicted for facet *a*.
+ FPa are the false positives predicted for facet *a*.
+ TPd are the true positives predicted for facet *d*.
+ FPd are the false positives predicted for facet *d*.

For example, suppose the model accepts 70 middle-aged applicants (facet *a*) for a loan (predicted positive labels) of whom only 35 are actually accepted (observed positive labels). Also suppose the model accepts 100 applicants from other age demographics (facet *d*) for a loan (predicted positive labels) of whom only 40 are actually accepted (observed positive labels). Then DAR = 35/70 - 40/100 = 0.10, which indicates a potential bias against qualified people from the second age group (facet *d*).

The range of values for DAR for binary, multicategory facet, and continuous labels is [-1, +1].
+ Positive values occur when the ratio of the observed positive outcomes (qualified applicants) to the predicted positives (acceptances) for facet *a* is larger than the same ratio for facet *d*. These values indicate a possible bias against the disfavored facet *d* caused by the occurrence of relatively more false positives in facet *d*. The larger the difference in the ratios, the more extreme the apparent bias.
+ Values near zero occur when the ratios of the observed positive outcomes (qualified applicants) to the predicted positives (acceptances) for facets *a* and *d* have similar values, indicating the observed labels for positive outcomes are being predicted with equal precision by the model.
+ Negative values occur when the ratio of the observed positive outcomes (qualified applicants) to the predicted positives (acceptances) for facet *d* is larger than the ratio for facet *a*. These values indicate a possible bias against the favored facet *a* caused by the occurrence of relatively more false positives in facet *a*. The more negative the difference in the ratios, the more extreme the apparent bias.
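
A minimal Python sketch of the DAR calculation using the counts from this example:

```python
def dar(tp_a, fp_a, tp_d, fp_d):
    """DAR = TPa/(TPa + FPa) - TPd/(TPd + FPd), the acceptance precision gap."""
    return tp_a / (tp_a + fp_a) - tp_d / (tp_d + fp_d)

# Example above: 35 of 70 facet a acceptances and 40 of 100 facet d
# acceptances were observed positives.
print(dar(tp_a=35, fp_a=35, tp_d=40, fp_d=60))  # 0.1
```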

# Difference in Rejection Rates (DRR)
<a name="clarify-post-training-bias-metric-drr"></a>

The difference in rejection rates (DRR) metric is the difference in the ratios of the observed negative outcomes (TN) to the predicted negatives (TN + FN) between facets *a* and *d*. This metric measures the difference in the precision of the model for predicting rejections from these two facets. Here precision measures the fraction of candidates predicted to be unqualified by the model who actually are unqualified. If the model precision for predicting unqualified applicants diverges between the facets, this is a bias and its magnitude is measured by the DRR.

The formula for difference in rejection rates between facets *a* and *d*:

        DRR = TNd/(TNd + FNd) - TNa/(TNa + FNa) 

The components of the previous DRR equation are as follows:
+ TNd are the true negatives predicted for facet *d*.
+ FNd are the false negatives predicted for facet *d*.
+ TNa are the true negatives predicted for facet *a*.
+ FNa are the false negatives predicted for facet *a*.

For example, suppose the model rejects 100 middle-aged applicants (facet *a*) for a loan (predicted negative labels) of whom 80 are actually unqualified (observed negative labels). Also suppose the model rejects 50 applicants from other age demographics (facet *d*) for a loan (predicted negative labels) of whom only 40 are actually unqualified (observed negative labels). Then DRR = 40/50 - 80/100 = 0, so no bias is indicated.

The range of values for DRR for binary, multicategory facet, and continuous labels is [-1, +1].
+ Positive values occur when the ratio of the observed negative outcomes (unqualified applicants) to the predicted negatives (rejections) for facet *d* is larger than the same ratio for facet *a*. These values indicate a possible bias against the favored facet *a* caused by the occurrence of relatively more false negatives in facet *a*. The larger the difference in the ratios, the more extreme the apparent bias.
+ Values near zero occur when the ratios of the observed negative outcomes (unqualified applicants) to the predicted negatives (rejections) for facets *a* and *d* have similar values, indicating the observed labels for negative outcomes are being predicted with equal precision by the model.
+ Negative values occur when the ratio of the observed negative outcomes (unqualified applicants) to the predicted negatives (rejections) for facet *a* is larger than the ratio for facet *d*. These values indicate a possible bias against the disfavored facet *d* caused by the occurrence of relatively more false negatives in facet *d*. The more negative the difference in the ratios, the more extreme the apparent bias.
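
A minimal Python sketch of the DRR calculation using the counts from this example:

```python
def drr(tn_d, fn_d, tn_a, fn_a):
    """DRR = TNd/(TNd + FNd) - TNa/(TNa + FNa), the rejection precision gap."""
    return tn_d / (tn_d + fn_d) - tn_a / (tn_a + fn_a)

# Example above: 40 of 50 facet d rejections and 80 of 100 facet a
# rejections were observed negatives.
print(drr(tn_d=40, fn_d=10, tn_a=80, fn_a=20))  # 0.0
```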

# Accuracy Difference (AD)
<a name="clarify-post-training-bias-metric-ad"></a>

The accuracy difference (AD) metric is the difference between the prediction accuracy for different facets. This metric determines whether the classification by the model is more accurate for one facet than for the other. AD indicates whether one facet incurs a greater proportion of Type I and Type II errors, but it cannot differentiate between Type I and Type II errors. For example, the model may have equal accuracy for different age demographics, but the errors may be mostly false positives (Type I errors) for one age-based group and mostly false negatives (Type II errors) for the other. 

Also, if loan approvals are made with much higher accuracy for a middle-aged demographic (facet *a*) than for another age-based demographic (facet *d*), either a greater proportion of qualified applicants in the second group are denied a loan (FN), or a greater proportion of unqualified applicants from that group get a loan (FP), or both. This can lead to within-group unfairness for the second group, even if the proportion of loans granted is nearly the same for both age-based groups, which is indicated by a DPPL value that is close to zero.

The formula for the AD metric is the prediction accuracy for facet *a*, ACCa, minus that for facet *d*, ACCd:

        AD = ACCa - ACCd

Where:
+ ACCa = (TPa + TNa)/(TPa + TNa + FPa + FNa) 
  + TPa are the true positives predicted for facet *a*
  + TNa are the true negatives predicted for facet *a*
  + FPa are the false positives predicted for facet *a*
  + FNa are the false negatives predicted for facet *a*
+ ACCd = (TPd + TNd)/(TPd + TNd + FPd + FNd)
  + TPd are the true positives predicted for facet *d*
  + TNd are the true negatives predicted for facet *d*
  + FPd are the false positives predicted for facet *d*
  + FNd are the false negatives predicted for facet *d*

For example, suppose a model approves loans for 70 of 100 applicants from facet *a* and rejects the other 30. 10 of the approvals should not have been offered a loan (FPa) and 60 were correctly approved (TPa). 20 of the rejections should have been approved (FNa) and 10 were correctly rejected (TNa). The accuracy for facet *a* is as follows:

        ACCa = (60 + 10)/(60 + 10 + 20 + 10) = 0.7

Next, suppose the model approves loans for 50 of 100 applicants from facet *d* and rejects the other 50. 10 of the approvals should not have been offered a loan (FPd) and 40 were correctly approved (TPd). 40 of the rejections should have been approved (FNd) and 10 were correctly rejected (TNd). The accuracy for facet *d* is determined as follows:

        ACCd = (40 + 10)/(40 + 10 + 40 + 10) = 0.5

The accuracy difference is thus AD = ACCa - ACCd = 0.7 - 0.5 = 0.2. This indicates there is a bias against facet *d* as the metric is positive.
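
A minimal Python sketch of the AD calculation using the counts from this worked example:

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# (TP, TN, FP, FN) counts from the worked example above.
ad = accuracy(60, 10, 10, 20) - accuracy(40, 10, 10, 40)
print(ad)  # 0.2, indicating a potential bias against facet d
```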

The range of values for AD for binary and multicategory facet labels is [-1, +1].
+ Positive values occur when the prediction accuracy for facet *a* is greater than that for facet *d*. This means that facet *d* suffers more from some combination of false positives (Type I errors) or false negatives (Type II errors), indicating a potential bias against the disfavored facet *d*.
+ Values near zero occur when the prediction accuracy for facet *a* is similar to that for facet *d*.
+ Negative values occur when the prediction accuracy for facet *d* is greater than that for facet *a*. This means that facet *a* suffers more from some combination of false positives (Type I errors) or false negatives (Type II errors), indicating a potential bias against the favored facet *a*.

# Treatment Equality (TE)
<a name="clarify-post-training-bias-metric-te"></a>

The treatment equality (TE) metric is the difference in the ratio of false negatives to false positives between facets *a* and *d*. The main idea of this metric is to assess whether, even if the accuracy across groups is the same, errors are more harmful to one group than another. The error rate comes from the total of false positives and false negatives, but the breakdown of these two may be very different across facets. TE measures whether errors compensate in similar or different ways across facets. 

The formula for the treatment equality:

        TE = FNd/FPd - FNa/FPa

Where:
+ FNd are the false negatives predicted for facet *d*.
+ FPd are the false positives predicted for facet *d*.
+ FNa are the false negatives predicted for facet *a*.
+ FPa are the false positives predicted for facet *a*.

Note the metric becomes unbounded if FPa or FPd is zero.

For example, suppose that there are 100 loan applicants from facet *a* and 50 from facet *d*. For facet *a*, 8 were wrongly denied a loan (FNa) and another 6 were wrongly approved (FPa). The remaining predictions were true, so TPa + TNa = 86. For facet *d*, 5 were wrongly denied (FNd) and 2 were wrongly approved (FPd). The remaining predictions were true, so TPd + TNd = 43. The ratio of false negatives to false positives equals 8/6 = 1.33 for facet *a* and 5/2 = 2.5 for facet *d*. Hence TE = 2.5 - 1.33 = 1.167, even though both facets have the same accuracy:

        ACCa = 86/(86 + 8 + 6) = 0.86

        ACCd = 43/(43 + 5 + 2) = 0.86

The range of values for treatment equality for binary and multicategory facet labels is (-∞, +∞). The TE metric is not defined for continuous labels. The interpretation of this metric depends on the relative importance of false positives (Type I errors) and false negatives (Type II errors). 
+ Positive values occur when the ratio of false negatives to false positives for facet *d* is greater than that for facet *a*. 
+ Values near zero occur when the ratio of false negatives to false positives for facet *a* is similar to that for facet *d*. 
+ Negative values occur when the ratio of false negatives to false positives for facet *d* is less than that for facet *a*.
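
A minimal Python sketch of the TE calculation using the counts from this example:

```python
def treatment_equality(fn_d, fp_d, fn_a, fp_a):
    """TE = FNd/FPd - FNa/FPa; unbounded if either FP count is zero."""
    return fn_d / fp_d - fn_a / fp_a

# Counts from the worked example above.
print(treatment_equality(fn_d=5, fp_d=2, fn_a=8, fp_a=6))  # ~1.167
```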

**Note**  
A previous version stated that the Treatment Equality metric is computed as FPa / FNa - FPd / FNd instead of FNd / FPd - FNa / FPa. Either version can be used. For more information, see [https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf).

# Conditional Demographic Disparity in Predicted Labels (CDDPL)
<a name="clarify-post-training-bias-metric-cddpl"></a>

The demographic disparity in predicted labels (DDPL) metric determines whether facet *d* has a larger proportion of the predicted rejected labels than of the predicted accepted labels. It enables a comparison of the difference in predicted rejection proportions and predicted acceptance proportions across facets. This metric is exactly the same as the pre-training CDD metric, except that it is computed from the predicted labels instead of the observed ones. This metric lies in the range (-1, +1).

The formula for the demographic disparity predictions for labels of facet *d* is as follows: 

        DDPLd = n'd(0)/n'(0) - n'd(1)/n'(1) = PdR(y'0) - PdA(y'1) 

Where: 
+ n'(0) = n'a(0) + n'd(0) is the number of predicted rejected labels for facets *a* and *d*.
+ n'(1) = n'a(1) + n'd(1) is the number of predicted accepted labels for facets *a* and *d*.
+ PdR(y'0) is the proportion of predicted rejected labels (value 0) in facet *d*.
+ PdA(y'1) is the proportion of predicted accepted labels (value 1) in facet *d*.

A conditional demographic disparity in predicted labels (CDDPL) metric, which conditions DDPL on attributes that define strata of subgroups on the dataset, is needed to rule out Simpson's paradox. The regrouping can provide insights into the cause of apparent demographic disparities for less favored facets. The classic case arose in Berkeley admissions, where men were accepted at a higher rate overall than women. But when departmental subgroups were examined, women were shown to have higher admission rates than men by department. The explanation was that women had applied to departments with lower acceptance rates than men had. Examining the subgroup acceptance rates revealed that women were actually accepted at a higher rate than men in the departments with lower acceptance rates.

The CDDPL metric gives a single measure for all of the disparities found in the subgroups defined by an attribute of a dataset by averaging them. It is defined as the weighted average of demographic disparities in predicted labels (DDPLi) for each of the subgroups, with each subgroup disparity weighted in proportion to the number of observations it contains. The formula for the conditional demographic disparity in predicted labels is as follows:

        CDDPL = (1/n) * ∑i ni * DDPLi

Where: 
+ ∑i ni = n is the total number of observations and ni is the number of observations for subgroup *i*.
+ DDPLi = n'i(0)/n'(0) - n'i(1)/n'(1) = PiR(y'0) - PiA(y'1) is the demographic disparity in predicted labels for subgroup *i*.

So the demographic disparity in predicted labels for a subgroup (DDPLi) is the difference between the proportion of predicted rejected labels and the proportion of predicted accepted labels for that subgroup.

The range of DDPL values for binary, multicategory, and continuous outcomes is [-1, +1]. 
+ +1: when there are no predicted rejection labels for facet *a* or subgroup and no predicted acceptances for facet *d* or subgroup.
+ Positive values indicate there is a demographic disparity in predicted labels as facet *d* or subgroup has a larger proportion of the predicted rejected labels than of the predicted accepted labels. The higher the value the greater the disparity.
+ Values near zero indicate there is no demographic disparity on average.
+ Negative values indicate there is a demographic disparity in predicted labels as facet *a* or subgroup has a larger proportion of the predicted rejected labels than of the predicted accepted labels. The lower the value the greater the disparity.
+ -1: when there are no predicted rejection labels for facet *d* or subgroup and no predicted acceptances for facet *a* or subgroup.
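
The following is a minimal sketch, assuming a hypothetical pandas DataFrame with a binary predicted-label column `y_pred` (1 = accepted, 0 = rejected), a facet column `facet` (values `"a"` and `"d"`), and a conditioning attribute `group` that defines the subgroups; it is not part of the SageMaker Clarify API:

```
import pandas as pd

def ddpl(df, facet_col="facet", pred_col="y_pred", facet_d="d"):
    # Proportion of predicted rejections belonging to facet d, minus the
    # proportion of predicted acceptances belonging to facet d.
    rejected = df[df[pred_col] == 0]
    accepted = df[df[pred_col] == 1]
    return (rejected[facet_col] == facet_d).mean() - (accepted[facet_col] == facet_d).mean()

def cddpl(df, group_col="group", **kwargs):
    # Weighted average of per-subgroup DDPL, weighted by subgroup size.
    n = len(df)
    return sum(len(sub) * ddpl(sub, **kwargs) for _, sub in df.groupby(group_col)) / n
```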

# Counterfactual Fliptest (FT)
<a name="clarify-post-training-bias-metric-ft"></a>

The fliptest is an approach that looks at each member of facet *d* and assesses whether similar members of facet *a* have different model predictions. The members of facet *a* are chosen to be k-nearest neighbors of the observation from facet *d*. We assess how many nearest neighbors of the opposite group receive a different prediction, where the flipped prediction can go from positive to negative and vice versa. 

The formula for the counterfactual fliptest is the difference in the cardinality of two sets divided by the number of members of facet *d*:

        FT = (F+ - F-)/nd

Where:
+ F+ is the number of disfavored facet *d* members with an unfavorable outcome whose nearest neighbors in favored facet *a* received a favorable outcome.
+ F- is the number of disfavored facet *d* members with a favorable outcome whose nearest neighbors in favored facet *a* received an unfavorable outcome.
+ nd is the sample size of facet *d*.

The range of values for the counterfactual fliptest for binary and multicategory facet labels is [-1, +1]. For continuous labels, we set a threshold to collapse the labels to binary.
+ Positive values occur when the number of unfavorable counterfactual fliptest decisions for the disfavored facet *d* exceeds the favorable ones. 
+ Values near zero occur when the number of unfavorable and favorable counterfactual fliptest decisions balance out.
+ Negative values occur when the number of unfavorable counterfactual fliptest decisions for the disfavored facet *d* is less than the favorable ones.
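
The following minimal sketch, not the Clarify implementation, illustrates the idea using scikit-learn nearest neighbors; aggregating the neighbors' predictions by majority vote is an assumption, and `X_d`, `y_d`, `X_a`, `y_a` are hypothetical feature matrices and binary predictions (1 = favorable) for facets *d* and *a*:

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fliptest(X_d, y_d, X_a, y_a, k=5):
    # Find the k nearest facet a neighbors of each facet d member.
    nn = NearestNeighbors(n_neighbors=k).fit(X_a)
    _, idx = nn.kneighbors(X_d)
    # Majority outcome among each member's facet a neighbors (an assumption).
    neighbors_favorable = y_a[idx].mean(axis=1) >= 0.5
    f_plus = np.sum((y_d == 0) & neighbors_favorable)    # F+
    f_minus = np.sum((y_d == 1) & ~neighbors_favorable)  # F-
    return (f_plus - f_minus) / len(y_d)
```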

# Generalized entropy (GE)
<a name="clarify-post-training-bias-metric-ge"></a>

The generalized entropy index (GE) measures the inequality in benefit `b` for the predicted label compared to the observed label. A benefit occurs when a false positive is predicted. A false positive occurs when a negative observation (y=0) has a positive prediction (y'=1). A benefit also occurs when the observed and predicted labels are the same, which is either a true positive or a true negative. No benefit occurs when a false negative is predicted. A false negative occurs when a positive observation (y=1) is predicted to have a negative outcome (y'=0). The benefit `b` is defined as follows.

```
 b = y' - y + 1
```

Using this definition, a false positive receives a benefit `b` of `2`, and a false negative receives a benefit of `0`. Both a true positive and a true negative receive a benefit of `1`.

The GE metric is computed following the [Generalized Entropy Index](https://en.wikipedia.org/wiki/Generalized_entropy_index) (GE) with the weight `alpha` set to `2`. This weight controls the sensitivity to different benefit values. A smaller `alpha` means an increased sensitivity to smaller values.

![\[Equation defining generalized entropy index with alpha parameter set to 2.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify-post-training-bias-metric-ge.png)


The variables used to calculate GE are defined as follows:
+ bi is the benefit received by the `ith` data point.
+ b' is the mean of all benefits.

GE can range from 0 to 0.5, where values of zero indicate no inequality in benefits across all data points. This occurs either when all inputs are correctly predicted or when all the predictions are false positives. GE is undefined when all predictions are false negatives.
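
The following minimal sketch, not part of the SageMaker Clarify API, computes the benefit vector and the generalized entropy index with `alpha` set to `2`:

```
import numpy as np

def generalized_entropy(y_true, y_pred, alpha=2):
    b = y_pred - y_true + 1  # benefit: FP -> 2, TP/TN -> 1, FN -> 0
    mean_b = b.mean()        # zero (GE undefined) only if all predictions are FN
    return np.mean((b / mean_b) ** alpha - 1) / (alpha * (alpha - 1))

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])          # one FP (b=2) and one FN (b=0)
print(generalized_entropy(y_true, y_pred))  # 0.2, within [0, 0.5]
```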

**Note**  
The metric GE does not depend on a facet value being either favored or disfavored.

# Model Explainability
<a name="clarify-model-explainability"></a>

Amazon SageMaker Clarify provides tools to help explain how machine learning (ML) models make predictions. These tools can help ML modelers, developers, and other internal stakeholders understand model characteristics as a whole prior to deployment and debug predictions provided by the model after it's deployed.
+ To obtain explanations for your datasets and models, see [Fairness, model explainability and bias detection with SageMaker Clarify](clarify-configure-processing-jobs.md).
+ To obtain explanations in real-time from a SageMaker AI endpoint, see [Online explainability with SageMaker Clarify](clarify-online-explainability.md).

Transparency about how ML models arrive at their predictions is also critical to consumers and regulators. They need to trust the model predictions if they are going to accept the decisions based on them. SageMaker Clarify uses a model-agnostic feature attribution approach. You can use this to understand why a model made a prediction after training, and to provide per-instance explanations during inference. It includes a scalable and efficient implementation of [SHAP](https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf). This is based on the concept of a Shapley value, from the field of cooperative game theory, that assigns each feature an importance value for a particular prediction.

Clarify produces partial dependence plots (PDPs) that show the marginal effect features have on the predicted outcome of a machine learning model. Partial dependence helps explain the target response for a given set of input features. Clarify also supports both computer vision (CV) and natural language processing (NLP) explainability using the same Shapley values (SHAP) algorithm as is used for tabular data explanations.
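
As a hedged sketch of how a PDP analysis might be requested with the SageMaker Python SDK, assuming illustrative feature names; a config like this can be passed, alone or in a list alongside a `SHAPConfig`, as the `explainability_config` of a Clarify processing job:

```
from sagemaker import clarify

pdp_config = clarify.PDPConfig(
    features=["age", "score"],  # illustrative feature names to plot
    grid_resolution=15,         # number of buckets for numeric feature values
)
```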

What is the function of an explanation in the machine learning context? An explanation can be thought of as the answer to a *Why question* that helps humans understand the cause of a prediction. In the context of an ML model, you might be interested in answering questions such as: 
+ Why did the model predict a negative outcome such as a loan rejection for a given applicant? 
+ How does the model make predictions?
+ Why did the model make an incorrect prediction?
+ Which features have the largest influence on the behavior of the model?

You can use explanations for auditing and meeting regulatory requirements, building trust in the model and supporting human decision-making, and debugging and improving model performance.

The kind of explanation needed is driven by the demands for human understanding about the nature and outcomes of ML inference. Research from philosophy and cognitive science disciplines has shown that people care especially about contrastive explanations, or explanations of why an event X happened instead of some other event Y that did not occur. Here, X could be an unexpected or surprising event that happened and Y corresponds to an expectation based on their existing mental model, referred to as a *baseline*. Note that for the same event X, different people might seek different explanations depending on their point of view or mental model Y. In the context of explainable AI, you can think of X as the example being explained and Y as a baseline that is typically chosen to represent an uninformative or average example in the dataset. Sometimes, for example in the case of ML modeling of images, the baseline might be implicit, where an image whose pixels are all the same color can serve as a baseline.

## Sample Notebooks
<a name="clarify-model-explainability-sample-notebooks"></a>

Amazon SageMaker Clarify provides the following sample notebook for model explainability:
+ [Amazon SageMaker Clarify Processing](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/index.html#sagemaker-clarify-processing) – Use SageMaker Clarify to create a processing job for detecting bias and explaining model predictions with feature attributions. Examples include using CSV and JSON Lines data formats, bringing your own container, and running processing jobs with Spark.
+ [Explaining Image Classification with SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-clarify/computer_vision/image_classification/explainability_image_classification.ipynb) – SageMaker Clarify provides you with insights into how your computer vision models classify images.
+ [Explaining object detection models with SageMaker Clarify ](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/computer_vision/object_detection/object_detection_clarify.ipynb) – SageMaker Clarify provides you with insights into how your computer vision models detect objects.

These notebooks have been verified to run in Amazon SageMaker Studio only. If you need instructions on how to open a notebook in Amazon SageMaker Studio, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md). If you're prompted to choose a kernel, choose **Python 3 (Data Science)**.

**Topics**
+ [Sample Notebooks](#clarify-model-explainability-sample-notebooks)
+ [Feature Attributions that Use Shapley Values](clarify-shapley-values.md)
+ [Asymmetric Shapley Values](clarify-feature-attribute-shap-asymm.md)
+ [SHAP Baselines for Explainability](clarify-feature-attribute-shap-baselines.md)

# Feature Attributions that Use Shapley Values
<a name="clarify-shapley-values"></a>

SageMaker Clarify provides feature attributions based on the concept of [Shapley value](https://en.wikipedia.org/wiki/Shapley_value). You can use Shapley values to determine the contribution that each feature made to model predictions. These attributions can be provided for specific predictions and at a global level for the model as a whole. For example, if you used an ML model for college admissions, the explanations could help determine whether the GPA or the SAT score was the feature most responsible for the model’s predictions, and then you can determine how responsible each feature was for determining an admission decision about a particular student.

SageMaker Clarify has taken the concept of Shapley values from game theory and deployed it in a machine learning context. The Shapley value provides a way to quantify the contribution of each player to a game, and hence the means to distribute the total gain generated by a game to its players based on their contributions. In this machine learning context, SageMaker Clarify treats the prediction of the model on a given instance as the *game* and the features included in the model as the *players*. For a first approximation, you might be tempted to determine the marginal contribution or effect of each feature by quantifying the result of either *dropping* that feature from the model or *dropping* all other features from the model. However, this approach does not take into account that features included in a model are often not independent from each other. For example, if two features are highly correlated, dropping either one of the features might not alter the model prediction significantly. 

To address these potential dependencies, the Shapley value requires that the outcome of each possible combination (or coalition) of features must be considered to determine the importance of each feature. Given *d* features, there are 2^*d* such possible feature combinations, each corresponding to a potential model. To determine the attribution for a given feature *f*, consider the marginal contribution of including *f* in all feature combinations (and associated models) that do not contain *f*, and take the average. It can be shown that the Shapley value is the unique way of assigning the contribution or importance of each feature that satisfies certain desirable properties. In particular, the sum of the Shapley values of each feature corresponds to the difference between the predictions of the model and a dummy model with no features. However, even for reasonable values of *d*, say 50 features, it is computationally prohibitive and impractical to train 2^*d* possible models. As a result, SageMaker Clarify needs to make use of various approximation techniques. For this purpose, SageMaker Clarify uses Shapley Additive exPlanations (SHAP), which incorporates these approximations and provides a scalable and efficient implementation of the Kernel SHAP algorithm through additional optimizations.

For additional information on Shapley values, see [A Unified Approach to Interpreting Model Predictions](https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf).
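
The following is a hedged sketch of how a Kernel SHAP analysis can be configured as a Clarify processing job with the SageMaker Python SDK; the role ARN, S3 paths, headers, baseline row, and model name are all placeholders:

```
from sagemaker import clarify

# Placeholder role ARN and instance settings.
processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/clarify-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

shap_config = clarify.SHAPConfig(
    baseline=[[35, 0.5, 1]],  # an uninformative baseline row (see SHAP Baselines)
    num_samples=100,          # number of synthetic samples for Kernel SHAP
    agg_method="mean_abs",    # aggregate local attributions into global ones
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://amzn-s3-demo-bucket/train.csv",
    s3_output_path="s3://amzn-s3-demo-bucket/clarify-output",
    label="target",
    headers=["target", "age", "score", "class"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="my-model",    # a deployed SageMaker AI model (placeholder)
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```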

# Asymmetric Shapley Values
<a name="clarify-feature-attribute-shap-asymm"></a>

The SageMaker Clarify time series forecasting model explanation solution is a feature attribution method rooted in [cooperative game theory](https://en.wikipedia.org/wiki/Cooperative_game_theory), similar in spirit to SHAP. Specifically, Clarify uses [random order group values](http://www.library.fa.ru/files/Roth2.pdf#page=121), also known as [asymmetric Shapley values](https://proceedings.neurips.cc/paper/2020/file/0d770c496aa3da6d2c3f2bd19e7b9d6b-Paper.pdf) in machine learning and explainability.

## Background
<a name="clarify-feature-attribute-shap-asymm-setting"></a>

The goal is to compute attributions for input features to a given forecasting model *f*. The forecasting model takes the following inputs:
+ Past time series *(target TS)*. For example, this could be past daily train passengers on the Paris-Berlin route, denoted by *xt*.
+ (Optional) A covariate time series. For example, this could be festivities and weather data, denoted by *zt* ∈ R^S. When used, covariate TS could be available only for the past time steps or also for the future ones (as in a festivity calendar).
+ (Optional) Static covariates, such as quality of service (like 1st or 2nd class), denoted by *u* ∈ R^E.

Static covariates, dynamic covariates, or both can be omitted, depending on the specific application scenario. Given a prediction horizon K ≥ 0 (for example, K = 30 days), the model prediction can be characterized by the formula: *f(x[1:T], z[1:T+K], u) = x[T+1:T+K+1]*.

The following diagram shows a dependency structure for a typical forecasting model. The prediction at time *t+1* depends on the three types of inputs previously mentioned.

![\[Dependency structure for a typical forecasting model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/clarify-forecast-dependency.png)


## Method
<a name="clarify-feature-attribute-shap-asymm-explan"></a>

Explanations are computed by querying the time series model *f* on a series of points derived from the original input. Following game-theoretic constructions, Clarify averages differences in predictions caused by iteratively obfuscating (that is, setting to a baseline value) parts of the inputs. The temporal structure can be navigated in chronological order, anti-chronological order, or both. Chronological explanations are built by iteratively adding information starting from the first time step; anti-chronological explanations start from the last step. The latter mode may be more appropriate in the presence of recency bias, such as when forecasting stock prices. One important property of the computed explanations is that they sum to the original model output when the model provides deterministic outputs.

## Resulting attributions
<a name="clarify-feature-attribute-shap-asymm-attr"></a>

Resulting attributions are scores that mark individual contributions of specific time steps or input features toward the final forecast at each forecasted time step. Clarify offers the following two granularities for explanations:
+ Timewise explanations are inexpensive and provide information about specific time steps only, such as how much the information from the 19th day in the past contributed to the forecast of the 1st day in the future. These attributions do not explain static covariates individually, and they aggregate the explanations of the target and covariate time series. The attributions form a matrix *A* where each *Atk* is the attribution of time step *t* toward the forecast of time step *T+k*. Note that if the model accepts future covariates, *t* can be greater than *T*.
+ Fine-grained explanations are more computationally intensive and provide a full breakdown of all attributions of the input variables.
**Note**  
Fine-grained explanations only support chronological order.

  The resulting attributions are a triplet composed of the following:
  + Matrix *Ax* ∈ R^(T×K) related to the input time series, where *Axtk* is the attribution of *xt* toward forecasting step *T+k*
  + Tensor *Az* ∈ R^((T+K)×S×K) related to the covariate time series, where *Aztsk* is the attribution of *zts* (that is, the *s*th covariate TS) toward forecasting step *T+k*
  + Matrix *Au* ∈ R^(E×K) related to the static covariates, where *Auek* is the attribution of *ue* (the *e*th static covariate) toward forecasting step *T+k*

Regardless of the granularity, the explanation also contains an offset vector *B* ∈ R^K that represents the “basic behavior” of the model when all data is obfuscated.
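
The following minimal numpy sketch, purely illustrative and not a Clarify output format, shows the shapes of the fine-grained attributions described above and the additivity property for a deterministic model:

```
import numpy as np

T, K, S, E = 24, 7, 3, 2        # past steps, horizon, covariate TS, static covariates
A_x = np.zeros((T, K))          # target TS attributions: A_x[t, k] -> step T+k
A_z = np.zeros((T + K, S, K))   # covariate TS attributions (future covariates allowed)
A_u = np.zeros((E, K))          # static covariate attributions
B = np.zeros(K)                 # offset: model behavior with all inputs obfuscated

# For a deterministic model, attributions are additive: summing all
# attributions for forecast step k plus the offset recovers the forecast.
k = 0
forecast_k = A_x[:, k].sum() + A_z[:, :, k].sum() + A_u[:, k].sum() + B[k]
```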

# SHAP Baselines for Explainability
<a name="clarify-feature-attribute-shap-baselines"></a>

Explanations are typically contrastive (that is, they account for deviations from a baseline). As a result, for the same model prediction, you can expect to get different explanations with respect to different baselines. Therefore, your choice of a baseline is crucial. In an ML context, the baseline corresponds to a hypothetical instance that can be either *uninformative* or *informative*. During the computation of Shapley values, SageMaker Clarify generates several new instances between the baseline and the given instance, in which the absence of a feature is modeled by setting the feature value to that of the baseline, and the presence of a feature is modeled by setting the feature value to that of the given instance. Thus, the absence of all features corresponds to the baseline and the presence of all features corresponds to the given instance.

How can you choose good baselines? Often it is desirable to select a baseline with very low information content. For example, you can construct an average instance from the training dataset by taking either the median or average for numerical features and the mode for categorical features. For the college admissions example, you might be interested in explaining why a particular applicant was accepted compared with a baseline based on an average applicant. If a baseline is not provided, SageMaker Clarify calculates one automatically using K-means or K-prototypes on the input dataset.

Alternatively, you can choose to generate explanations with respect to informative baselines. For the college admissions scenario, you might want to explain why a particular applicant was rejected when compared with other applicants from similar demographic backgrounds. In this case, you can choose a baseline that represents the applicants of interest, namely those from a similar demographic background. Thus, you can use informative baselines to concentrate the analysis on the specific aspects of a particular model prediction. You can isolate the features for assessment by setting demographic attributes and other features that you can't act on to the same value as in the given instance.
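
The following is a minimal sketch of constructing such an uninformative baseline row from a hypothetical training DataFrame, using medians for numeric features and modes for categorical ones; this is one plausible recipe, not the automatic procedure Clarify uses:

```
import pandas as pd

def uninformative_baseline(df: pd.DataFrame) -> list:
    row = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            row.append(df[col].median())   # central value for numeric features
        else:
            row.append(df[col].mode()[0])  # most frequent category
    return row

# Example: baseline = [uninformative_baseline(train_df)]
# The nested list form can be passed to SHAPConfig(baseline=...).
```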