

# Evaluate the performance of an Amazon Bedrock model
<a name="evaluation"></a>

With Amazon Bedrock in SageMaker Unified Studio, you can use automatic model evaluations to quickly evaluate the performance and effectiveness of Amazon Bedrock foundation models. To evaluate a model, you create an evaluation job. Model evaluation jobs support common use cases for large language models (LLMs), such as text generation, text classification, question answering, and text summarization. The results of a model evaluation job let you compare model outputs and choose the model best suited for your needs. You can view performance metrics, such as the semantic robustness of a model. Automatic evaluations produce calculated scores and metrics that help you assess the effectiveness of a model. 

Amazon Bedrock in SageMaker Unified Studio doesn't support human-based evaluations. For more information, see [Model evaluation jobs](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html) in the *Amazon Bedrock user guide*.

**Important**  
In Amazon Bedrock in SageMaker Unified Studio, you can view the model evaluation jobs in your project. However, the Amazon Bedrock API allows users to list all model evaluation jobs in the AWS account that hosts the project. We don't recommend including sensitive information in model evaluation job metadata.   
If you delete an Amazon SageMaker Unified Studio project, or if your admin deletes your domain, your model evaluation jobs are not automatically deleted. If you don't delete your jobs before the project or domain is deleted, you will need to use the Amazon Bedrock console to delete the jobs. Contact your administrator if you don't have access to the Amazon Bedrock console. 

This section shows you how to create and manage model evaluation jobs, and the kinds of performance metrics you can use. This section also describes the available built-in datasets and how to specify your own dataset.

**Topics**
+ [Create a model evaluation job with Amazon Bedrock](model-evaluation-jobs-management-create.md)
+ [Model evaluation task types in Amazon Bedrock](model-evaluation-tasks.md)
+ [Use prompt datasets for model evaluation in Amazon Bedrock](model-evaluation-prompt-datasets.md)
+ [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md)

# Create a model evaluation job with Amazon Bedrock
<a name="model-evaluation-jobs-management-create"></a>

When you create a model evaluation job, you specify the model, task type, and prompt dataset that you want the job to use. You also specify the metrics that you want the job to calculate.

To create a model evaluation job, you must have access to an Amazon Bedrock model that supports model evaluation. For more information, see [Model support by feature](https://docs.aws.amazon.com/bedrock/latest/userguide/models-features.html) in the *Amazon Bedrock user guide*. If you don't have access to a suitable model, contact your administrator. 

Model evaluation supports the following task types that assess different aspects of the model's performance:
+ **[General text generation](model-evaluation-tasks-general-text.md)** – the model performs natural language processing and text generation tasks.
+ **[Text summarization](model-evaluation-tasks-text-summary.md)** – the model summarizes text based on the prompts you provide.
+ **[Question and answer](model-evaluation-tasks-question-answer.md)** – the model provides answers based on your prompts.
+ **[Text classification](model-evaluation-text-classification.md)** – the model categorizes text into predefined classes based on the input dataset.

To perform a model evaluation for a task type, Amazon Bedrock in SageMaker Unified Studio needs an input dataset that contains prompts. The job uses the dataset for inference during evaluation. You can use a [built-in](model-evaluation-prompt-datasets-builtin.md) dataset that Amazon Bedrock in SageMaker Unified Studio supplies, or you can supply your own [custom](model-evaluation-prompt-datasets-custom.md) prompt dataset. When you supply your own dataset, Amazon Bedrock in SageMaker Unified Studio uploads the dataset to an Amazon S3 bucket that it manages. You can get the location from the Amazon S3 section of your project's **Data Store**. You can also use a custom dataset that you have previously uploaded to the Data Store. 

You can choose from the following metrics that you want the model evaluation job to calculate. 
+ **Toxicity** – The presence of harmful, abusive, or undesirable content generated by the model. 
+ **Accuracy** – The model's ability to generate outputs that are factually correct, coherent, and aligned with the intended task or query. 
+ **Robustness** – The model's ability to maintain consistent and reliable performance in the face of various types of challenges or perturbations.

How the model evaluation job applies the metrics depends on the task type that you choose. For more information, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md).

You can tag model evaluation jobs for purposes such as tracking costs. Amazon Bedrock in SageMaker Unified Studio automatically prefixes the tags that you add with *ProjectUserTag*. To view the tags that you add, use the tag editor in the AWS Resource Groups console. For more information, see [What is Tag Editor?](https://docs.aws.amazon.com/tag-editor/latest/userguide/gettingstarted.html) in the *AWS Resource Management Documentation*.

You can set the inference parameters for the model evaluation job. You can change *Max tokens*, *temperature*, and *Top P* inference parameters. Models might support other parameters that you can change. For more information, see [Inference request parameters and response fields for foundation models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters.html) in the *Amazon Bedrock user guide*.

**To create an automatic model evaluation job**

1. Navigate to the Amazon SageMaker Unified Studio landing page by using the URL from your administrator.

1. Access Amazon SageMaker Unified Studio using your IAM or single sign-on (SSO) credentials. For more information, see [Access Amazon SageMaker Unified Studio](getting-started-access-the-portal.md).

1. If you want to use a new project, do the following:

   1. Choose the current project at the top of the page. If a project isn't already open, choose **Select a project**.

   1. Select **Create project**. 

   1. Follow the instructions at [Create a new project](create-new-project.md). For the **Project profile** in step 1, choose **Generative AI application development**.

1. If the project that you want to use isn't already open, do the following:

   1. Choose the current project at the top of the page. If a project isn't already open, choose **Select a project**.

   1. Select **Browse all projects**. 

   1. In **Projects** select the project that you want to use.

1. At the top of the page, select **Build**. 

1. In the **MACHINE LEARNING & GENERATIVE AI** section, under **AI OPS**, choose **Model evaluations**. 

1. Choose **Create evaluation** to open the **Create evaluation** page and start step 1 (specify details).

1. For **Evaluation job name**, enter a name for the evaluation job. This name is shown in your model evaluation job list. 

1. (Optional) For **Description**, enter a description.

1. (Optional) For **Tags**, add tags that you want to attach to the model evaluation job. 

1. Choose **Next** to start step 2 (set up evaluation).

1. In **Model selector**, select a model by selecting the **Model provider** and then the **Model**. 

1. (Optional) To change the inference configuration, choose **update** to open the **Inference configurations** pane.

1. In **Task type**, choose the type of task you want the model evaluation job to perform. For information about the available task types, see [Model evaluation task types in Amazon Bedrock](model-evaluation-tasks.md).

1. For the task type, choose the metrics that you want the evaluation job to collect. For information about available metrics, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md). 

1. For each metric, select the dataset that you want to use in **Choose an evaluation dataset**.
   + To use a [built-in](model-evaluation-prompt-datasets-builtin.md) dataset, choose **Built in datasets** and choose the metrics that you want to use.
   + To upload a [custom dataset](model-evaluation-prompt-datasets-custom.md), choose **Upload a dataset to S3** and upload the dataset file. 
   + To use an existing custom dataset, choose **Choose a dataset from S3** and select the previously uploaded custom dataset. 

1. Choose **Next** to start step 3 (review and submit).

1. Check that the evaluation job details are correct.

1. Choose **Submit** to start the model evaluation job.

1. Wait until the model evaluation job finishes. The job is complete when its status is **Success** on the model evaluations page.

1. Next step: [Review](model-evaluation-report.md) the results of the model evaluation job.

If you decide to stop the model evaluation job, open the model evaluations page, choose the model evaluation job, and choose **Stop**. To delete the evaluation, choose **Delete**.

# Model evaluation task types in Amazon Bedrock
<a name="model-evaluation-tasks"></a>

In a model evaluation job, an evaluation task type is a task you want the model to perform based on information in your prompts. You can choose one task type per model evaluation job.

The following table summarizes the available task types for automatic model evaluations, built-in datasets, and relevant metrics for each task type.


**Available built-in datasets for automatic model evaluation jobs in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-tasks.html)

**Topics**
+ [General text generation for model evaluation in Amazon Bedrock](model-evaluation-tasks-general-text.md)
+ [Text summarization for model evaluation in Amazon Bedrock](model-evaluation-tasks-text-summary.md)
+ [Question and answer for model evaluation in Amazon Bedrock](model-evaluation-tasks-question-answer.md)
+ [Text classification for model evaluation in Amazon Bedrock](model-evaluation-text-classification.md)

# General text generation for model evaluation in Amazon Bedrock
<a name="model-evaluation-tasks-general-text"></a>

General text generation is a task used by applications that include chatbots. The responses generated by a model to general questions are influenced by the correctness, relevance, and bias contained in the text used to train the model.

**Important**  
For general text generation, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.

The following built-in datasets contain prompts that are well-suited for use in general text generation tasks.

**Bias in Open-ended Language Generation Dataset (BOLD)**  
The Bias in Open-ended Language Generation Dataset (BOLD) is a dataset that evaluates fairness in general text generation, focusing on five domains: profession, gender, race, religious ideologies, and political ideologies. It contains 23,679 different text generation prompts.

**RealToxicityPrompts**  
RealToxicityPrompts is a dataset that evaluates toxicity. It attempts to get the model to generate racist, sexist, or otherwise toxic language. This dataset contains 100,000 different text generation prompts.

**T-Rex: A Large Scale Alignment of Natural Language with Knowledge Base Triples (TREX)**  
TREX is a dataset consisting of Knowledge Base Triples (KBTs) extracted from Wikipedia. KBTs are a type of data structure used in natural language processing (NLP) and knowledge representation. They consist of a subject, predicate, and object, where the subject and object are linked by a relation. An example of a Knowledge Base Triple (KBT) is "George Washington was the president of the United States". The subject is "George Washington", the predicate is "was the president of", and the object is "the United States".
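
The subject, predicate, and object structure described above maps naturally onto a simple data structure. The following sketch is illustrative only; the `Triple` named tuple is a hypothetical representation, not part of the TREX distribution.

```python
from collections import namedtuple

# An illustrative representation of a Knowledge Base Triple (KBT):
# a subject and an object linked by a predicate (the relation).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

kbt = Triple(
    subject="George Washington",
    predicate="was the president of",
    object="the United States",
)
```

Joining the three fields with spaces reproduces the example sentence from the text.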

**WikiText2**  
WikiText2 is a HuggingFace dataset that contains prompts used in general text generation.

The following table summarizes the metrics calculated, and recommended built-in dataset that are available for automatic model evaluation jobs. 


**Available built-in datasets for general text generation in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-tasks-general-text.html)

To learn more about how the computed metric for each built-in dataset is calculated, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md).

# Text summarization for model evaluation in Amazon Bedrock
<a name="model-evaluation-tasks-text-summary"></a>

Text summarization is used for tasks including creating summaries of news, legal documents, academic papers, content previews, and content curation. The ambiguity, coherence, bias, and fluency of the text used to train the model as well as information loss, accuracy, relevance, or context mismatch can influence the quality of responses.

**Important**  
For text summarization, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.

The following built-in dataset is supported for use with the text summarization task type.

**Gigaword**  
The Gigaword dataset consists of news article headlines. This dataset is used in text summarization tasks.

The following table summarizes the metrics calculated, and recommended built-in dataset. 


**Available built-in datasets for text summarization in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-tasks-text-summary.html)

To learn more about how the computed metric for each built-in dataset is calculated, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md).

# Question and answer for model evaluation in Amazon Bedrock
<a name="model-evaluation-tasks-question-answer"></a>

Question and answer is used for tasks including generating automatic help-desk responses, information retrieval, and e-learning. If the text used to train the foundation model contains issues including incomplete or inaccurate data, sarcasm or irony, the quality of responses can deteriorate.

**Important**  
For question and answer, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.

The following built-in datasets are recommended for use with the question and answer task type.

**BoolQ**  
BoolQ is a dataset consisting of yes/no question and answer pairs. The prompt contains a short passage, and then a question about the passage. This dataset is recommended for use with the question and answer task type.

**Natural Questions**  
Natural questions is a dataset consisting of real user questions submitted to Google search.

**TriviaQA**  
TriviaQA is a dataset that contains over 650K question-answer-evidence-triples. This dataset is used in question and answer tasks.

The following table summarizes the metrics calculated, and recommended built-in dataset. 


**Available built-in datasets for the question and answer task type in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-tasks-question-answer.html)

To learn more about how the computed metric for each built-in dataset is calculated, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md).

# Text classification for model evaluation in Amazon Bedrock
<a name="model-evaluation-text-classification"></a>

Text classification is used to categorize text into pre-defined categories. Applications that use text classification include content recommendation, spam detection, language identification and trend analysis on social media. Imbalanced classes, ambiguous data, noisy data, and bias in labeling are some issues that can cause errors in text classification.

**Important**  
For text classification, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.

The following built-in datasets are recommended for use with the text classification task type.

**Women's E-Commerce Clothing Reviews**  
Women's E-Commerce Clothing Reviews is a dataset that contains clothing reviews written by customers. This dataset is used in text classification tasks. 

The following table summarizes the metrics calculated, and recommended built-in datasets. 




**Available built-in datasets in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-text-classification.html)

To learn more about how the computed metric for each built-in dataset is calculated, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md).

# Use prompt datasets for model evaluation in Amazon Bedrock
<a name="model-evaluation-prompt-datasets"></a>

To create a model evaluation job, you must specify a prompt dataset that the model uses during inference. Amazon Bedrock in SageMaker Unified Studio provides built-in datasets that you can use in automatic model evaluations, or you can bring your own prompt dataset. 

Use the following sections to learn more about available built-in prompt datasets and creating your custom prompt datasets.

To learn more about creating your first model evaluation job in Amazon Bedrock, see [Create a model evaluation job with Amazon Bedrock](model-evaluation-jobs-management-create.md).

**Topics**
+ [Use built-in prompt datasets for automatic model evaluation in Amazon Bedrock](model-evaluation-prompt-datasets-builtin.md)
+ [Use custom prompt dataset for model evaluation in Amazon Bedrock in SageMaker Unified Studio](model-evaluation-prompt-datasets-custom.md)

# Use built-in prompt datasets for automatic model evaluation in Amazon Bedrock
<a name="model-evaluation-prompt-datasets-builtin"></a>

Amazon Bedrock provides multiple built-in prompt datasets that you can use in an automatic model evaluation job. Each built-in dataset is based on an open-source dataset. We have randomly downsampled each open-source dataset to include only 100 prompts.
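
Random downsampling of this kind can be sketched as follows. This is illustrative only; `sample_prompts` is a hypothetical helper, not part of any AWS API, and Amazon Bedrock's own sampling procedure is not published here.

```python
import random

# Illustrative only: draw a fixed-size random subset of prompts, the way a
# large open-source dataset might be downsampled to 100 prompts.
def sample_prompts(prompts, k=100, seed=None):
    rng = random.Random(seed)
    if len(prompts) <= k:
        return list(prompts)
    return rng.sample(prompts, k)
```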

When you create an automatic model evaluation job and choose a **Task type**, Amazon Bedrock provides you with a list of recommended metrics. For each metric, Amazon Bedrock also provides recommended built-in datasets. To learn more about available task types, see [Model evaluation task types in Amazon Bedrock](model-evaluation-tasks.md).

**Bias in Open-ended Language Generation Dataset (BOLD)**  
The Bias in Open-ended Language Generation Dataset (BOLD) is a dataset that evaluates fairness in general text generation, focusing on five domains: profession, gender, race, religious ideologies, and political ideologies. It contains 23,679 different text generation prompts.

**RealToxicityPrompts**  
RealToxicityPrompts is a dataset that evaluates toxicity. It attempts to get the model to generate racist, sexist, or otherwise toxic language. This dataset contains 100,000 different text generation prompts.

**T-Rex: A Large Scale Alignment of Natural Language with Knowledge Base Triples (TREX)**  
TREX is a dataset consisting of Knowledge Base Triples (KBTs) extracted from Wikipedia. KBTs are a type of data structure used in natural language processing (NLP) and knowledge representation. They consist of a subject, predicate, and object, where the subject and object are linked by a relation. An example of a Knowledge Base Triple (KBT) is "George Washington was the president of the United States". The subject is "George Washington", the predicate is "was the president of", and the object is "the United States".

**WikiText2**  
WikiText2 is a HuggingFace dataset that contains prompts used in general text generation.

**Gigaword**  
The Gigaword dataset consists of news article headlines. This dataset is used in text summarization tasks.

**BoolQ**  
BoolQ is a dataset consisting of yes/no question and answer pairs. The prompt contains a short passage, and then a question about the passage. This dataset is recommended for use with question and answer task type.

**Natural Questions**  
Natural Questions is a dataset consisting of real user questions submitted to Google search.

**TriviaQA**  
TriviaQA is a dataset that contains over 650K question-answer-evidence-triples. This dataset is used in question and answer tasks.

**Women's E-Commerce Clothing Reviews**  
Women's E-Commerce Clothing Reviews is a dataset that contains clothing reviews written by customers. This dataset is used in text classification tasks. 

In the following table, you can see the list of available datasets grouped by task type. To learn more about how automatic metrics are computed, see [Review a model evaluation job in Amazon Bedrock](model-evaluation-report.md). 


**Available built-in datasets for automatic model evaluation jobs in Amazon Bedrock**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/model-evaluation-prompt-datasets-builtin.html)

To learn more about the requirements for creating and examples of custom prompt datasets, see [Use custom prompt dataset for model evaluation in Amazon Bedrock in SageMaker Unified Studio](model-evaluation-prompt-datasets-custom.md).

# Use custom prompt dataset for model evaluation in Amazon Bedrock in SageMaker Unified Studio
<a name="model-evaluation-prompt-datasets-custom"></a>

In a model evaluation job, you can use a custom prompt dataset for each metric that you select. Custom datasets use the JSON Lines format (`.jsonl`), and each line must be a valid JSON object. Your dataset can contain up to 1,000 prompts per automatic evaluation job.

You must use the following keys in a custom dataset.
+ `prompt` – Required. Indicates the input for the following tasks:
  + The prompt that your model should respond to, in general text generation.
  + The question that your model should answer, in the question and answer task type.
  + The text that your model should summarize, in text summarization tasks.
  + The text that your model should classify, in classification tasks.
+ `referenceResponse` – Required. Indicates the ground truth response against which your model is evaluated, for the following task types:
  + The answer for all prompts in question and answer tasks.
  + The answer for all accuracy and robustness evaluations.
+ `category` – (Optional) Generates evaluation scores reported for each category. 

As an example, accuracy requires both the question to ask and the answer to check the model response against. In this example, use the `prompt` key with the question as its value, and the `referenceResponse` key with the answer as its value, as follows.

```
{
    "prompt": "Bobigny is the capital of",
    "referenceResponse": "Seine-Saint-Denis",
    "category": "Capitals"
}
```

The previous example is a single line of a JSON Lines input file that is sent to your model as an inference request. The model is invoked for every such record in your JSON Lines dataset. The following example input is for a question and answer task that uses the optional `category` key for evaluation.

```
{"prompt":"Aurillac is the capital of", "category":"Capitals", "referenceResponse":"Cantal"}
{"prompt":"Bamiyan city is the capital of", "category":"Capitals", "referenceResponse":"Bamiyan Province"}
{"prompt":"Sokhumi is the capital of", "category":"Capitals", "referenceResponse":"Abkhazia"}
```
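
Before uploading a custom dataset, you might check it against the rules above: each line must be valid JSON, the `prompt` key is required, and an automatic evaluation job accepts at most 1,000 prompts. The following sketch is illustrative; `validate_prompt_dataset` is a hypothetical helper, not part of any AWS tooling, and it does not check task-specific requirements such as when `referenceResponse` is needed.

```python
import json

# Illustrative pre-upload check for a custom prompt dataset (.jsonl).
# Returns a list of human-readable error strings; empty means the file
# passed these basic checks.
def validate_prompt_dataset(path, max_prompts=1000):
    errors = []
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    if len(lines) > max_prompts:
        errors.append(f"{len(lines)} prompts exceeds the {max_prompts} limit")
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {i}: not a JSON object")
            continue
        if "prompt" not in record:
            errors.append(f"line {i}: missing required 'prompt' key")
        unknown = set(record) - {"prompt", "referenceResponse", "category"}
        if unknown:
            errors.append(f"line {i}: unexpected keys {sorted(unknown)}")
    return errors
```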

# Review a model evaluation job in Amazon Bedrock
<a name="model-evaluation-report"></a>

The results of a model evaluation job are presented in a report that includes key metrics to help you assess the model's performance and effectiveness. In your model evaluation report, you see an evaluation summary and sections for each of the metrics that you chose for the evaluation job. 

**Topics**
+ [Viewing a model evaluation report](#model-evaluation-report-procedure)
+ [Understanding a model evaluation report](#model-evaluation-report-understanding)

## Viewing a model evaluation report
<a name="model-evaluation-report-procedure"></a>

**To view a model evaluation report**

1. Navigate to the Amazon SageMaker Unified Studio landing page by using the URL from your administrator.

1. Access Amazon SageMaker Unified Studio using your IAM or single sign-on (SSO) credentials. For more information, see [Access Amazon SageMaker Unified Studio](getting-started-access-the-portal.md).

1. If the project that you want to use isn't already open, do the following:

   1. Choose the current project at the top of the page. If a project isn't already open, choose **Select a project**.

   1. Select **Browse all projects**. 

   1. In **Projects** select the project that you want to use.

1. Choose the **Build** menu option at the top of the page.

1. In **MACHINE LEARNING & GENERATIVE AI** choose **My apps**.

1. From the navigation pane, choose **Build** and then **Model evaluations**. 

1. In the **Model evaluation jobs** table choose the name of the model evaluation job you want to review. The model evaluation card opens.

## Understanding a model evaluation report
<a name="model-evaluation-report-understanding"></a>

In the **Evaluation summary**, you can see the task type and task metrics that the evaluation job calculated.

For each metric, the report contains the dataset, the calculated metric value for the dataset, the total number of prompts in the dataset, and how many of those prompts received responses. How the metric value is calculated changes based on the task type and the metrics you selected.

In all semantic robustness related metrics, Amazon Bedrock in SageMaker Unified Studio perturbs prompts in the following ways: converting text to all lowercase, introducing keyboard typos, converting numbers to words, making random changes to uppercase, and randomly adding or deleting whitespace.

**How each available metric is calculated when applied to the general text generation task type**
+ **Accuracy**: For this metric, the value is calculated using the real world knowledge score (RWK score). The RWK score examines the model’s ability to encode factual knowledge about the real world. A high RWK score indicates that your model is accurate.
+ **Robustness**: For this metric, the value is calculated using semantic robustness, which is calculated using word error rate. Semantic robustness measures how much the model output changes as a result of minor, semantics-preserving perturbations in the input. Robustness to such perturbations is a desirable property, so a low semantic robustness score indicates that your model is performing well.

  The perturbation types considered are: converting text to all lowercase, introducing keyboard typos, converting numbers to words, making random changes to uppercase, and randomly adding or deleting whitespace. Each prompt in your dataset is perturbed approximately 5 times. Each perturbed prompt is then sent for inference, and the responses are used to calculate robustness scores automatically.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
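
The perturb-and-compare procedure for semantic robustness can be sketched as follows. Both helpers are illustrative approximations, not Amazon Bedrock's actual implementation: `perturb` applies two of the perturbation types named above, and `word_error_rate` is the standard word-level edit distance between two outputs, normalized by the reference length.

```python
import random

# Illustrative perturbations of the kinds described above: lowercasing the
# text, or randomly inserting a whitespace character.
def perturb(prompt, rng):
    choice = rng.choice(["lowercase", "whitespace"])
    if choice == "lowercase":
        return prompt.lower()
    i = rng.randrange(len(prompt) + 1)
    return prompt[:i] + " " + prompt[i:]

# Word error rate: word-level edit distance between a reference output and a
# hypothesis output, divided by the reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

In an evaluation, the model's response to each perturbed prompt would be compared against its response to the original prompt; identical responses give a word error rate of 0.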

**How each available metric is calculated when applied to the text summarization task type**
+ **Accuracy**: For this metric, the value is calculated using BERT Score. BERT Score is calculated using pre-trained contextual embeddings from BERT models. It matches words in candidate and reference sentences by cosine similarity.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (Delta BERTScore / BERTScore) x 100. Delta BERTScore is the difference in BERT Scores between a perturbed prompt and the original prompt in your dataset. Each prompt in your dataset is perturbed approximately 5 times. Each perturbed prompt is then sent for inference, and the responses are used to calculate robustness scores automatically. A lower score indicates the selected model is more robust.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
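
The percentage formula above generalizes across task types: take the difference between the original and perturbed scores, divide by the original score, and multiply by 100. A minimal sketch, with `robustness_percent` as an illustrative name:

```python
# Sketch of the robustness percentage: (delta score / original score) x 100.
# A lower value means the perturbations changed the score less, that is,
# the model is more robust.
def robustness_percent(original_score, perturbed_score):
    delta = abs(original_score - perturbed_score)
    return (delta / original_score) * 100
```

For example, if a BERT Score of 0.8 on the original prompts drops to 0.72 on the perturbed prompts, the robustness value is 10 percent.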

**How each available metric is calculated when applied to the question and answer task type**
+ **Accuracy**: For this metric, the value calculated is the F1 score. The F1 score is the harmonic mean of the precision score (the ratio of correct predictions to all predictions) and the recall score (the ratio of correct predictions to all relevant instances). The F1 score ranges from 0 to 1, with higher values indicating better performance.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (Delta F1 / F1) x 100. Delta F1 is the difference in F1 scores between a perturbed prompt and the original prompt in your dataset. Each prompt in your dataset is perturbed approximately 5 times. Each perturbed prompt is then sent for inference, and the responses are used to calculate robustness scores automatically. A lower score indicates the selected model is more robust.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
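
The F1 score is conventionally computed as the harmonic mean of precision and recall; a minimal sketch:

```python
# F1 as the harmonic mean of precision and recall; ranges from 0 to 1,
# with higher values indicating better performance.
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A model with precision 0.5 and recall 1.0 gets an F1 of about 0.67, reflecting that both quantities must be high for a high F1.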

**How each available metric is calculated when applied to the text classification task type**
+ **Accuracy**: For this metric, the value calculated is accuracy. Accuracy is a score that compares the predicted class to its ground truth label. A higher accuracy indicates that your model is correctly classifying text based on the ground truth label provided.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (delta classification accuracy score / classification accuracy score) x 100. Delta classification accuracy score is the difference between the classification accuracy score of the perturbed prompt and the original input prompt. Each prompt in your dataset is perturbed approximately 5 times. Each perturbed prompt is then sent for inference, and the responses are used to calculate robustness scores automatically. A lower score indicates the selected model is more robust.

In the **Job configuration summary**, you can see the model and the inference parameters that the job used.