

# Review model evaluation job reports and metrics in Amazon Bedrock
<a name="model-evaluation-report"></a>

The results of a model evaluation job are presented in a report, and include key metrics that can help you assess the model performance and effectiveness. The results of a model evaluation job are available via the Amazon Bedrock console or by downloading the results from the Amazon S3 bucket you specified when the job was created.

Once your job status has changed to **Ready**, you can open the report. To do so, go to the **Model evaluations** table on the **Model evaluation** home page and choose the name of your job.

Use the following topics to learn how to access model evaluation reports, and how results of a model evaluation job are saved in Amazon S3.

**Topics**
+ [Review metrics for an automated model evaluation job in Amazon Bedrock (console)](model-evaluation-report-programmatic.md)
+ [Review a human-based model evaluation job in Amazon Bedrock (console)](model-evaluation-report-human-customer.md)
+ [Understand how the results of your model evaluation job are saved in Amazon S3](model-evaluation-report-s3.md)

# Review metrics for an automated model evaluation job in Amazon Bedrock (console)
<a name="model-evaluation-report-programmatic"></a>

You can review the metrics presented in a report for an automatic model evaluation job using the Amazon Bedrock console.

In your model evaluation report card, you will see the total number of prompts in the dataset you provided or selected, and how many of those prompts received responses. If the number of responses is less than the number of input prompts, check the data output file in your Amazon S3 bucket. It is possible that a prompt caused an error with the model and no inference was retrieved. Only responses from the model are used in metric calculations.

Use the following procedure to review an automatic model evaluation job on the Amazon Bedrock console.

1. Open the Amazon Bedrock console.

1. From the navigation pane, choose **Model evaluation**.

1. Next, in the **Model evaluations** table, find the name of the automated model evaluation job you want to review, and then choose it.

For all semantic robustness related metrics, Amazon Bedrock perturbs prompts in the following ways: converting text to all lowercase, introducing keyboard typos, converting numbers to words, randomly changing letters to uppercase, and randomly adding or deleting whitespace.

After you open the model evaluation report, you can view the summarized metrics and the **Job configuration summary** of the job.

For each metric and prompt dataset specified when the job was created, you see a card with a value for each dataset specified for that metric. How this value is calculated depends on the task type and the metrics you selected.

**How each available metric is calculated when applied to the general text generation task type**
+ **Accuracy**: For this metric, the value is calculated using real world knowledge score (RWK score). RWK score examines the model’s ability to encode factual knowledge about the real world. A high RWK score indicates that your model is accurate.
+ **Robustness**: For this metric, the value is calculated using semantic robustness, which is calculated using word error rate. Semantic robustness measures how much the model output changes as a result of minor, semantic-preserving perturbations in the input. Robustness to such perturbations is a desirable property, so a low semantic robustness score indicates that your model is performing well.

  The perturbation types considered are: converting text to all lowercase, keyboard typos, converting numbers to words, random changes to uppercase, and random addition or deletion of whitespace. Each prompt in your dataset is perturbed approximately 5 times. Then, each perturbed prompt is sent for inference, and its response is used to calculate robustness scores automatically.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
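Word error rate, which underpins the semantic robustness score described above, can be illustrated with a short sketch. This is a minimal illustration, not the exact implementation Amazon Bedrock uses; the example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical original response and a response to a perturbed prompt
original = "two bodies recovered from flooded coal mine"
perturbed = "two bodies recovered from a flooded mine"
print(round(word_error_rate(original, perturbed), 3))  # → 0.286
```

A higher word error rate between the original and perturbed responses means the model's output changed more under perturbation, that is, the model is less robust.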

**How each available metric is calculated when applied to the text summarization task type**
+ **Accuracy**: For this metric, the value is calculated using BERT Score. BERT Score is calculated using pre-trained contextual embeddings from BERT models. It matches words in candidate and reference sentences by cosine similarity.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (Delta BERTScore / BERTScore) x 100. Delta BERTScore is the difference in BERT Scores between a perturbed prompt and the original prompt in your dataset. Each prompt in your dataset is perturbed approximately 5 times. Then, each perturbed prompt is sent for inference, and its response is used to calculate robustness scores automatically. A lower score indicates that the selected model is more robust.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
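The robustness percentage described above can be sketched as follows. This is a minimal illustration of the (Delta BERTScore / BERTScore) x 100 formula using hypothetical scores, not the exact implementation Amazon Bedrock uses.

```python
def robustness_percentage(original_score, perturbed_scores):
    """(Delta score / original score) x 100, averaged over the perturbed prompts.

    Delta is the absolute difference between the original prompt's score and
    each perturbed prompt's score; a lower result means a more robust model.
    """
    deltas = [abs(original_score - s) for s in perturbed_scores]
    mean_delta = sum(deltas) / len(deltas)
    return (mean_delta / original_score) * 100

# Hypothetical BERT Scores for one prompt and its ~5 perturbed variants
print(round(robustness_percentage(0.80, [0.78, 0.79, 0.76, 0.80, 0.77]), 2))  # → 2.5
```

The same percentage structure applies to the question and answer and text classification task types, with F1 score or classification accuracy in place of BERT Score.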

**How each available metric is calculated when applied to the question and answer task type**
+ **Accuracy**: For this metric, the value calculated is F1 score. F1 score is the harmonic mean of the precision score (the ratio of correct predictions to all predictions) and the recall score (the ratio of correct predictions to the total number of relevant predictions). The F1 score ranges from 0 to 1, with higher values indicating better performance.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (Delta F1 / F1) x 100. Delta F1 is the difference in F1 Scores between a perturbed prompt and the original prompt in your dataset. Each prompt in your dataset is perturbed approximately 5 times. Then, each perturbed prompt is sent for inference, and its response is used to calculate robustness scores automatically. A lower score indicates that the selected model is more robust.
+ **Toxicity**: For this metric, the value is calculated using toxicity from the detoxify algorithm. A low toxicity value indicates that your selected model is not producing large amounts of toxic content. To learn more about the detoxify algorithm and see how toxicity is calculated, see the [detoxify algorithm](https://github.com/unitaryai/detoxify) on GitHub.
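The F1 calculation described above can be sketched as follows. The counts are hypothetical, and this is not the exact implementation Amazon Bedrock uses.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall, ranging from 0 to 1."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts from comparing model answers against reference answers
true_positives, false_positives, false_negatives = 8, 2, 4
precision = true_positives / (true_positives + false_positives)  # 0.8
recall = true_positives / (true_positives + false_negatives)     # ~0.667
print(round(f1_score(precision, recall), 3))  # → 0.727
```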

**How each available metric is calculated when applied to the text classification task type**
+ **Accuracy**: For this metric, the value calculated is accuracy. Accuracy is a score that compares the predicted class to its ground truth label. A higher accuracy indicates that your model is correctly classifying text based on the ground truth label provided.
+ **Robustness**: For this metric, the value calculated is a percentage. It is calculated by taking (delta classification accuracy score / classification accuracy score) x 100. Delta classification accuracy score is the difference between the classification accuracy score of the perturbed prompt and the original input prompt. Each prompt in your dataset is perturbed approximately 5 times. Then, each perturbed prompt is sent for inference, and its response is used to calculate robustness scores automatically. A lower score indicates that the selected model is more robust.

# Review a human-based model evaluation job in Amazon Bedrock (console)
<a name="model-evaluation-report-human-customer"></a>

You can review the data for human evaluation presented in a report using the Amazon Bedrock console.

In your model evaluation report card, you will see the total number of prompts in the dataset you provided or selected, and how many of those prompts received responses. If the number of responses is less than the number of input prompts times the number of workers per prompt you configured in the job (either 1, 2, or 3), check the data output file in your Amazon S3 bucket. It is possible that a prompt caused an error with the model and no inference was retrieved. Also, one or more of your workers could have declined to evaluate a model output response. Only responses from the human workers are used in metric calculations.

Use the following procedure to review a model evaluation job that used human workers on the Amazon Bedrock console.

1. Open the Amazon Bedrock console.

1. From the navigation pane, choose **Model evaluation**.

1. Next, in the **Model evaluations** table, find the name of the model evaluation job you want to review, and then choose it.

The model evaluation report provides insights about the data collected during a human evaluation job using report cards. Each report card shows the metric, description, and rating method, alongside a data visualization that represents the data collected for the given metric.

In each of the following sections, you can see an example of the 5 possible rating methods your work team saw in the evaluation UI. The examples also show which key-value pair is used to save the results in Amazon S3.

## Likert scale, comparison of multiple model outputs
<a name="likert-compare"></a>

Human evaluators indicate their preference between two responses from the model on a 5-point Likert scale, in accordance with your instructions. The results in the final report are shown as a histogram of preference-strength ratings from the evaluators over your whole dataset.

Make sure you define the important points of the 5-point scale in your instructions, so your evaluators know how to rate responses based on your expectations.

![\[Likert scale for comparing two model responses.\]](http://docs.aws.amazon.com/bedrock/latest/userguide/images/model-evaluation/likert-compare.png)


**JSON output**  
The first child-key under `evaluationResults` is where the selected rating method is returned. In the output file saved to your Amazon S3 bucket, the results from each worker are saved to the `"evaluationResults": "comparisonLikertScale"` key value pair.

## Choice buttons (radio button)
<a name="choice-buttons"></a>

Choice buttons allow a human evaluator to indicate their preferred response out of two responses. Evaluators indicate their preference between the two responses according to your instructions by using radio buttons. The results in the final report are shown as the percentage of responses that workers preferred from each model. Be sure to explain your evaluation method clearly in the instructions.

![\[Choice buttons for selecting a preferred model response.\]](http://docs.aws.amazon.com/bedrock/latest/userguide/images/model-evaluation/radiobuttons.png)


**JSON output**  
The first child-key under `evaluationResults` is where the selected rating method is returned. In the output file saved to your Amazon S3 bucket, the results from each worker are saved to the `"evaluationResults": "comparisonChoice"` key value pair.

## Ordinal rank
<a name="ordinal-rank"></a>

Ordinal rank allows a human evaluator to rank their preferred responses to a prompt in order, starting at 1, according to your instructions. The results in the final report are shown as a histogram of the rankings from the evaluators over the whole dataset. Be sure to define what a rank of 1 means in your instructions. This data type is called Preference Rank.

![\[Ordinal ranking of model responses.\]](http://docs.aws.amazon.com/bedrock/latest/userguide/images/model-evaluation/ordinal-rank.png)


**JSON output**  
The first child-key under `evaluationResults` is where the selected rating method is returned. In the output file saved to your Amazon S3 bucket, the results from each worker are saved to the `"evaluationResults": "comparisonRank"` key value pair.

## Thumbs up/down
<a name="thump-up-down"></a>

Thumbs up/down allows a human evaluator to rate each response from a model as acceptable or unacceptable according to your instructions. The results in the final report are shown as the percentage of the total number of ratings from evaluators that were thumbs up for each model. You can use this rating method for a model evaluation job that contains one or more models. If you use it in an evaluation that contains two models, a thumbs up/down is presented to your work team for each model response, and the final report shows the aggregated results for each model individually. Be sure to define what is acceptable (that is, what earns a thumbs up rating) in your instructions.

![\[Thumbs up/down rating of a model response.\]](http://docs.aws.amazon.com/bedrock/latest/userguide/images/model-evaluation/thumb-up.png)


**JSON output**  
The first child-key under `evaluationResults` is where the selected rating method is returned. In the output file saved to your Amazon S3 bucket, the results from each worker are saved to the `"evaluationResults": "thumbsUpDown"` key value pair.

## Likert scale, evaluation of a single model response
<a name="likert-single"></a>

This rating method allows a human evaluator to indicate, on a 5-point Likert scale, how strongly they approve of the model's response based on your instructions. The results in the final report are shown as a histogram of the 5-point ratings from the evaluators over your whole dataset. You can use this rating method for an evaluation containing one or more models. If you select it in an evaluation that contains more than one model, a 5-point Likert scale is presented to your work team for each model response, and the final report shows the aggregated results for each model individually. Be sure to define the important points on the 5-point scale in your instructions so your evaluators know how to rate the responses according to your expectations.

![\[Likert scale for rating a single model response.\]](http://docs.aws.amazon.com/bedrock/latest/userguide/images/model-evaluation/model-eval-likert2-final.png)


**JSON output**  
The first child-key under `evaluationResults` is where the selected rating method is returned. In the output file saved to your Amazon S3 bucket, the results from each worker are saved to the `"evaluationResults": "individualLikertScale"` key value pair.

# Understand how the results of your model evaluation job are saved in Amazon S3
<a name="model-evaluation-report-s3"></a>

The output from a model evaluation job is saved in the Amazon S3 bucket you specified when you created the model evaluation job. Results of model evaluation jobs are saved as JSON line files (`.jsonl`).

The results from the model evaluation job are saved in the S3 bucket you specified, as follows.
+ For model evaluation jobs that use human workers:

  `s3://user-specified-S3-output-path/job-name/job-uuid/datasets/dataset-name/file-uuid_output.jsonl` 
+ For automatic model evaluation jobs:

  `s3://user-specified-S3-output-path/job-name/job-uuid/models/model-id/taskTypes/task-type/datasets/dataset/file-uuid_output.jsonl`
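If you want to locate the output file programmatically, a URI can be assembled from the same components. This is a minimal sketch of the automatic-job path pattern above; the bucket name and identifiers are hypothetical.

```python
def automated_output_s3_uri(output_path, job_name, job_uuid, model_id,
                            task_type, dataset, file_uuid):
    """Build the S3 URI where an automatic evaluation job writes its results."""
    return (f"{output_path.rstrip('/')}/{job_name}/{job_uuid}/models/{model_id}"
            f"/taskTypes/{task_type}/datasets/{dataset}/{file_uuid}_output.jsonl")

# Hypothetical bucket, job, and file identifiers
print(automated_output_s3_uri(
    "s3://amzn-s3-demo-bucket/eval-output",
    "my-eval-job", "abc123", "anthropic.claude-v2",
    "Summarization", "my-dataset", "def456"))
```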

For model or RAG evaluation jobs with one or more custom metrics, Amazon Bedrock also stores your metric definitions as JSON files in the output S3 bucket you specify. You can access these files by navigating to `s3://S3-output-bucket-name/job-name/job-uuid/custom_metrics`. To see the format for JSON definitions of custom metrics, see [Creating a JSON file to create a custom metric](model-evaluation-custom-metrics-prompt-formats.md#model-evaluation-custom-metrics-prompt-formats-json).

The following topics describe how the results from automated and human-worker-based model evaluation jobs are saved in Amazon S3.

## Output data from automated model evaluation jobs
<a name="model-evaluation-report-s3-automated"></a>

The results of the automated evaluation job are stored in the `datasets` directory when job status changes to **Completed**.

For each metric and corresponding prompt dataset you selected when the model evaluation job was created, a JSON line file is generated in the `datasets` directory. The file uses the following naming convention: `metric_input-dataset.jsonl`.

Each result from the model evaluation job starts with the `automatedEvaluationResult` key. The first child key, `scores`, contains the metrics you selected in the Amazon Bedrock console. In this example, only a single metric, `Accuracy`, was selected. Each score also contains `result`, which is the calculated value for the selected metric. To learn more about what specific value is calculated, see [Review metrics for an automated model evaluation job in Amazon Bedrock (console)](model-evaluation-report-programmatic.md).

The second key, `inputRecord`, is a copy of what you provided in your input prompt dataset.

The third key, `modelResponses`, contains a list of JSON objects, each of which contains the ARN of the model selected when you created the model evaluation job and the full response from the model based on the prompt provided.

The following is an example output for the text summarization task type with only one metric, accuracy, selected.

```
{
"automatedEvaluationResult": {
    "scores": [{
        "metricName": "Accuracy",
        "result": 0.31920555233955383
    }]
},
"inputRecord": {
    "prompt": "Human: two bodies have been recovered from a flooded coal mine in yiyang city , central china 's henan province , as of #:## thursday morning , rescuers said . Assistant:",
    "referenceResponse": "two bodies recovered from flooded coal mine in c. china"
},
"modelResponses": [{
    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-v2",
    "response": " Here are the key points I gathered from your news summary:\n\n- Two bodies have been recovered from a flooded coal mine in Yiyang City, in China's Henan province.\n\n- This happened as of Thursday morning, according to rescuers.\n\n- The location is central China's Henan province. \n\n- Yiyang City is where the flooded coal mine is located.\n\n- The coal mine was flooded, leading to the two deaths and need for rescue/recovery efforts.\n\n- This news is developing as of Thursday morning, so the death toll and rescue details may still be unfolding.\n\nPlease let me know if I have accurately summarized the key information or if you would like me to expand on any part of it."
}]
}
```
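To aggregate results across a whole output file, you can parse each JSON line and average the `result` values per metric. This is a minimal sketch; the records below are shortened, hypothetical examples of the format shown above.

```python
import json
from collections import defaultdict

def average_scores(jsonl_lines):
    """Average each metric's `result` across all records in an output file."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in jsonl_lines:
        record = json.loads(line)
        for score in record["automatedEvaluationResult"]["scores"]:
            totals[score["metricName"]] += score["result"]
            counts[score["metricName"]] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Two hypothetical, shortened records from a metric_input-dataset.jsonl file
lines = [
    json.dumps({"automatedEvaluationResult": {"scores": [{"metricName": "Accuracy", "result": 0.32}]}}),
    json.dumps({"automatedEvaluationResult": {"scores": [{"metricName": "Accuracy", "result": 0.48}]}}),
]
print(average_scores(lines))
```

In practice you would download the `.jsonl` file from the S3 path described earlier (for example, with the AWS CLI or an S3 client) and pass its lines to a function like this.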

## Output data from model evaluation jobs that use human workers
<a name="model-evaluation-report-s3-hil"></a>

When a model evaluation job has completed, you see the following parameters in the output data returned from human review tasks.



| Parameter | Value Type | Example Values | Description | 
| --- | --- | --- | --- | 
| flowDefinitionArn |  String  | arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name |  The Amazon Resource Name (ARN) of the human review workflow (flow definition) used to create the human loop.   | 
| humanAnswers |  List of JSON objects  |  <pre>"answerContent": {<br />"evaluationResults": {<br />	"thumbsUpDown": [{<br />		"metricName": "Relevance",<br />		"modelResponseId": "0",<br />		"result": false<br />	}]<br />}<br />}</pre>  |  A list of JSON objects that contain worker responses in `answerContent`.   | 
|  `humanLoopName`  | String | system-generated-hash | A system generated 40-character hex string. | 
| inputRecord | JSON object |  <pre>"inputRecord": {<br />"prompt": "What does vitamin C serum do for skin?",<br />"category": "Skincare",<br />"referenceResponse": "Vitamin C serum offers a range of benefits for the skin. Firstly, it acts....<br />}</pre>  | A JSON object that contains an entry prompt from the input dataset.  | 
| modelResponses | List of JSON objects |  <pre>"modelResponses": [{<br />"modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id",<br />"response": "the-models-response-to-the-prompt"<br />}]</pre>  | The individual responses from the models. | 
| inputContent |  Object  | <pre>{<br />     "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json",<br />     "evaluationMetrics":[<br />        {<br />           "description":"testing",<br />           "metricName":"IndividualLikertScale",<br />           "ratingMethod":"IndividualLikertScale"<br />        }<br />     ],<br />     "instructions":"example instructions"<br />  }</pre> |  The human loop input content required to start human loop in your S3 bucket.  | 
| modelResponseIdMap |  Object  | <pre>{<br />     "0":"arn:aws:bedrock:us-west-2::foundation-model/model-id"<br />}</pre> |  `humanAnswers.answerContent.evaluationResults` contains `modelResponseId`s. The `modelResponseIdMap` connects the `modelResponseId` to the model name.  | 

<a name="code-sample-model-eval"></a>The following is an example of output data from a model evaluation job. 

```
{
"humanEvaluationResult": [{
    "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
    "humanAnswers": [{
        "acceptanceTime": "2023-11-09T19:17:43.107Z",
        "answerContent": {
            "evaluationResults": {
                "thumbsUpDown": [{
                    "metricName": "Coherence",
                    "modelResponseId": "0",
                    "result": false
                }, {
                    "metricName": "Accuracy",
                    "modelResponseId": "0",
                    "result": true
                }],
                "individualLikertScale": [{
                    "metricName": "Toxicity",
                    "modelResponseId": "0",
                    "result": 1
                }]
            }
        },
        "submissionTime": "2023-11-09T19:17:52.101Z",
        "timeSpentInSeconds": 8.994,
        "workerId": "444455556666",
        "workerMetadata": {
            "identityData": {
                "identityProviderType": "Cognito",
                "issuer": "https://cognito-idp.AWS Region.amazonaws.com/AWS Region_111222",
                "sub": "c6aa8eb7-9944-42e9-a6b9-"
            }
        }
    }],



				...Additional responses have been truncated for clarity...

 "humanLoopName": "b3b1c64a2166e001e094123456789012",
 "inputContent":{
     "additionalDataS3Uri":"s3://user-specified-S3-output-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json",
     "evaluationMetrics":[
        {
           "description":"testing",
           "metricName":"IndividualLikertScale",
           "ratingMethod":"IndividualLikertScale"
        }
     ],
     "instructions":"some dummy instructions"
  },
  "modelResponseIdMap":{
     "0":"arn:aws:bedrock:AWS Region::foundation-model/model-id"
  }
}],
"inputRecord": {
    "category": "Fitness",
    "prompt": "Human: What is High-Intensity Interval Training ? \\n\\nAssistant:",
    "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods. In HIIT, you perform exercises at near-maximum effort during the high-intensity intervals, which can include activities like sprinting, cycling, jumping, or body weight exercises such as burpees or push-ups. These intense intervals are interspersed with short recovery periods to lower your heart rate before the next high-intensity bout. HIIT workouts usually consist of multiple cycles of these intervals, and the number of cycles and the length of intervals can vary. HIIT is known for its time efficiency, making it an excellent choice for those with busy schedules. It offers both aerobic and anaerobic benefits, improving cardiovascular fitness, burning calories, and promoting fat loss, all while boosting metabolism for continued calorie burning post-workout."
},
"modelResponses": [{
    "modelIdentifier": "arn:aws:bedrock:AWS Region::foundation-model/model-id",
    "response": " High-intensity interval training (HIIT) is a form of exercise that involves short bursts of intense activity alternated with recovery periods. Here are some key features of HIIT:\n\n- Alternates short periods of intense anaerobic exercise with less intense recovery periods. The high-intensity periods are typically done at 80-95% of maximum heart rate.\n\n- The duration of the intense periods can range from 5 seconds to 8 minutes, but typically are in the range of 30 seconds to 4 minutes.\n\n- The total HIIT session typically lasts between 10-30 minutes, including warm-up, recovery periods, and cool down. \n\n- Uses exercises like sprints, cycling, rowing, jumping rope, stair climbing, swimming, etc. Movements that get your heart rate up quickly.\n\n- The short, intense intervals help improve cardiovascular fitness by increasing heart rate and oxygen consumption.\n\n- HIIT burns more calories and fat than steady-state cardio in a shorter time period. It also helps build muscle strength and endurance.\n\n- Considered an efficient and effective form of exercise for fat loss and improving aerobic power. But it requires motivation to push yourself during the intense intervals.\n\n- Not suitable for beginners due to the high-intensity. Start with steady-state cardio and build up endurance before trying HIIT.\n\nIn summary, HIIT intersperses intense bursts of"
}]
}
```
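To aggregate the thumbs up/down ratings from an output file like the one above, you can walk the `humanEvaluationResult` entries. This is a minimal sketch against the structure shown above; the sample record is abbreviated and hypothetical.

```python
def thumbs_up_percentage(records, metric_name):
    """Percentage of worker ratings for `metric_name` that were thumbs up."""
    up = total = 0
    for record in records:
        for result in record["humanEvaluationResult"]:
            for answer in result["humanAnswers"]:
                ratings = answer["answerContent"]["evaluationResults"].get("thumbsUpDown", [])
                for rating in ratings:
                    if rating["metricName"] == metric_name:
                        total += 1
                        up += rating["result"]  # True counts as 1, False as 0
    return 100 * up / total if total else 0.0

# Abbreviated, hypothetical record in the shape of the example output above
record = {"humanEvaluationResult": [{"humanAnswers": [{"answerContent": {
    "evaluationResults": {"thumbsUpDown": [
        {"metricName": "Coherence", "modelResponseId": "0", "result": False},
        {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
    ]}}}]}]}
print(thumbs_up_percentage([record], "Accuracy"))   # → 100.0
print(thumbs_up_percentage([record], "Coherence"))  # → 0.0
```

For a multi-model job, you would additionally group ratings by `modelResponseId` and use `modelResponseIdMap` to attribute them to a model.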

The following table explains how the **Rating method** you selected for each metric in the Amazon Bedrock console is returned in your Amazon S3 bucket. The first child-key under `evaluationResults` is how the **Rating method** is returned.




**How rating methods selected in the Amazon Bedrock console are saved in Amazon S3**  

| Rating method selected | Saved in Amazon S3 | 
| --- | --- | 
| Likert scale - Individual | IndividualLikertScale | 
| Likert scale - Comparison | ComparisonLikertScale | 
| Choice buttons | ComparisonChoice | 
| Ordinal rank | ComparisonRank | 
| Thumbs up/down | ThumbsUpDown | 
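Based on the mapping above (and the lowercase-initial key names shown in the JSON output sections earlier), you can detect which rating methods appear in a worker's `evaluationResults`. This is a minimal sketch; the function name is hypothetical.

```python
# Console rating method -> first child-key under `evaluationResults` in the
# S3 output file (with a lowercase initial letter, as in the JSON examples).
RATING_METHOD_KEYS = {
    "Likert scale - Individual": "individualLikertScale",
    "Likert scale - Comparison": "comparisonLikertScale",
    "Choice buttons": "comparisonChoice",
    "Ordinal rank": "comparisonRank",
    "Thumbs up/down": "thumbsUpDown",
}

def rating_methods_used(evaluation_results):
    """Return the rating-method keys present in one worker's evaluationResults."""
    return [key for key in RATING_METHOD_KEYS.values() if key in evaluation_results]

print(rating_methods_used({"thumbsUpDown": [], "individualLikertScale": []}))
```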