

# Model evaluation job submission
<a name="model-customize-open-weight-evaluation"></a>

This section describes evaluation for open-weight custom models. It walks you through the evaluation job submission process to get you started, and provides additional resources for more advanced evaluation job submission use cases.

**Topics**
+ [Getting Started](model-customize-evaluation-getting-started.md)
+ [Evaluation types and Job Submission](model-customize-evaluation-types.md)
+ [Evaluation Metrics Formats](model-customize-evaluation-metrics-formats.md)
+ [Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks](model-customize-evaluation-dataset-formats.md)
+ [Evaluate with Preset and Custom Scorers](model-customize-evaluation-preset-custom-scorers.md)

# Getting Started
<a name="model-customize-evaluation-getting-started"></a>

## Submit an Evaluation Job Through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to Evaluation From Your Model Card
<a name="model-customize-evaluation-studio-step1"></a>

After you customize your model, navigate to the evaluation page from your model card.

For information on open-weight custom model training, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html)

SageMaker visualizes your customized model on the My Models tab:

![\[Registered model card page\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, then choose Evaluate:

![\[Model customization page\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit Your Evaluation Job
<a name="model-customize-evaluation-studio-step2"></a>

Choose Submit to submit your evaluation job. By default, this submits a minimal MMLU benchmark job.

For information on the supported evaluation job types, see [Evaluation types and Job Submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track Your Evaluation Job Progress
<a name="model-customize-evaluation-studio-step3"></a>

Your evaluation job progress is tracked in the Evaluation steps tab:

![\[Your evaluation job progress\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View Your Evaluation Job Results
<a name="model-customize-evaluation-studio-step4"></a>

Your evaluation job results are visualized in the Evaluation results tab:

![\[Your evaluation job metrics\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View Your Completed Evaluations
<a name="model-customize-evaluation-studio-step5"></a>

Your completed evaluation job is displayed in Evaluations of your model card:

![\[Your completed evaluation jobs\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit Your Evaluation Job Through SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create Your BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Initialize `BenchMarkEvaluator` with your registered trained model, your AWS S3 output location, and your MLFlow resource ARN.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark  
  
evaluator = BenchMarkEvaluator(  
    benchmark=Benchmark.MMLU,  
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",  
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",  
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",  
    evaluate_base_model=False  
)
```

### Step 2: Submit Your Evaluation Job
<a name="model-customize-evaluation-sdk-step2"></a>

Call the `evaluate()` method to submit the evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track Your Evaluation Job Progress
<a name="model-customize-evaluation-sdk-step3"></a>

Call the `wait()` method of the execution to get a live update of the evaluation job progress.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View Your Evaluation Job Results
<a name="model-customize-evaluation-sdk-step4"></a>

Call the `show_results()` method to display your evaluation job results.

```
execution.show_results()
```

# Evaluation types and Job Submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the Benchmark Evaluation type to evaluate the quality of your model across standardized benchmark datasets including popular datasets like MMLU and BBH.


| Benchmark | Custom Dataset Supported | Modalities | Description | Metrics | Strategy | Subtask available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | Graduate-Level Google-Proof Question Answering – Assesses comprehension of graduate-level science concepts and related problem-solving abilities. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information on BYOD formats, see [Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks](model-customize-evaluation-dataset-formats.md).

### Available Subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the available subtasks for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), StrongReject, and MATH. These subtasks let you assess your model's performance on specific capabilities and knowledge areas.

**MMLU Subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH Subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**Math Subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject Subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench",
    "gcg_transfer_universal_attacks",
    "combination_3",
    "combination_2",
    "few_shot_json",
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title",
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only",
    "evil_confidant",
    "aim",
    "rot_13",
    "disemvowel",
    "auto_obfuscation",
    "auto_payload_splitting",
    "pair",
    "pap_authority_endorsement",
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement",
    "pap_logical_appeal",
    "pap_misrepresentation"
]
```
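
When you restrict an evaluation to specific subtasks, it helps to validate the requested names against the published lists above before submitting a job. The following is a minimal, hypothetical sketch; the helper function is not part of the SageMaker SDK:

```python
# Hypothetical helper -- not part of the SageMaker SDK. It only checks
# that requested subtask names appear in the published lists above.
MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry",
    "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
]

def validate_subtasks(requested, available):
    """Return the unknown subtask names, empty if all are valid."""
    return sorted(set(requested) - set(available))

# "topology" is not a MATH subtask, so it is reported back.
unknown = validate_subtasks(["algebra", "topology"], MATH_SUBTASKS)
```

Catching an invalid subtask name locally avoids waiting for a remote job submission to fail.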

### Submit your benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[A minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information on evaluation job submission through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

## Large Language Model as a Judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to have another frontier model grade your target model's responses. You can use AWS Bedrock models as judges by calling the `create_evaluation_job` API to launch the evaluation job.

For more information on the supported judge models, see [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)

You can use two different metric formats to define the evaluation:
+ **Built-in metrics:** Use AWS Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)
+ **Custom metrics:** Define your own metrics in the Bedrock Evaluation custom metric format to analyze the quality of your model's inference responses using your own instructions. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

### Submit a builtin metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[A minimal configuration for LLMAJ benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information on evaluation job submission through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

### Submit a custom metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric(s):

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```
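
Before submitting, you can sanity-check a custom metric definition for the fields Bedrock expects: a name, instructions containing the `{{prompt}}` and `{{prediction}}` placeholders, and a `ratingScale` whose entries pair a `definition` with a `value`. A minimal sketch; the checks are illustrative, not an official validator:

```python
def check_custom_metric(metric):
    """Collect obvious problems in a customMetricDefinition dict."""
    problems = []
    definition = metric.get("customMetricDefinition", {})
    if not definition.get("name"):
        problems.append("missing name")
    instructions = definition.get("instructions", "")
    for placeholder in ("{{prompt}}", "{{prediction}}"):
        if placeholder not in instructions:
            problems.append(f"instructions missing {placeholder}")
    for entry in definition.get("ratingScale", []):
        if "definition" not in entry or "value" not in entry:
            problems.append("ratingScale entry missing definition/value")
    return problems

metric = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "Rate the response.\nPrompt: {{prompt}}\nResponse: {{prediction}}",
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}},
        ],
    }
}
# A well-formed definition produces an empty problem list.
```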

For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

------
#### [ SageMaker Studio ]

![\[Upload the custom metric via Custom metrics > Add custom metrics\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

## Custom Scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers: Prime math and Prime code. You can also bring your own scorer function, either by providing your scorer function code directly or by referencing your own Lambda function definition using the associated ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information on built-in and custom scorers and their respective requirements/contracts, see [Evaluate with Preset and Custom Scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Bring your own dataset for a custom scorer by registering it as a SageMaker Hub Content Dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset using the dedicated Datasets page.

![\[Registered evaluation dataset in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset using the `DataSet` class.

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Select from Code executions or Math answers for Built-In custom scoring\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Select from `BuiltInMetric.PRIME_MATH` or `BuiltInMetric.PRIME_CODE` for Built-In Scoring.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define a custom reward function. For more information, see [Custom Scorers (Bring Your Own Metrics)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom).

**Register the custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigating to SageMaker Studio > Assets > Evaluator > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit the Custom Scorer evaluation job referencing the registered preset reward function in Custom Scorer > Custom metrics\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)
```

```
from sagemaker.train.evaluate import CustomScorerEvaluator

custom_scorer_evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = custom_scorer_evaluator.evaluate()
```

------

# Evaluation Metrics Formats
<a name="model-customize-evaluation-metrics-formats"></a>

SageMaker reports the quality of your model in the following metric formats:
+ Model Evaluation Summary
+ MLFlow
+ TensorBoard

## Model Evaluation Summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit your evaluation job, you specify an AWS S3 output location. SageMaker automatically uploads the evaluation summary `.json` file to that location. The benchmark summary S3 path is the following:

```
s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```
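
The summary prefix above can be assembled from its components. A small sketch of the path construction; the bucket and job names here are placeholders:

```python
def eval_results_prefix(s3_output_path, training_job_name, evaluation_job_name):
    """Build the S3 prefix where the evaluation summary .json lands."""
    base = s3_output_path.rstrip("/")
    return (f"{base}/{training_job_name}/output/output/"
            f"{evaluation_job_name}/eval_results/")

prefix = eval_results_prefix(
    "s3://my-bucket/eval/", "my-training-job", "my-eval-job")
# -> "s3://my-bucket/eval/my-training-job/output/output/my-eval-job/eval_results/"
```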

**Pass the AWS S3 location**

------
#### [ SageMaker Studio ]

![\[Pass into output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the results directly as `.json` from the AWS S3 location, or view them visualized automatically in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
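
Once downloaded, the summary can be reduced to a readable report. A minimal sketch that pairs each metric with its standard error from the `all` block:

```python
import json

def summarize(results_json, section="all"):
    """Format 'metric: value ± stderr' lines from an eval_results summary."""
    metrics = json.loads(results_json)["results"][section]
    lines = []
    for name, value in metrics.items():
        if name.endswith("_stderr"):
            continue  # stderr values are merged into their metric's line
        stderr = metrics.get(f"{name}_stderr")
        lines.append(f"{name}: {value:.4f} ± {stderr:.4f}"
                     if stderr is not None else f"{name}: {value:.4f}")
    return lines

sample = '{"results": {"all": {"f1": 0.8429, "f1_stderr": 0.0052}}}'
# -> ["f1: 0.8429 ± 0.0052"]
```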

![\[Sample performance metrics for custom gen-qa benchmark visualized in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLFlow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide your SageMaker MLFlow resource ARN**

SageMaker Studio uses the default MLFlow app that is provisioned on each Studio domain the first time you use the model customization capability, and passes that app's associated ARN in the evaluation job submission.

You can also provide an MLFlow resource ARN explicitly when you submit your evaluation job, to stream metrics to the associated tracking server/app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Model level and system level metric visualization:

![\[Sample model level error and accuracy for MMLU benchmarking task\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Sample built-in metrics for LLMAJ benchmarking task\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Sample system level metrics for MMLU benchmarking task\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Submit your evaluation job with an AWS S3 output location. SageMaker automatically uploads a TensorBoard file to the following location:

```
s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pass the AWS S3 location as follows**

------
#### [ SageMaker Studio ]

![\[Pass into output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Sample model level metrics**

![\[SageMaker TensorBoard displaying results of a benchmarking job\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks
<a name="model-customize-evaluation-dataset-formats"></a>

The Custom Scorer and LLM-as-judge evaluation types require a custom dataset JSONL file located in AWS S3. You must provide the file as a JSON Lines file that adheres to one of the following supported formats. The examples in this section are expanded across multiple lines for clarity.

Each format has its own nuances, but at a minimum all require a user prompt.


**Required Fields**  

| Field | Required | 
| --- | --- | 
| User Prompt | Yes | 
| System Prompt | No | 
| Ground truth | Only for Custom Scorer | 
| Category | No | 
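
The requirements in the table above can be checked mechanically once a dataset entry has been reduced to its fields. A hypothetical sketch, assuming the entry has already been normalized into a simple dict of fields:

```python
def check_entry(entry, task_type):
    """Verify a normalized dataset entry against the required fields.

    `entry` is a dict with optional keys: user_prompt, system_prompt,
    ground_truth, category. Ground truth is only required for the
    Custom Scorer task type.
    """
    missing = []
    if not entry.get("user_prompt"):
        missing.append("user_prompt")
    if task_type == "custom_scorer" and not entry.get("ground_truth"):
        missing.append("ground_truth")
    return missing

entry = {"user_prompt": "What ends a question?", "ground_truth": "?"}
# Valid for both LLM-as-judge and Custom Scorer tasks.
```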

**1. OpenAI Format**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```

**2. SageMaker Evaluation**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace Prompt Completion**

Both Standard and Conversational formats are supported.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace Preference**

Both the standard format (strings) and the conversational format (message arrays) are supported.

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```
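
The four formats above all map onto the same (system prompt, query, ground truth) triple. A rough sketch of that mapping, assuming single-turn conversational entries:

```python
def normalize(entry):
    """Map an OpenAI / SageMaker / HuggingFace entry to a common triple."""
    system, query, truth = None, None, None
    if "messages" in entry:  # OpenAI format
        for msg in entry["messages"]:
            if msg["role"] == "system":
                system = msg["content"]
            elif msg["role"] == "user":
                query = msg["content"]
            elif msg["role"] == "assistant":
                truth = msg["content"]
    elif "query" in entry:  # SageMaker Evaluation format
        system, query, truth = entry.get("system"), entry["query"], entry.get("response")
    elif "prompt" in entry:  # HuggingFace prompt-completion / preference
        prompt = entry["prompt"]
        query = prompt if isinstance(prompt, str) else prompt[0]["content"]
        answer = entry.get("completion", entry.get("chosen"))
        if answer is not None:
            truth = answer if isinstance(answer, str) else answer[0]["content"]
    return system, query, truth

normalize({"query": "The sky is", "response": "blue"})
# -> (None, "The sky is", "blue")
```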

**5. Verl Format**

The Verl format (both current and legacy) is supported for reinforcement learning use cases. For reference, see the Verl docs: [https://verl.readthedocs.io/en/latest/preparation/prepare_data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

Users of the Verl format typically do not provide a ground truth response. If you want to provide one anyway, use either the `extra_info.answer` field or the `reward_model.ground_truth` field; `extra_info.answer` takes precedence.
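
That precedence rule can be expressed directly. A small sketch, assuming the fields are laid out as in the Verl examples below:

```python
def verl_ground_truth(entry):
    """Return the ground truth, preferring extra_info.answer."""
    answer = entry.get("extra_info", {}).get("answer")
    if answer is not None:
        return answer
    return entry.get("reward_model", {}).get("ground_truth")

verl_ground_truth({"reward_model": {"ground_truth": "72"}})
# -> "72" (used only because extra_info.answer is absent)
```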

SageMaker preserves the following VERL-specific fields as metadata if present:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

# Evaluate with Preset and Custom Scorers
<a name="model-customize-evaluation-preset-custom-scorers"></a>

When you use the Custom Scorer evaluation type, SageMaker Evaluation supports two built-in scorers (also referred to as "reward functions"), Prime Math and Prime Code, taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library. You can also use your own custom scorer implemented as a Lambda function.

## Built-in Scorers
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

The prime math scorer expects a custom JSONL dataset of entries containing a math question as the prompt/query and the correct answer as ground truth. The dataset can be any of the supported formats mentioned in [Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks](model-customize-evaluation-dataset-formats.md).

Example dataset entry (expanded for clarity):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```
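Datasets like this are plain JSONL: one compact JSON object per line. As an illustrative sketch (the file name and entry values are examples only), you can generate one with the Python standard library:

```python
import json

# Illustrative: write a Prime Math dataset as JSONL, one object per line.
# Field names follow the example entry above.
entries = [
    {
        "system": "You are a math expert: ",
        "query": "How many vertical asymptotes does the graph of "
                 "$y=\\frac{2}{x^2+x-6}$ have?",
        "response": "2",  # ground truth, aka the correct answer
    },
]

with open("prime_math_dataset.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Each line round-trips back to the original entry.
with open("prime_math_dataset.jsonl") as f:
    assert [json.loads(line) for line in f] == entries
```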

**Prime Code**

The prime code scorer expects a custom JSONL dataset of entries containing a coding problem and test cases specified in the `metadata` field. Structure the test cases with the expected function name for each entry, sample inputs, and expected outputs.

Example dataset entry (expanded for clarity):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    # Define test cases in the metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```
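To make the relationship between `fn_name`, `inputs`, and `outputs` concrete, the sketch below replays the comparison for the metadata above using a hand-written solution. This illustrates the test-case semantics only; it is not the actual Prime Code harness, which extracts and sandboxes the model's generated code:

```python
import ast

# Hand-written reference solution matching fn_name in the metadata above.
def factorialNumbers(n: int) -> list:
    result, value, k = [], 1, 1
    while value <= n:
        result.append(value)
        k += 1
        value *= k
    return result

metadata = {
    "fn_name": "factorialNumbers",
    "inputs": ["5"],
    "outputs": ["[1, 2]"],
}

# For each test case: parse the raw input, call the function, and compare
# against the parsed expected output.
for raw_in, raw_out in zip(metadata["inputs"], metadata["outputs"]):
    actual = factorialNumbers(int(raw_in))
    expected = ast.literal_eval(raw_out)
    assert actual == expected
```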

## Custom Scorers (Bring Your Own Metrics)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Fully customize your model evaluation workflow with custom post-processing logic that computes metrics tailored to your needs. Implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Sample Lambda Input Payload
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

Your custom AWS Lambda function receives inputs in the OpenAI format. Example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Sample Lambda Output Payload
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

The SageMaker evaluation container expects your Lambda responses to follow this format:

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### Custom Lambda Definition
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

Find an example of a fully-implemented custom scorer with sample input and expected output at: [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example)

Use the following skeleton as a starting point for your own function.

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```
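As a minimal sketch of how the skeleton might be completed, the function below grades each sample by checking whether the assistant's response agrees with `reference_answer.compliant`. The scoring rule and the metric name `compliance_match` are illustrative assumptions, not required behavior:

```python
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list) -> list:
    results = []
    for sample in samples:
        # The last assistant turn is the model response being graded.
        response = next(
            (m["content"] for m in reversed(sample["messages"])
             if m["role"] == "assistant"),
            "",
        )
        reference = sample.get("reference_answer", {})
        # Illustrative reward: 1.0 when the reference compliance label
        # appears in the response, else 0.0.
        expected = reference.get("compliant", "")
        matched = expected and expected.lower() in response.lower()
        score = 1.0 if matched else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "compliance_match", "value": score, "type": "Reward"},
            ],
        })
    return results
```

Because the handler is plain Python, you can unit test it locally by passing a list of sample dictionaries before deploying it to Lambda.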

### Input and output fields
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**Input fields**


| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Unique identifier for the sample | Echoed back in output. String format | 
| messages | Ordered chat history in OpenAI format | Array of message objects | 
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" | 
| messages[].content | Text content of the message | Plain string | 
| metadata | Free-form information to aid grading | Object; optional fields passed from training data | 

**Output fields**

| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Same identifier as input sample | Must match input | 
| aggregate\_reward\_score | Overall score for the sample | Float (e.g., 0.0–1.0 or task-defined range) | 
| metrics\_list | Component scores that make up the aggregate | Array of metric objects | 

### Required Permissions
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

Ensure that the SageMaker execution role you use to run evaluation has permission to invoke your AWS Lambda function.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

Also ensure that your AWS Lambda function's execution role has basic Lambda execution permissions, plus any additional permissions required for downstream AWS calls.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```