

# Evaluating your SageMaker AI-trained model

The purpose of the evaluation process is to assess trained-model performance against standard benchmarks or custom datasets. The process typically involves creating an evaluation recipe that points to the trained model, specifying the evaluation datasets and metrics, submitting a separate job for the evaluation, and evaluating against standard benchmarks or custom data. The evaluation job outputs performance metrics to your Amazon S3 bucket.

**Note**  
The evaluation process described in this topic is an offline process. The model is tested against fixed benchmarks with predefined answers, rather than being assessed in real time or through live user interactions. For real-time evaluation, you can test the model after it has been deployed to Amazon Bedrock by calling [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/import-with-create-custom-model.html) Runtime APIs.

**Topics**
+ [Prerequisites](#nova-model-evaluation-prerequisites)
+ [Available benchmark tasks](#nova-model-evaluation-benchmark)
+ [Evaluation specific configurations](#nova-model-evaluation-config)
+ [Running evaluation training jobs](#nova-model-evaluation-notebook)
+ [Assessing and analyzing evaluation results](#nova-model-evaluation-assess)
+ [Evaluation best practices and troubleshooting](#nova-model-evaluation-best-practices)
+ [Available subtasks](#nova-model-evaluation-subtasks)
+ [Reasoning model evaluation](nova-reasoning-model-evaluation.md)
+ [RFT evaluation](nova-rft-evaluation.md)
+ [Implementing reward functions](nova-implementing-reward-functions.md)

## Prerequisites
Before you start an evaluation job, note the following.
+ A SageMaker AI-trained Amazon Nova model whose performance you want to evaluate.
+ A base Amazon Nova recipe for evaluation. For more information, see [Getting Amazon Nova recipes](nova-model-recipes.md#nova-model-get-recipes).

## Available benchmark tasks
A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code package, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

The following industry-standard benchmarks are supported. You can specify any of them in the `eval_task` parameter.

**Available benchmarks for model evaluation**


| Benchmark | Modality | Description | Metrics | Strategy | Subtask available | 
| --- | --- | --- | --- | --- | --- | 
| mmlu |  Text  |  Multi-task Language Understanding – Tests knowledge across 57 subjects.  |  accuracy  | zs_cot | Yes | 
| mmlu_pro | Text |  MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering.  | accuracy | zs_cot | No | 
| bbh | Text |  Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills.  | accuracy | fs_cot | Yes | 
| gpqa | Text |  General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities.  | accuracy | zs_cot | No | 
| math | Text |  Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems.  | exact_match | zs_cot | Yes | 
| strong_reject | Text |  Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content.  | deflection | zs | Yes | 
| ifeval | Text |  Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification.  | accuracy | zs | No | 
| gen_qa | Multi-Modal (image) |  Custom Dataset Evaluation – Lets you supply your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. `gen_qa` supports image inference for Amazon Nova Lite or Amazon Nova Pro based models, and also supports a bring-your-own-metrics Lambda function. (For RFT evaluation, use the RFT evaluation recipe.)  | all | gen_qa | No | 
| mmmu | Multi-Modal |  Massive Multidiscipline Multimodal Understanding (MMMU) – College-level benchmark comprising multiple-choice and open-ended questions from 30 disciplines.  | accuracy | zs_cot | Yes | 
| llm_judge | Text |  LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A.  | all | judge | No | 
|  mm_llm_judge  | Multi-Modal (image) |  Behaves the same as the text-based `llm_judge` above; the only difference is that it supports image inference.  | all | judge | No | 
|  rubric_llm_judge  |  Text  |  Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.  |  all  |  judge  |  No  | 
|  aime_2024  |  Text  |  AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving.  |  exact_match  |  zs_cot  |  No  | 
|  calendar_scheduling  | Text |  Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people.  |  exact_match  |  fs  | No | 
|  humaneval  | Text |  HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models.  |  pass@1  | zs | No | 

## Evaluation specific configurations


Below is a breakdown of the key components in the recipe and guidance on how to modify them for your use cases.

### Understanding and modifying your recipes


**General run configuration**

```
run:
  name: eval_job_name 
  model_type: amazon.nova-2-lite-v1:0:256k 
  model_name_or_path: nova-lite-2/prod # or s3://escrow_bucket/model_location
  replicas: 1 
  data_s3_path: ""
  mlflow_tracking_uri: "" 
  mlflow_experiment_name : "" 
  mlflow_run_name : ""
```
+ `name`: A descriptive name for your evaluation job.
+ `model_type`: Specifies the Nova model variant to use. Do not manually modify this field. Options include:
  + amazon.nova-micro-v1:0:128k
  + amazon.nova-lite-v1:0:300k
  + amazon.nova-pro-v1:0:300k
  + amazon.nova-2-lite-v1:0:256k
+ `model_name_or_path`: The path to the base model or s3 path for post trained checkpoint. Options include:
  + nova-micro/prod
  + nova-lite/prod
  + nova-pro/prod
  + nova-lite-2/prod
  + S3 path for a post-trained checkpoint (`s3://customer-escrow-111122223333-smtj-<unique_id>/<training_run_name>`)
**Note**  
**Evaluate post-trained model**  
To evaluate a post-trained model after a Nova SFT training job, check the training logs after a successful run. At the end of the logs, you will see the message "Training is complete". You will also find a `manifest.json` file containing the location of your checkpoint; it is packaged inside the `output.tar.gz` file at your output S3 location. To proceed with evaluation, set this checkpoint as the value for `run.model_name_or_path` in your recipe configuration.
+ `replicas`: The number of compute instances to use for distributed inference (running inference across multiple nodes). Set `replicas` > 1 to enable multi-node inference, which accelerates evaluation. If both `instance_count` and `replicas` are specified, `instance_count` takes precedence. Multiple replicas apply only to SageMaker AI training jobs. 
+ `data_s3_path`: The input dataset Amazon S3 path. This field is required but should always be left empty.
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SMHP)
+ `mlflow_experiment_name`: (Optional) Name of the experiment to group related ML runs together
+ `mlflow_run_name`: (Optional) Custom name for a specific training run within an experiment
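If you script the post-trained-checkpoint workflow described in the note above, the checkpoint location can be read out of the `output.tar.gz` archive programmatically. The sketch below is illustrative only: it assumes a hypothetical `checkpoint_location` key in `manifest.json`, so confirm the actual key names in your own file before relying on it.

```python
import io
import json
import tarfile

def read_manifest_from_output(tar_bytes: bytes) -> dict:
    """Extract and parse manifest.json from an output.tar.gz archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("manifest.json"):
                return json.load(tar.extractfile(member))
    raise FileNotFoundError("manifest.json not found in archive")

def checkpoint_path(manifest: dict) -> str:
    # "checkpoint_location" is a hypothetical key name used for illustration;
    # inspect your own manifest.json for the real field.
    return manifest["checkpoint_location"]
```

The returned S3 path is what you would paste into `run.model_name_or_path`.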

**Evaluation configuration**

```
evaluation:
  task: mmlu 
  strategy: zs_cot 
  subtask: abstract_algebra
  metric: accuracy
```
+ `task`: Specifies the evaluation benchmark or task to use. Supported tasks include:
  + `mmlu`
  + `mmlu_pro`
  + `bbh`
  + `gpqa`
  + `math`
  + `strong_reject`
  + `gen_qa`
  + `ifeval`
  + `mmmu`
  + `llm_judge`
  + `mm_llm_judge`
  + `rubric_llm_judge`
  + `aime_2024`
  + `calendar_scheduling`
  + `humaneval`
+ `strategy`: Defines the evaluation approach.
  + `zs_cot`: Zero-shot Chain of Thought - an approach to prompt large language models that encourages step-by-step reasoning without requiring explicit examples.
  + `fs_cot`: Few-shot Chain of Thought - an approach that provides a few examples of step-by-step reasoning before asking the model to solve a new problem.
  + `zs`: Zero-shot - an approach to solve a problem without any prior training examples.
  + `gen_qa`: Strategy specific to the bring-your-own-dataset (`gen_qa`) benchmark.
  + `judge`: Strategy specific to the Nova LLM-as-a-Judge benchmarks, including `mm_llm_judge`.
+ `subtask`: Optional. Specific components of the evaluation task. For a complete list of available subtasks, see [Available subtasks](#nova-model-evaluation-subtasks).
  + Check which benchmarks support subtasks in the available benchmark tasks table.
  + Remove this field if the benchmark has no subtasks.
+ `metric`: The evaluation metric to use.
  + `accuracy`: Percentage of correct answers.
  + `exact_match`: For math benchmark, returns the rate at which the input predicted strings exactly match their references.
  + `deflection`: For strong reject benchmark, returns relative deflection to base model and difference significance metrics.
  + `all`:

    For `gen_qa`, the bring-your-own-dataset benchmark, returns the following metrics:
    + `rouge1`: Measures overlap of unigrams (single words) between generated and reference text.
    + `rouge2`: Measures overlap of bigrams (two consecutive words) between generated and reference text.
    + `rougeL`: Measures longest common subsequence between texts, allowing for gaps in the matching.
    + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
    + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
    + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
    + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
    + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.

    For `llm_judge` and `mm_llm_judge`, the bring-your-own-dataset judge benchmarks, returns the following metrics:
    + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
    + `a_scores_stderr`: Standard error of `a_scores` across pairwise judgements.
    + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
    + `b_scores_stderr`: Standard error of `b_scores` across pairwise judgements.
    + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
    + `ties_stderr`: Standard error of `ties` across pairwise judgements.
    + `inference_error`: Count of judgements that could not be properly evaluated.
    + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
    + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
    + `inference_error_stderr`: Standard error of `inference_error` across pairwise judgements.
    + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using Bradley-Terry probability.
    + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
    + `upper_rate`: Upper bound (97.5th percentile) of the estimated win rate from bootstrap sampling.
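As a rough illustration of how the lexical `gen_qa` metrics behave, the sketch below implements simplified versions of `quasi_exact_match` and `f1_score`. This is a minimal sketch under the assumption of lowercase/punctuation/whitespace normalization; the exact normalization used by the evaluation harness may differ.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def quasi_exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall over the token overlap."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `quasi_exact_match("The Answer.", "the answer")` scores 1.0 even though the raw strings differ in case and punctuation.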

**Inference configuration (optional)**

```
inference:
  max_new_tokens: 2048 
  top_k: -1 
  top_p: 1.0 
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: Maximum number of tokens to generate. Must be an integer. (Unavailable for LLM Judge)
+ `top_k`: Number of the highest probability tokens to consider. Must be an integer.
+ `top_p`: Cumulative probability threshold for token sampling. Must be a float between 0.0 and 1.0.
+ `temperature`: Randomness in token selection (higher is more random); keep it at 0 to make results deterministic. Float type, minimum value is 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are `null` (the default if not set; disables reasoning), `low`, or `high`.

### Log Probability Output Format


When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
"Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
"Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
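Given that structure, the most likely candidate at a token position can be recovered with a few lines of Python. This is a minimal sketch over the dictionary shown above, not the full parquet-reading workflow.

```python
def top_candidate(logprobs: dict) -> tuple[str, float]:
    """Return the decoded string and log probability of the most likely candidate token."""
    token = max(logprobs, key=lambda t: logprobs[t]["logprob_value"])
    entry = logprobs[token]
    return entry["decoded_value"], entry["logprob_value"]
```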

### Evaluation recipe examples


Amazon Nova provides four different types of evaluation recipes. All recipes are available in [SageMaker HyperPod recipes GitHub repository](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).

**Topics**

#### General text benchmark recipes


These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. 

Recipe format: `xxx_general_text_benchmark_eval.yaml`.

#### Bring your own dataset benchmark recipes


These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. 

Recipe format: `xxx_bring_your_own_dataset_eval.yaml`.

**Bring your own dataset requirements**

File format: 
+ A single `gen_qa.jsonl` file containing evaluation examples. The file name must be exactly `gen_qa.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the general Q&A dataset.

Schema format requirements - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `query`: String containing the question or instruction that needs an answer.

  `response`: String containing the expected model output.
+ Optional fields.

  `system`: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query.

  `images`: Array containing a list of objects with data attributes (Base64 encoded image strings).

  `metadata`: String containing metadata associated with the entry for tagging purposes.

**Example entry**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist who provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail and follow instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
{
  "system": "Image inference: ",
  "query": "What is the number in the image? Please just use one English word to answer.",
  "response": "two",
  "images": [
    {
      "data": "data:image/png;Base64,iVBORw0KGgoA ..."
    }
  ]
}
```
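For the `images` field, each entry wraps a Base64 data URI. A minimal way to build one from raw image bytes (assuming PNG content; swap the media type for JPEG) is:

```python
import base64

def image_entry(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes as the Base64 data-URI object the images array expects."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"data": f"data:{media_type};Base64,{encoded}"}
```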

To use your custom dataset, modify your evaluation recipe by adding the following required fields without changing the existing configuration:

```
evaluation:
  task: gen_qa 
  strategy: gen_qa 
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
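Because the schema is strict, a quick local check before uploading can save a failed job. The following sketch validates each line of a candidate `gen_qa.jsonl` against the required and optional fields described above; it is a convenience helper, not part of any AWS SDK.

```python
import json

REQUIRED = {"query", "response"}
OPTIONAL = {"system", "images", "metadata"}

def validate_gen_qa_lines(lines):
    """Return a list of (line_number, problem) pairs; an empty list means the file looks valid."""
    problems = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((i, f"invalid JSON: {err}"))
            continue
        missing = REQUIRED - record.keys()
        unknown = record.keys() - REQUIRED - OPTIONAL
        if missing:
            problems.append((i, f"missing required fields: {sorted(missing)}"))
        if unknown:
            problems.append((i, f"unexpected fields: {sorted(unknown)}"))
    return problems
```

Run it over `open("gen_qa.jsonl")` and fix any reported lines before uploading to S3.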

##### Bring your own metrics


You can bring your own metrics to fully customize your model evaluation workflow with custom preprocessing, postprocessing, and metrics capabilities. Preprocessing allows you to process input data before sending it to the inference server, and postprocessing allows you to customize metrics calculation and return custom metrics based on your needs.

Follow these steps to bring your own metrics with custom evaluation SDK.

1. If you haven't done so, [create an AWS Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html) in your AWS account first.

1. Download the pre-built `nova-custom-eval-layer.zip` file from the [GitHub repository](https://github.com/aws/nova-custom-eval-sdk/releases). You can use this open-source Nova custom evaluation SDK to validate input and output payloads for your custom function and provide a unified interface for integrating with Nova's bring your own metrics evaluation during training.

1. Upload the custom Lambda layer using the following command:

   ```
   aws lambda publish-layer-version \
       --layer-name nova-custom-eval-layer \
       --zip-file fileb://nova-custom-eval-layer.zip \
       --compatible-runtimes python3.12 python3.11 python3.10 python3.9
   ```

1. Add this layer as a custom layer to your Lambda function, along with the required AWS layer: `AWSLambdaPowertoolsPythonV3-python312-arm64` (required for `pydantic` dependency).

1. Update your Lambda code using the provided example, modifying the code as needed. This example code creates a Lambda function for Nova's custom evaluation with preprocessing and postprocessing steps for model evaluation.

   ```
   from nova_custom_evaluation_sdk.processors.decorators import preprocess, postprocess
   from nova_custom_evaluation_sdk.lambda_handler import build_lambda_handler
   
   @preprocess
   def preprocessor(event: dict, context) -> dict:
       data = event.get('data', {})
       return {
           "statusCode": 200,
           "body": {
               "system": data.get("system"),
               "prompt": data.get("prompt", ""),
               "gold": data.get("gold", "")
           }
       }
   
   @postprocess
   def postprocessor(event: dict, context) -> dict:
       # data is already validated and extracted from the event
       data = event.get('data', {})
       inference_output = data.get('inference_output', '')
       gold = data.get('gold', '')
       
       metrics = []
       inverted_accuracy = 0.0 if inference_output.lower() == gold.lower() else 1.0
       metrics.append({
           "metric": "inverted_accuracy_custom",
           "value": inverted_accuracy
       })
       
       # Add more metrics here
       
       return {
           "statusCode": 200,
           "body": metrics
       }
   
   # Build Lambda handler
   lambda_handler = build_lambda_handler(
       preprocessor=preprocessor,
       postprocessor=postprocessor
   )
   ```

1. Grant Lambda access to the evaluation job. Ensure that the execution role specified for the evaluation job includes a policy that allows invoking your Lambda function. Here is an example policy.


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "LambdaAccess",
               "Effect": "Allow",
               "Action": [
                   "lambda:InvokeFunction"
               ],
               "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ExampleFunction",
               "Condition": {
                   "StringLike": {
                       "aws:PrincipalArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-ARN"
                   }
               }
           },
           {
               "Sid": "DenyNonAWSEventSourcesForLambda",
               "Effect": "Deny",
               "Action": [
                   "lambda:InvokeFunction"
               ],
               "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ExampleFunction",
               "Condition": {
                   "Null": {
                       "lambda:EventSourceToken": false
                   }
               }
           }
       ]
   }
   ```


1. Review the Lambda payload schema. You can validate your schema using the Nova custom evaluation SDK.  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-model-evaluation.html)

1. Modify the recipe file. Here is an example. 

   ```
   processor:
     lambda_arn: arn:aws:lambda:us-east-1:111122223333:function:name
     lambda_type: "custom_metrics"
     preprocessing:
       enabled: true
     postprocessing:
       enabled: true
     aggregation: average
   ```
   + `lambda_arn`: The Amazon Resource Name (ARN) for your Lambda function that handles preprocessing and postprocessing.
   + `lambda_type`: Either "custom_metrics" or "rft".
   + `preprocessing`: Whether to enable custom pre-processing operations.
   + `postprocessing`: Whether to enable custom post-processing operations.
   + `aggregation`: Built-in aggregation function (valid options: min, max, average, sum).
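The built-in `aggregation` options reduce the per-record metric values returned by your Lambda function to a single number. Conceptually they behave like the following sketch (the actual service-side implementation is not shown in this guide):

```python
AGGREGATIONS = {
    "min": min,
    "max": max,
    "average": lambda values: sum(values) / len(values),
    "sum": sum,
}

def aggregate(values, method: str) -> float:
    """Reduce a list of per-record metric values with one of the built-in aggregation functions."""
    return AGGREGATIONS[method](values)
```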

**Limitations**
+ Bring your own metrics only applies to text input datasets.
+ Multi-modal input datasets are not supported. 
+ The preprocessing step does not process the metadata field.

#### Nova LLM as a Judge benchmark recipes


Nova LLM Judge is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, and challenger responses, then uses a Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_llm_judge_eval.yaml`.
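For two responses, the maximum-likelihood Bradley-Terry estimate reduces to the share of decisive judgements won. The sketch below illustrates that idea only; the actual `winrate` metric also incorporates ties, forward and backward passes, and bootstrap confidence bounds, which are omitted here.

```python
def bradley_terry_winrate(b_wins: int, a_wins: int) -> float:
    """Estimated probability that response_B is preferred over response_A
    (two-item Bradley-Terry maximum-likelihood estimate, ignoring ties)."""
    total = a_wins + b_wins
    if total == 0:
        raise ValueError("no decisive judgements")
    return b_wins / total
```

With 6 wins for `response_B` and 2 for `response_A`, the estimated win rate is 0.75.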

**Nova LLM dataset requirements**

File format: 
+ A single `llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `llm_judge.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the `llm_judge` dataset.
+ All records in the input dataset must be under a 12k context length.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields, without changing any other content:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ The Nova Judge model is the same across Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro specifications.
+ Custom judge models are not currently supported.

##### Nova LLM as a Judge for multi-modal (image) benchmark recipes


Nova LLM Judge for multi-modal (image), short for Nova MM LLM Judge, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses a Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ A single `mm_llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `mm_llm_judge.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ All records in the input dataset must be under a 12k context length, excluding the `images` attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `images`: Array containing a list of objects with data attributes (values are Base64-encoded image strings).

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "What is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat.",
} 
{
  "prompt": "How many animals are in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one aminal and the second has one animal"
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields, without changing any other content:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models support only image inputs.
+ Nova MM Judge models are the same across Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URI is not supported.
+ The input dataset should ensure all records are under 12 k context length, excluding images attribute.

#### Evaluating CPT (Continuous Pre-Training) Checkpoints


Evaluating CPT (continuous pre-trained) models can be more challenging than evaluating models that have undergone SFT (supervised fine-tuning), because CPT models typically lack the ability to follow instructions. Instead, CPT models operate as completion models, meaning they only attempt to continue the pattern in the input token sequence. Given this limitation, typical evaluation datasets may not work correctly due to the "Q&A" format of the inputs: instead of answering the question, the model will simply try to continue it. However, by formatting datasets to prompt the model in a completion style, you can get an understanding of how the model is performing.

Follow the steps below to perform an evaluation on a continuous pre-trained model using the Nova Forge evaluation workflow.

##### Dataset Preparation and Formatting


Evaluating a CPT model takes advantage of the existing [Bring Your Own Dataset](#nova-model-evaluation-config-byo) workflow provided in the Nova Forge model evaluation experience. However, queries within the dataset must be formatted in a purely "completion" style, because CPT models will not respond to standard question-style prompts in the same manner an SFT model would.

Another frequent limitation of models that have solely undergone CPT is their inability to generate STOP or end-of-sequence tokens, which means the model will continue to generate tokens until it is forcefully stopped (for example, by the `max_new_tokens` parameter). Given this limitation, the best practice is to evaluate such models using single-token responses (such as multiple choice) to ensure the model doesn't continue to generate unneeded output after the answer.

For example, a typical evaluation dataset (such as MMLU, GPQA, or MATH) might prompt the model with a question such as:

```
Early settlements and high population density along coastlines and rivers are 
best attributed to which of the following?
A: "Poor climate conditions"
B: "Limited forest cover"
C: "Cars"
D: "Access to trade routes" 

(Expected answer is D.)
```

However, a CPT model would not understand how to properly respond to this question due to the lack of fine tuning on instruction following. Therefore, CPT models must be prompted in a completion style, such as:

```
Early settlements and high population density along coastlines and rivers 
are best attributed to which of the following?
A: Poor climate conditions
B: Limited forest cover
C: Cars
D: Access to trade routes
The correct answer to this question is option 

(Expected answer is D.)
```

After inference, checking the output logprobs generated by the model shows whether the model correctly processed the input and generated the correct response. The model is not guaranteed to always produce the exact expected response (in this case, the letter D); however, the expected token should be present within the logprobs if the model is functioning correctly.

Here is another completion-style prompt example that is not multiple choice:

```
The capital of France is

(Expected answer of Paris)
```

We would expect the model to either produce a response of "Paris" or see the token corresponding to "Paris" somewhere in the logprobs output.
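The logprobs check described above can be sketched as a small helper. The exact structure of the logprobs in the evaluation output may differ; this sketch assumes one `{token: logprob}` dict per generated position, which is an illustrative assumption.

```python
# Sketch: check whether the expected completion appears among the model's
# top logprob candidates. Assumes one {token: logprob} dict per generated
# position (the actual output structure may differ).

def expected_token_in_logprobs(logprobs_per_position, expected):
    """Return True if any top candidate token matches the expected answer."""
    for candidates in logprobs_per_position:
        for token in candidates:
            if token.strip() == expected:
                return True
    return False

# Example: one generated position, with "Paris" among the top candidates
sample = [{" Paris": -0.02, " Lyon": -4.1, " the": -5.3}]
print(expected_token_in_logprobs(sample, "Paris"))  # True
```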

##### Dataset Formatting


CPT evaluation takes advantage of the existing [Bring Your Own Dataset](#nova-model-evaluation-config-byo) workflow. Data must be formatted in the "query response" format in a JSONL file separated by new lines.

An example of a dataset with 4 entries in it:

```
{"query": "The capital of France is", "response": "Paris"}
{"query": "2 + 2 =", "response": "4"}
{"query": "The mitochondria is the powerhouse of the", "response": "cell"}
{"query": "What is the largest planet?\nA: Mars\nB: Jupiter\nC: Saturn\nD: Earth\nAnswer:", "response": "B"}
```

Each line must contain:
+ `query`: The prompt text for completion
+ `response`: The expected completion (ground truth)

The model will receive raw text input without chat formatting. CPT models are typically not yet trained on special tokens and will not respond properly to chat templates, so when prompting the model, ONLY the string provided in the query will be sent to the model (with an additional `[BOS]` token prepended automatically).
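As a sketch, the "query response" JSONL file above can be written and sanity-checked with a few lines of standard-library Python (the file path is an illustrative choice):

```python
import json
import os
import tempfile

# Each line is one JSON object with exactly the two required keys.
examples = [
    {"query": "The capital of France is", "response": "Paris"},
    {"query": "2 + 2 =", "response": "4"},
]

path = os.path.join(tempfile.gettempdir(), "cpt_eval.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line parses and carries only query/response
with open(path) as f:
    for line in f:
        record = json.loads(line)
        assert set(record) == {"query", "response"}
```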

##### Recipe Configuration


Here is an example of a recipe which is configured for evaluating a CPT model:

```
run:
  name: cpt_eval_job
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: s3://bucket/path/to/cpt-checkpoint/

evaluation:
  task: gen_qa  # Required for CPT - bring your own dataset
  strategy: gen_qa
  metric: all  # Returns rouge1, rouge2, rougeL, exact_match, f1_score, bleu

inference:
  checkpoint_is_instruction_tuned: "false"  # Required for CPT checkpoints
  top_logprobs: 5 # Change to desired amount of logprobs to calculate
  max_new_tokens: 1 # Keep low for completion tasks
  temperature: 0.0
```

Key changes for CPT evaluation:
+ `checkpoint_is_instruction_tuned: "false"`

  This parameter was added specifically to support evaluation runs on CPT checkpoints. Setting `checkpoint_is_instruction_tuned` to false will **disable** the default Amazon Nova chat template that normally wraps the input prompt.
+ `top_logprobs: 5`

  Log probabilities (logprobs) reveal the model's confidence distribution across possible next tokens, helping you assess whether the model has learned the expected completions during pre-training. Typically, if the model is performing as intended, you should see the expected response (for example, "A" or "B") either as the generated output token or as a token in the logprobs.
+ `max_new_tokens: 1`

  CPT models typically have not yet been trained to generate special "stop" or "end of sequence" tokens to signal when inference should stop. This means the model will typically continue to generate new tokens until the maximum token length is reached, resulting in unnecessary inference. Limiting `max_new_tokens` to 1 and providing a prompt that can evaluate the model on a single response (like a multiple-choice question) is the most efficient way to prompt the model, and ensures that extra junk tokens aren't generated.

##### Key Parameters

+ **checkpoint_is_instruction_tuned**: Must be set to `"false"` (or `false` as a boolean) to disable chat templates
+ **top_logprobs**: 5 recommended, to see how the model is learning during CPT
+ **task**: Must be `gen_qa` - CPT models cannot use instruction-following benchmarks like MMLU or MATH
+ **strategy**: Must be `gen_qa`
+ **max_new_tokens**: Recommended to keep low (1-5) since CPT models perform completion, not generation

## Running evaluation training jobs
Evaluation training jobs

Start an evaluation training job using the following sample notebook. For more information, see [Use a SageMaker AI estimator to run a training job](https://docs.aws.amazon.com//sagemaker/latest/dg/docker-containers-adapt-your-own-private-registry-estimator.html).

### Reference tables


Before running the notebook, refer to the following reference tables to select image URI and instance configurations.

**Selecting image URI**


| Recipe | Image URI | 
| --- | --- | 
|  Evaluation image URI  | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 

**Selecting instance type and count**


| Model | Job type | Instance type | Recommended instance count | Allowed instance count | 
| --- | --- | --- | --- | --- | 
| Amazon Nova Micro | Evaluation (SFT/DPO) |  g5.12xlarge  | 1 | 1 - 16 | 
| Amazon Nova Lite | Evaluation (SFT/DPO) |  g5.12xlarge  | 1 | 1 - 16 | 
| Amazon Nova Pro | Evaluation (SFT/DPO) |  p5.48xlarge  | 1 | 1 - 16 | 

### Sample notebook


The following sample notebook demonstrates how to run an evaluation training job.

```
# install python SDK

# Do not use sagemaker v3, as sagemaker v3 introduced breaking changes

!pip install sagemaker==2.254.1
 
import os
import sagemaker,boto3
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Download recipe from https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/evaluation/nova to local
# Assume the file name be `recipe.yaml`

# Populate parameters
# input_s3_uri = "s3://<path>/input/" # (Optional) Only used for multi-modal dataset or bring your own dataset s3 location
output_s3_uri= "s3://<path>/output/" # Output data s3 location, a zip containing metrics json and tensorboard metrics files will be stored to this location
instance_type = "instance_type"  # ml.g5.16xlarge as example
instance_count = 1 # The number of instances for inference (set instance_count > 1 for multi-node inference to accelerate evaluation)             
job_name = "your job name"
recipe_path = "recipe path" # ./recipe.yaml as example
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest" # Do not change
output_kms_key = "<KMS key arn to encrypt trained model in Amazon-owned S3 bucket>" # optional, leave blank for Amazon managed encryption

# (Optional) To bring your own dataset and LLM judge for evaluation
# evalInput = TrainingInput(
# s3_data=input_s3_uri,
# distribution='FullyReplicated',
# s3_data_type='S3Prefix'
#)

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    instance_count=instance_count,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    output_kms_key=output_kms_key
)
estimator.fit()

# If input dataset exist, pass in inputs
# estimator.fit(inputs={"train": evalInput})
```

## Assessing and analyzing evaluation results
Assessing evaluation results

After your evaluation job completes successfully, you can assess and analyze the results using the following steps.

**To assess and analyze the results, follow these steps.**

1. Understand the output location structure. Results are stored in your specified Amazon S3 output location as a compressed file:

   ```
   s3://your-bucket/output/benchmark-name/
   └── job_name/
       └── output/
           └── output.tar.gz
   ```

1. Download the `output.tar.gz` file from your bucket. Extract the contents to reveal the following structure.

   ```
   run_name/
   ├── eval_results/
   |   └── results_[timestamp].json
   │   └── inference_output.jsonl (only present for gen_qa)
   |   └── details/
   |         └── model/
   |              └── <execution-date-time>/
   |                    └──details_<task_name>_#_<datetime>.parquet
   └── tensorboard_results/
       └── eval/
           └── events.out.tfevents.[timestamp]
   ```
   + `results_[timestamp].json` - Output metrics JSON file
   + `details_<task_name>_#_<datetime>.parquet` - Inference output file (except for `strong_reject`)
   + `events.out.tfevents.[timestamp]` - TensorBoard output file
   + `inference_output.jsonl` - Cleaned inference output file (only for `gen_qa` tasks)

1. View results in TensorBoard. To visualize your evaluation metrics: 

   1. Upload the extracted folder to an S3 bucket

   1. Navigate to SageMaker AI TensorBoard

   1. Select your "S3 folders"

   1. Add the S3 folder path

   1. Wait for synchronization to complete

1. Analyze inference outputs. All evaluation tasks, except `llm_judge` and `strong_reject`, will have the following fields for analysis in the inference output.
   + `full_prompt` - The full user prompt sent to the model used for the evaluation task.
   + `gold` - The field that contains the correct answer(s) as specified by the dataset.
   + `metrics` - The field that contains the metrics evaluated on the individual inference. Values that require aggregation would not have a value on the individual inference outputs.
   + `predictions` - The field that contains a list of the model’s output for the given prompt.
   + `pred_logits` - The field that contains the considered output tokens and log probabilities of each output token returned in the message content.

   By looking at these fields, you can determine the cause for metric differences and understand the behavior of the customized models.

   For `llm_judge`, the inference output file contains the following fields under the metrics field per pair of evaluations.
   + `forward_output` - Judge's raw preferences when evaluating in order (`response_A`, `response_B`).
   + `backward_output` - Judge's raw preferences when evaluating in reverse order (`response_B`, `response_A`).
   + Pairwise metrics - Metrics that are calculated per pair of forward and backward evaluations, including `a_scores`, `b_scores`, `ties`, `inference-score`, and `score`.
**Note**  
Aggregate metrics like `winrate` are only available in the summary results files, not per individual judgement.

   For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
   + `prompt` - The final prompt submitted to the model
   + `inference` - The raw inference output from the model
   + `gold` - The target response from the input dataset
   + `metadata` - The metadata string from the input dataset, if provided
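As a sketch, a quick exact-match rate can be computed from the `gen_qa` `inference_output.jsonl` fields described above. The case-insensitive normalization here is an illustrative choice, not part of the service:

```python
import json

def exact_match_rate(jsonl_lines):
    """Fraction of samples whose inference matches gold (case-insensitive)."""
    hits = 0
    total = 0
    for line in jsonl_lines:
        record = json.loads(line)
        total += 1
        if record["inference"].strip().lower() == record["gold"].strip().lower():
            hits += 1
    return hits / total if total else 0.0

sample = [
    '{"prompt": "The capital of France is", "inference": "Paris", "gold": "Paris", "metadata": ""}',
    '{"prompt": "2 + 2 =", "inference": "5", "gold": "4", "metadata": ""}',
]
print(exact_match_rate(sample))  # 0.5
```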

## Evaluation best practices and troubleshooting
Best practices and troubleshooting

### Best practices
Best practices

The following lists some best practices for the evaluation process.
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

### Troubleshooting
Troubleshooting

You can use CloudWatch log group `/aws/sagemaker/TrainingJobs` for training job error logs.

#### Engine core failure


**Issue**: 

If you are seeing: 

```
RuntimeError: Engine core initialization failed.
```

**Cause**: 

Although this is a general error that can have multiple causes, it typically occurs when there is a mismatch between the model checkpoint you're trying to load and the model type specified. For example, you want to evaluate a fine-tuned Nova 2.0 Lite model checkpoint, but the model type you provide is a 1.0 model type, such as `amazon.nova-micro-v1:0:128k`.

The correct mapping should be 

```
model_type: amazon.nova-2-lite-v1:0:256k
model_name_or_path: nova-lite-2/prod # or s3://escrow_bucket/model_location
```

**Prevention**: 

Double check the `model_name_or_path` is mapped to the right `model_type` before submitting the evaluation job.

## Available subtasks
Available subtasks

The following lists available subtasks for model evaluation across multiple domains including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and MMMU (Massive Multi-discipline Multimodal Understanding). These subtasks allow you to assess your model's performance on specific capabilities and knowledge areas.

**MMLU**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**Math**

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus"
]

**MMMU**

```
MMMU_SUBTASKS = [
    "Accounting",
    "Agriculture",
    "Architecture_and_Engineering",
    "Art",
    "Art_Theory",
    "Basic_Medical_Science",
    "Biology",
    "Chemistry",
    "Clinical_Medicine",
    "Computer_Science",
    "Design",
    "Diagnostics_and_Laboratory_Medicine",
    "Economics",
    "Electronics",
    "Energy_and_Power",
    "Finance",
    "Geography",
    "History",
    "Literature",
    "Manage",
    "Marketing",
    "Materials",
    "Math",
    "Mechanical_Engineering",
    "Music",
    "Pharmacy",
    "Physics",
    "Psychology",
    "Public_Health",
    "Sociology"
]


# Reasoning model evaluation
Reasoning model evaluation

## Overview


Reasoning model support enables evaluation with reasoning-capable Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**
+ amazon.nova-2-lite-v1:0:256k

## Recipe configuration


Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:  
  name: reasoning-eval-job-name                          # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model  
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for SageMaker HyperPod job  
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod job (not compatible with SageMaker Training jobs)  
  
evaluation:  
  task: mmlu                                             # [MODIFIABLE] Evaluation task  
  strategy: zs_cot                                       # [MODIFIABLE] Evaluation strategy  
  metric: accuracy                                       # [MODIFIABLE] Metric calculation method  
  
inference:  
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/high or null to disable  
  max_new_tokens: 32768                                  # [MODIFIABLE] Maximum tokens to generate, recommended value when reasoning_effort set to high  
  top_k: -1                                              # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

## Using the `reasoning_effort` parameter


The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

### Prerequisites

+ **Model compatibility** – Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling** – Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`

### Available options



| Option | Behavior | Token limit | Use case | 
| --- | --- | --- | --- | 
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 


| Training method | Available options | How to configure | 
| --- | --- | --- | 
| SFT (Supervised Fine-Tuning) | High or Off only | Use `reasoning_enabled: true` (high) or `reasoning_enabled: false` (off) | 
| RFT (Reinforcement Fine-Tuning) | Low, High, or Off | Use `reasoning_effort: low` or `reasoning_effort: high`. Omit the field to disable. | 
| Evaluation | Low, High, or Off | Use `reasoning_effort: low` or `reasoning_effort: high`. Use `null` to disable. | 

### When to enable reasoning


**Use reasoning mode (`low` or `high`) for**
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (`null` or omit parameter) for**
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

### Troubleshooting


**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# RFT evaluation
RFT evaluation

## What is RFT evaluation?


RFT Evaluation allows you to assess your model's performance using custom reward functions before, during, or after reinforcement learning training. Unlike standard evaluations that use pre-defined metrics, RFT Evaluation lets you define your own success criteria through a Lambda function that scores model outputs based on your specific requirements.

## Why evaluate with RFT?


Evaluation is crucial to determine whether the RL fine-tuning process has:
+ Improved model alignment with your specific use case and human values
+ Maintained or improved model capabilities on key tasks
+ Avoided unintended side effects such as reduced factuality, increased verbosity, or degraded performance on other tasks
+ Met your custom success criteria as defined by your reward function

## When to use RFT evaluation


Use RFT Evaluation in these scenarios:
+ Before RFT Training: Establish baseline metrics on your evaluation dataset
+ During RFT Training: Monitor training progress with intermediate checkpoints
+ After RFT Training: Validate that the final model meets your requirements
+ Comparing Models: Evaluate multiple model versions using consistent reward criteria

**Note**  
Use RFT Evaluation when you need custom, domain-specific metrics. For general-purpose evaluation (accuracy, perplexity, BLEU), use standard evaluation methods.

## Data format requirements


### Input data structure


RFT evaluation input data must follow the OpenAI Reinforcement Fine-Tuning format. Each example is a JSON object containing:
+ `messages` – Array of conversational turns with `system` and `user` roles
+ `reference_answer` – Expected output or ground truth data used by your reward function for scoring

### Data format example


```
{  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "type": "text",  
          "text": "Solve for x. Return only JSON like {\"x\": <number>}. Equation: 2x + 5 = 13"  
        }  
      ]  
    }  
  ],  
  "reference_answer": {  
    "x": 4  
  }  
}
```

### Current limitations

+ Text only: No multimodal inputs (images, audio, video) are supported
+ Single-turn conversations: Only supports single user message (no multi-turn dialogues)
+ JSON format: Input data must be in JSONL format (one JSON object per line)
+ Model outputs: Evaluation is performed on generated completions from the specified model
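The limitations above can be checked before submitting a job. This sketch validates one JSONL line against the documented constraints (single user turn, text-only content, `reference_answer` present); it mirrors the documented input structure rather than any service-side validation:

```python
import json

def validate_rft_example(line):
    """Return True if one JSONL line satisfies the documented RFT limits."""
    record = json.loads(line)
    messages = record.get("messages", [])
    user_turns = [m for m in messages if m.get("role") == "user"]
    if len(user_turns) != 1:
        return False  # single-turn conversations only
    for part in user_turns[0].get("content", []):
        if part.get("type") != "text":
            return False  # text only, no multimodal inputs
    return "reference_answer" in record

good = ('{"messages": [{"role": "user", "content": '
        '[{"type": "text", "text": "Solve 2x + 5 = 13"}]}], '
        '"reference_answer": {"x": 4}}')
print(validate_rft_example(good))  # True
```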

## Preparing your evaluation recipe


### Sample notebook


For a complete example, see [Evaluation notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook).

### Sample recipe configuration


```
run:  
  name: nova-lite-rft-eval-job    
  model_type: amazon.nova-lite-v1:0:300k    
  model_name_or_path: s3://escrow_bucket/model_location # [MODIFIABLE] S3 path to your model or model identifier  
  replicas: 1 # [MODIFIABLE] For SageMaker Training jobs only; fixed for HyperPod jobs  
  data_s3_path: "" # [REQUIRED FOR HYPERPOD] Leave empty for SageMaker Training jobs and use TrainingInput in sagemaker python SDK  
  output_s3_path: "" # [REQUIRED] Output artifact S3 path for evaluation results  
  
evaluation:  
  task: rft_eval # [FIXED] Do not modify  
  strategy: rft_eval # [FIXED] Do not modify  
  metric: all # [FIXED] Do not modify  
  
# Inference Configuration  
inference:  
  max_new_tokens: 8192 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1 # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0 # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0 # [MODIFIABLE] Sampling temperature (0 = deterministic)  
  top_logprobs: 0 # [MODIFIABLE] Set between 1-20 to enable logprobs output  
  
# =============================================================================  
# Bring Your Own Reinforcement Learning Environment  
# =============================================================================  
rl_env:  
  reward_lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<reward-function-name>
```

## Preset reward functions


Two preset reward functions (`prime_code` and `prime_math`) are available as a Lambda layer for easy integration with your RFT Lambda functions.

### Overview


These preset functions provide out-of-the-box evaluation capabilities for:
+ `prime_code` – Code generation and correctness evaluation
+ `prime_math` – Mathematical reasoning and problem-solving evaluation

### Quick setup


1. Download the Lambda layer from the [nova-custom-eval-sdk releases](https://github.com/aws/nova-custom-eval-sdk/releases).

1. Publish Lambda layer using AWS Command Line Interface (AWS CLI):

   ```
   aws lambda publish-layer-version \
       --layer-name preset-function-layer \
       --description "Preset reward function layer with dependencies" \
       --zip-file fileb://universal_reward_layer.zip \
       --compatible-runtimes python3.9 python3.10 python3.11 python3.12 \
       --compatible-architectures x86_64 arm64
   ```

1. Add the layer to your Lambda function in AWS Management Console (Select the preset-function-layer from custom layer and also add AWSSDKPandas-Python312 for numpy dependencies).

1. Import and use in your Lambda code:

   ```
   from prime_code import compute_score  # For code evaluation
   from prime_math import compute_score  # For math evaluation
   ```

### `prime_code` function


Evaluates Python code generation tasks by executing code against test cases and measuring correctness.

**Example input dataset format**

```
{"messages":[{"role":"user","content":"Write a function that returns the sum of two numbers."}],"reference_answer":{"inputs":["3\n5","10\n-2","0\n0"],"outputs":["8","8","0"]}}
{"messages":[{"role":"user","content":"Write a function to check if a number is even."}],"reference_answer":{"inputs":["4","7","0","-2"],"outputs":["True","False","True","True"]}}
```

**Key features**
+ Automatic code extraction from markdown code blocks
+ Function detection and call-based testing
+ Test case execution with timeout protection
+ Syntax validation and compilation checks
+ Detailed error reporting with tracebacks

### `prime_math` function


Evaluates mathematical reasoning and problem-solving capabilities with symbolic math support.

**Input format**

```
{"messages":[{"role":"user","content":"What is the derivative of x^2 + 3x?"}],"reference_answer":"2*x + 3"}
```

**Key features**
+ Symbolic math evaluation using SymPy
+ Multiple answer formats (LaTeX, plain text, symbolic)
+ Mathematical equivalence checking
+ Expression normalization and simplification

### Data format requirements


**For code evaluation**
+ Inputs: Array of function arguments (proper types: integers, strings, etc.)
+ Outputs: Array of expected return values (proper types: booleans, numbers, etc.)
+ Code: Must be in Python with clear function definitions

**For math evaluation**
+ Reference answer: Mathematical expression or numeric value
+ Response: Can be LaTeX, plain text, or symbolic notation
+ Equivalence: Checked symbolically, not just string matching

### Best practices

+ Use proper data types in test cases (integers vs strings, booleans vs "True")
+ Provide clear function signatures in code problems
+ Include edge cases in test inputs (zero, negative numbers, empty inputs)
+ Format math expressions consistently in reference answers
+ Test your reward function with sample data before deployment

### Error handling


Both functions include robust error handling for:
+ Compilation errors in generated code
+ Runtime exceptions during execution
+ Malformed input data
+ Timeout scenarios for infinite loops
+ Invalid mathematical expressions

## Creating your reward function


### Lambda ARN requirements


Your Lambda ARN must follow this format:

```
"arn:aws:lambda:*:*:function:*SageMaker*"
```

If the Lambda does not have this naming scheme, the job will fail with this error:

```
[ERROR] Unexpected error: lambda_arn must contain one of: ['SageMaker', 'sagemaker', 'Sagemaker'] when running on SMHP platform (Key: lambda_arn)
```
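A pre-flight check for the naming requirement avoids a failed job. This sketch encodes the rule from the error message above:

```python
def lambda_arn_is_valid(arn):
    """Return True if the ARN contains one of the required name tags."""
    return any(tag in arn for tag in ("SageMaker", "sagemaker", "Sagemaker"))

# Hypothetical ARN used only for illustration
print(lambda_arn_is_valid(
    "arn:aws:lambda:us-east-1:123456789012:function:MySageMakerReward"))  # True
```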

### Lambda request format


Your Lambda function receives data in this format:

```
[  
  {  
    "id": "sample-001",  
    "messages": [  
      {  
        "role": "user",  
        "content": [  
          {  
            "type": "text",  
            "text": "Do you have a dedicated security team?"  
          }  
        ]  
      },  
      {  
        "role": "nova_assistant",  
        "content": [  
          {  
            "type": "text",  
            "text": "As an AI developed by Company, I don't have a dedicated security team..."  
          }  
        ]  
      }  
    ],  
    "reference_answer": {  
      "compliant": "No",  
      "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
  }  
]
```

**Note**  
The message structure includes the nested `content` array, matching the input data format. The last message with role `nova_assistant` contains the model's generated response.

### Lambda response format


Your Lambda function must return data in this format:

```
[  
  {  
    "id": "sample-001",  
    "aggregate_reward_score": 0.75,  
    "metrics_list": [  
      {  
        "name": "accuracy",  
        "value": 0.85,  
        "type": "Metric"  
      },  
      {  
        "name": "fluency",  
        "value": 0.90,  
        "type": "Reward"  
      }  
    ]  
  }  
]
```

**Response fields**
+ `id` – Must match the input sample ID
+ `aggregate_reward_score` – Overall score (typically 0.0 to 1.0)
+ `metrics_list` – Array of individual metrics with:
  + `name` – Metric identifier (e.g., "accuracy", "fluency")
  + `value` – Metric score (typically 0.0 to 1.0)
  + `type` – Either "Metric" (for reporting) or "Reward" (used in training)
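Putting the request and response formats together, a minimal reward Lambda might look like the following sketch. The scoring logic (checking whether the stringified reference answer appears in the model's response) is an illustrative assumption; real reward functions implement their own domain-specific criteria:

```python
def lambda_handler(event, context):
    """Score each sample in the batch and return the documented format."""
    results = []
    for sample in event:
        # The last nova_assistant message holds the model's generated response
        assistant = [m for m in sample["messages"]
                     if m["role"] == "nova_assistant"][-1]
        response_text = assistant["content"][0]["text"]
        # Illustrative scoring: 1.0 if the reference answer appears verbatim
        score = 1.0 if str(sample["reference_answer"]) in response_text else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "exact_match", "value": score, "type": "Reward"},
            ],
        })
    return results
```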

## IAM permissions


### Required permissions


Your SageMaker execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker execution role:

```
{  
  "Version": "2012-10-17",
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "lambda:InvokeFunction"  
      ],  
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"  
    }  
  ]  
}
```

### Lambda execution role


Your Lambda function's execution role needs basic Lambda execution permissions:

```
{  
  "Version": "2012-10-17",
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "logs:CreateLogGroup",  
        "logs:CreateLogStream",  
        "logs:PutLogEvents"  
      ],  
      "Resource": "arn:aws:logs:*:*:*"  
    }  
  ]  
}
```

If your Lambda function accesses other AWS services (e.g., S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Executing the evaluation job


1. **Prepare your data** – Format your evaluation data according to the data format requirements and upload your JSONL file to S3: `s3://your-bucket/eval-data/eval_data.jsonl`

1. **Configure your recipe** – Update the sample recipe with your configuration:
   + Set `model_name_or_path` to your model location
   + Set `lambda_arn` to your reward function ARN
   + Set `output_s3_path` to your desired output location
   + Adjust `inference` parameters as needed

   Save the recipe as `rft_eval_recipe.yaml`

1. **Run the evaluation** – Execute the evaluation job using the provided notebook: [Evaluation notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook)

1. **Monitor progress** – Monitor your evaluation job through:
   + SageMaker Console: Check job status and logs
   + CloudWatch Logs: View detailed execution logs
   + Lambda Logs: Debug reward function issues
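Step 1 above, serializing samples to JSONL and uploading the file, can be sketched as follows. The bucket and key are placeholders:

```
import json

def to_jsonl(samples):
    """Serialize evaluation samples to JSONL: one JSON object per line."""
    return "\n".join(json.dumps(s) for s in samples) + "\n"

samples = [{
    "messages": [{"role": "user", "content": "Do you have a dedicated security team?"}],
    "reference_answer": {"compliant": "No", "explanation": "..."},
}]
body = to_jsonl(samples)

# Upload with boto3 (bucket and key are placeholders):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="your-bucket", Key="eval-data/eval_data.jsonl", Body=body)
```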

## Understanding evaluation results


### Output format


The evaluation job outputs results to your specified S3 location in JSONL format. Each line contains the evaluation results for one sample:

```
{  
  "id": "sample-001",  
  "aggregate_reward_score": 0.75,  
  "metrics_list": [  
    {  
      "name": "accuracy",  
      "value": 0.85,  
      "type": "Metric"  
    },  
    {  
      "name": "fluency",  
      "value": 0.90,  
      "type": "Reward"  
    }  
  ]  
}
```

**Note**  
The RFT evaluation job output is identical to the Lambda response format. The evaluation service passes your Lambda function's response through without modification, ensuring consistency between your reward calculations and the final results.

### Interpreting results


**Aggregate reward score**
+ Range: Typically 0.0 (worst) to 1.0 (best), but depends on your implementation
+ Purpose: Single number summarizing overall performance
+ Usage: Compare models, track improvement over training
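For example, a short script (illustrative, not part of the evaluation service) can summarize a results file by averaging the aggregate score and each named metric:

```
import json
from statistics import mean

def summarize(jsonl_text):
    """Mean aggregate score and per-metric means from a results JSONL file."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    per_metric = {}
    for rec in records:
        for m in rec.get("metrics_list", []):
            per_metric.setdefault(m["name"], []).append(m["value"])
    return {
        "mean_aggregate": mean(r["aggregate_reward_score"] for r in records),
        "mean_per_metric": {name: mean(vals) for name, vals in per_metric.items()},
    }

results = """\
{"id": "a", "aggregate_reward_score": 0.7, "metrics_list": [{"name": "accuracy", "value": 0.8, "type": "Metric"}]}
{"id": "b", "aggregate_reward_score": 0.9, "metrics_list": [{"name": "accuracy", "value": 0.9, "type": "Metric"}]}
"""
summary = summarize(results)
```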

**Individual metrics**
+ Metric Type: Informational metrics for analysis
+ Reward Type: Metrics used during RFT training
+ Interpretation: Higher values generally indicate better performance (unless you design inverse metrics)

### Performance benchmarks


What constitutes "good" performance depends on your use case:


| Score range | Interpretation | Action | 
| --- | --- | --- | 
| 0.8 - 1.0 | Excellent | Model ready for deployment | 
| 0.6 - 0.8 | Good | Minor improvements may be beneficial | 
| 0.4 - 0.6 | Fair | Significant improvement needed | 
| 0.0 - 0.4 | Poor | Review training data and reward function | 

**Important**  
These are general guidelines. Define your own thresholds based on business requirements, baseline model performance, domain-specific constraints, and cost-benefit analysis of further training.

## Troubleshooting


### Common issues



| Issue | Cause | Solution | 
| --- | --- | --- | 
| Lambda timeout | Complex reward calculation | Increase Lambda timeout or optimize function | 
| Permission denied | Missing IAM permissions | Verify SageMaker role can invoke Lambda | 
| Inconsistent scores | Non-deterministic reward function | Use fixed seeds or deterministic logic | 
| Missing results | Lambda errors not caught | Add comprehensive error handling in Lambda | 

### Debug checklist

+ Verify input data follows the correct format with nested content arrays
+ Confirm Lambda ARN is correct and function is deployed
+ Check IAM permissions for SageMaker → Lambda invocation
+ Review CloudWatch logs for Lambda errors
+ Validate Lambda response matches expected format
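The last checklist item can be automated with a small validator. This is a sketch; the field checks shown are illustrative, not an exhaustive schema:

```
def validate_response(records, expected_ids):
    """Return a list of problems found in a reward Lambda response."""
    if not isinstance(records, list):
        return ["top-level response must be a list"]
    problems = []
    for rec in records:
        if rec.get("id") not in expected_ids:
            problems.append(f"unexpected id: {rec.get('id')}")
        if not isinstance(rec.get("aggregate_reward_score"), (int, float)):
            problems.append(f"{rec.get('id')}: aggregate_reward_score must be a number")
        for m in rec.get("metrics_list", []):
            if m.get("type") not in ("Metric", "Reward"):
                problems.append(f"{rec.get('id')}: invalid metric type {m.get('type')!r}")
    return problems

good = [{"id": "sample-001", "aggregate_reward_score": 0.75,
         "metrics_list": [{"name": "accuracy", "value": 0.85, "type": "Metric"}]}]
issues = validate_response(good, {"sample-001"})
```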

## Best practices

+ Start Simple: Begin with basic reward functions and iterate
+ Test Lambda Separately: Use Lambda test events before full evaluation
+ Validate on Small Dataset: Run evaluation on subset before full dataset
+ Version Control: Track reward function versions alongside model versions
+ Monitor Costs: Lambda invocations and compute time affect costs
+ Log Extensively: Use print statements in Lambda for debugging
+ Set Timeouts Appropriately: Balance between patience and cost
+ Document Metrics: Clearly define what each metric measures

## Next steps


After completing RFT evaluation:
+ If results are satisfactory: Deploy model to production
+ If improvement needed:
  + Adjust reward function
  + Collect more training data
  + Modify training hyperparameters
  + Run additional RFT training iterations
+ Continuous monitoring: Re-evaluate periodically with new data

# Implementing reward functions
Implementing reward functions

## Overview


The reward function (also called a scorer or grader) is the core component that evaluates model responses and provides feedback signals for training. It must be implemented as a Lambda function that accepts model responses and returns reward scores.

## Interface format


Your reward function must accept and return data in the following format:

**Sample training input**

```
{  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        }  
    ],              
   "reference_answer": {  
       "compliant": "No",  
       "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
}
```

**Sample payload for the reward Lambda**

The container automatically transforms your data before sending it to your Lambda function by:

1. Generating a model response for each prompt

1. Appending the assistant turn (generated response) to the messages array

1. Adding a unique `id` field for tracking

Your Lambda function will receive data in this transformed format:

```
{    
   "id": "123",  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        },  
        {  
            "role": "assistant",  
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."  
        }  
    ],              
    # Following section will be same as your training dataset sample  
    "reference_answer": {  
        "compliant": "No",  
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
}
```
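A rough sketch of this transform (the real container may generate `id` values differently):

```
import uuid

def to_lambda_payload(sample, generated_response):
    """Append the generated assistant turn and add a tracking id."""
    payload = dict(sample)
    payload["messages"] = sample["messages"] + [
        {"role": "assistant", "content": generated_response}
    ]
    payload["id"] = str(uuid.uuid4())
    return payload

sample = {
    "messages": [{"role": "user", "content": "Do you have a dedicated security team?"}],
    "reference_answer": {"compliant": "No"},
}
payload = to_lambda_payload(sample, "No, I do not have a security team.")
```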

**Reward Lambda contract**

```
def lambda_handler(event, context):  
   return lambda_grader(event)  
  
def lambda_grader(samples: list[dict]) -> list[dict]:  
    """  
    Args:  
        samples: List of dictionaries in OpenAI format  
          
        Example input:  
        {     
            "id": "123",  
            "messages": [  
                {  
                    "role": "user",  
                    "content": "Do you have a dedicated security team?"  
                },  
                {  
                    "role": "assistant",  
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."  
                }  
            ],              
            # This section will be same as your training dataset  
            "reference_answer": {  
                "compliant": "No",  
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
            }  
        }  
      
    Returns:  
        List of dictionaries with reward scores:  
        {  
            "id": str,                              # Same id as input sample  
            "aggregate_reward_score": float,        # Overall score for the sample  
            "metrics_list": [                       # OPTIONAL: Component scores  
                {  
                    "name": str,                    # Name of the component score  
                    "value": float,                 # Value of the component score  
                    "type": str                     # "Reward" or "Metric"  
                }  
            ]  
        }  
    """
```
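A minimal grader that satisfies this contract might look like the following. The scoring logic (checking whether the reference `compliant` verdict appears in the assistant reply) is purely illustrative:

```
def lambda_handler(event, context):
    return lambda_grader(event if isinstance(event, list) else [event])

def lambda_grader(samples):
    """Reward 1.0 when the reference 'compliant' verdict appears in the
    assistant reply, else 0.0. Illustrative logic only."""
    results = []
    for sample in samples:
        reply = next((m["content"] for m in sample.get("messages", [])
                      if m.get("role") == "assistant"), "")
        reference = sample.get("reference_answer", {})
        verdict = reference.get("compliant", "") if isinstance(reference, dict) else str(reference)
        score = 1.0 if verdict and verdict.lower() in reply.lower() else 0.0
        results.append({
            "id": sample.get("id", "unknown"),
            "aggregate_reward_score": score,
            "metrics_list": [{"name": "verdict_match", "value": score, "type": "Reward"}],
        })
    return results

out = lambda_grader([{
    "id": "123",
    "messages": [
        {"role": "user", "content": "Do you have a dedicated security team?"},
        {"role": "assistant", "content": "No, I do not have a dedicated security team."},
    ],
    "reference_answer": {"compliant": "No"},
}])
```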

## Input and output fields


### Input fields



| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Unique identifier for the sample | Echoed back in output. String format | 
| messages | Ordered chat history in OpenAI format | Array of message objects | 
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" | 
| messages[].content | Text content of the message | Plain string | 
| metadata | Free-form information to aid grading | Object; optional fields passed from training data | 

### Output fields



| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Same identifier as input sample | Must match input | 
| aggregate\_reward\_score | Overall score for the sample | Float (e.g., 0.0–1.0 or task-defined range) | 
| metrics\_list | Component scores that make up the aggregate | Array of metric objects | 

## Technical constraints

+ **Timeout limit** – 15 minutes maximum execution time per Lambda invocation
+ **Concurrency** – Must handle `rollout_worker_replicas * 64` concurrent requests
+ **Reliability** – Must implement proper error handling and return valid scores consistently
+ **Performance** – Optimize for fast execution (seconds, not minutes) to enable efficient training
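For capacity planning, the concurrency constraint above translates directly into the Lambda concurrency you should provision or reserve (the per-replica factor of 64 comes from the constraint above):

```
def required_concurrency(rollout_worker_replicas, per_replica=64):
    """Concurrent reward Lambda invocations implied by the constraint above."""
    return rollout_worker_replicas * per_replica

needed = required_concurrency(4)  # 4 replicas -> 256 concurrent requests
```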

**Best practices**
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement retry logic for transient failures
+ Cache reusable computations
+ Test thoroughly before training to ensure bug-free execution

## Using custom reward functions


Implement custom reward functions when you have task-specific evaluation criteria:
+ **Define evaluation criteria** – Determine what makes a good response for your task
+ **Implement Lambda function** – Create a Lambda function following the interface format
+ **Test locally** – Validate your function returns correct scores for sample inputs
+ **Deploy to AWS** – Deploy your Lambda and note the ARN
+ **Configure recipe** – Add the Lambda ARN to your recipe's `reward_lambda_arn` field
+ **Test with small dataset** – Run RFT with minimal data to verify integration
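The local-testing step can be sketched by importing your handler and invoking it with a sample payload (the module name is hypothetical):

```
# Hypothetical module name; substitute the file that defines your handler.
# from my_reward_function import lambda_handler

sample_event = [{
    "id": "test-001",
    "messages": [
        {"role": "user", "content": "Do you have a dedicated security team?"},
        {"role": "assistant", "content": "No, I do not have a security team."},
    ],
    "reference_answer": {"compliant": "No"},
}]

def looks_valid(results):
    """Lightweight check that each result satisfies the reward contract."""
    return all(
        isinstance(r.get("id"), str)
        and isinstance(r.get("aggregate_reward_score"), (int, float))
        for r in results
    )

# looks_valid(lambda_handler(sample_event, None))
```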

## IAM permissions


### Required permissions


Your SageMaker execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker execution role:

```
{  
  "Version": "2012-10-17",  
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "lambda:InvokeFunction"  
      ],  
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"  
    }  
  ]  
}
```

### Lambda execution role


Your Lambda function's execution role needs basic Lambda execution permissions:

```
{  
  "Version": "2012-10-17",  
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "logs:CreateLogGroup",  
        "logs:CreateLogStream",  
        "logs:PutLogEvents"  
      ],  
      "Resource": "arn:aws:logs:*:*:*"  
    }  
  ]  
}
```

Additional permissions: If your Lambda function accesses other AWS services (for example, S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Example: LLM-as-a-judge reward function


This example demonstrates using an Amazon Bedrock model as a judge to evaluate model responses by comparing them against reference answers. The Lambda template provides a framework for implementing Amazon Bedrock inference calls that process judge evaluations. The Lambda function maintains the same input/output contract as other reward functions.

### Implementation


This Lambda function implements a two-stage evaluation process: the `lambda_handler` extracts model responses and reference answers from incoming samples, then the `lambda_graded` function calls Amazon Bedrock to score the semantic similarity between them. The implementation includes robust error handling with automatic retries for transient failures and supports flexible reference answer formats (both string and structured dictionary formats).

**Implementation details:**
+ **Retry Logic**: Implements exponential backoff (1s, 2s, 4s) for throttling exceptions to handle Bedrock API rate limits
+ **Error Handling**: Returns score of 0.0 for failed evaluations rather than raising exceptions
+ **Deterministic Scoring**: Uses temperature=0.0 to ensure consistent scores across evaluations
+ **Flexible Reference Format**: Automatically handles both string and dictionary reference answers
+ **Score Clamping**: Ensures all scores fall within valid [0.0, 1.0] range
+ **Model Agnostic**: Change `JUDGE_MODEL_ID` to use any Amazon Bedrock model (Nova, Llama, Mistral, etc.)

```
"""  
LLM Judge Lambda POC - Working implementation using Amazon Bedrock  
"""  
  
import json  
import time  
import boto3  
  
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')  
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  
SYSTEM_PROMPT = "You must output ONLY a number between 0.0 and 1.0. No explanations, no text, just the number."  
  
JUDGE_PROMPT_TEMPLATE = """Compare the following two responses and rate how similar they are on a scale of 0.0 to 1.0, where:  
- 1.0 means the responses are semantically equivalent (same meaning, even if worded differently)  
- 0.5 means the responses are partially similar  
- 0.0 means the responses are completely different or contradictory  
  
Response A: {response_a}  
  
Response B: {response_b}  
  
Output ONLY a number between 0.0 and 1.0. No explanations."""  
  
  
def lambda_graded(response_a: str, response_b: str, max_retries: int = 3) -> float:  
    """Call Bedrock to compare responses and return similarity score."""  
    prompt = JUDGE_PROMPT_TEMPLATE.format(response_a=response_a, response_b=response_b)  
      
    for attempt in range(max_retries):  
        try:  
            response = bedrock_runtime.converse(  
                modelId=JUDGE_MODEL_ID,  
                messages=[{"role": "user", "content": [{"text": prompt}]}],  
                system=[{"text": SYSTEM_PROMPT}],  
                inferenceConfig={"temperature": 0.0, "maxTokens": 10}  
            )  
            print(f"Bedrock call successful: {response}")  
            output = response['output']['message']['content'][0]['text'].strip()  
            score = float(output)  
            print(f"Score parsed: {score}")  
            return max(0.0, min(1.0, score))  
                  
        except Exception as e:  
            if "ThrottlingException" in str(e) and attempt < max_retries - 1:  
                time.sleep(2 ** attempt)  
            else:  
                print(f"Bedrock call failed: {e}")  
                # Return 0.0 on failure rather than None, so the output  
                # always contains a valid float score  
                return 0.0  
    return 0.0  
  
  
def lambda_handler(event, context):  
    """AWS Lambda handler - processes samples from RFTEvalInvoker."""  
    try:  
        samples = event if isinstance(event, list) else [event]  
        results = []  
          
        for sample in samples:  
            sample_id = sample.get("id", "unknown")  
            messages = sample.get("messages", [])  
              
            # Extract assistant response (response A)  
            response_a = ""  
            for msg in messages:  
                if msg.get("role") in ["assistant", "nova_assistant"]:  
                    response_a = msg.get("content", "")  
                    break  
              
            # Extract reference answer from root level (no longer in metadata)  
            reference_answer = sample.get("reference_answer", "")  
              
            # Handle both string and dict reference_answer formats  
            if isinstance(reference_answer, dict):  
                # If reference_answer is a dict, extract the explanation or compliant field  
                response_b = reference_answer.get("explanation", reference_answer.get("compliant", ""))  
            else:  
                response_b = reference_answer  
              
            if not response_a or not response_b:  
                results.append({  
                    "id": sample_id,  
                    "aggregate_reward_score": 0.0,  
                    "metrics_list": [{"name": "similarity_score", "value": 0.0, "type": "Metric"}]  
                })  
                continue  
              
            # Get similarity score  
            score = lambda_graded(response_a, response_b)  
              
            results.append({  
                "id": sample_id,  
                "aggregate_reward_score": score,  
                "metrics_list": [  
                    {  
                        "name": "similarity_score",  
                        "value": score,  
                        "type": "Metric"  
                    }  
                ]  
            })  
          
        return {"statusCode": 200, "body": json.dumps(results)}  
          
    except Exception as e:  
        print(f"Error: {e}")  
        return {"statusCode": 500, "body": json.dumps({"error": str(e)})}
```

### Input format


The Lambda receives the same input format as other reward functions:

```
{  
    "id": "sample-001",  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        },  
        {  
            "role": "assistant",  
            "content": "As an AI developed by Amazon, I don't have a dedicated security team..."  
        }  
    ],  
    "reference_answer": {  
        "compliant": "No",  
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    },  
    "my_custom_field": "custom_value"  
}
```

### Output format


```
{  
    "id": "sample-001",  
    "aggregate_reward_score": 0.85,  
    "metrics_list": [  
        {  
            "name": "similarity_score",  
            "value": 0.85,  
            "type": "Metric"  
        }  
    ]  
}
```

### Deployment considerations


Consider the following when deploying this function. You may also need to adjust the prompt template and inference parameters based on your chosen model's capabilities and API format.
+ **IAM Permissions**: Lambda execution role must have `bedrock:InvokeModel` permission for your chosen model
+ **Timeout**: Set Lambda timeout to at least 60 seconds to accommodate Bedrock API latency and retries
+ **Region**: Deploy in a region where your chosen Bedrock model is available
+ **Cost**: Monitor Bedrock API usage as each evaluation makes one API call per sample
+ **Throughput**: For large-scale evaluations, request increased Bedrock quotas to avoid throttling

**Increasing Bedrock Throughput**

If you experience throttling during evaluation, increase your Bedrock model quotas:
+ Navigate to the AWS Service Quotas console
+ Search for "Bedrock" and select your region
+ Find the quota for your chosen model (for example, "Invocations per minute for Claude 3.5 Sonnet")
+ Choose "Request quota increase" and specify your desired throughput
+ Provide justification for the increase (for example, "RFT evaluation workload")

The Lambda's built-in retry logic handles occasional throttling, but sustained high-volume evaluations require appropriate quota increases.

**Required IAM Policy:**

```
{  
    "Version": "2012-10-17",  
    "Statement": [  
        {  
            "Effect": "Allow",  
            "Action": [  
                "bedrock:InvokeModel"  
            ],  
            "Resource": "arn:aws:bedrock:*::foundation-model/*"  
        }  
    ]  
}
```