

# Amazon Nova customization on SageMaker HyperPod
On SageMaker HyperPod

You can customize Amazon Nova models, including the enhanced Amazon Nova 2.0 models, using [Amazon Nova recipes](nova-model-recipes.md) and train them on SageMaker HyperPod. A recipe is a YAML configuration file that tells SageMaker AI how to run your model customization job. SageMaker HyperPod supports two types of services: Forge and non-Forge.

SageMaker HyperPod offers high-performance computing with optimized GPU instances and Amazon FSx for Lustre storage, robust monitoring through integration with tools such as TensorBoard, flexible checkpoint management for iterative improvement, seamless deployment to Amazon Bedrock for inference, and efficient, scalable multi-node distributed training. Together, these capabilities provide a secure, performant, and flexible environment for tailoring Amazon Nova models to your specific business requirements.

Amazon Nova customization on SageMaker HyperPod stores model artifacts, including model checkpoints, in a service-managed Amazon S3 bucket. Artifacts in the service-managed bucket are encrypted with SageMaker AI-managed AWS KMS keys. Service-managed Amazon S3 buckets don't currently support data encryption using customer-managed KMS keys. You can use this checkpoint location for evaluation jobs or Amazon Bedrock inference.

Standard pricing applies for compute instances, Amazon S3 storage, and FSx for Lustre. For pricing details, see [SageMaker HyperPod pricing](https://aws.amazon.com/sagemaker-ai/pricing/), [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/), and [FSx for Lustre pricing](https://aws.amazon.com/fsx/lustre/pricing/).

## Compute requirements for Amazon Nova 2 models


The following table summarizes the computational requirements for training Amazon Nova 2 models on SageMaker HyperPod and with SageMaker AI training jobs.


**Nova 2 Training Requirements**  

| Training Technique | Minimum Instances | Instance Type | GPU Count | Notes | Supported Models | 
| --- |--- |--- |--- |--- |--- |
| SFT (LoRA) | 4 | P5.48xlarge | 16 | Parameter-efficient fine-tuning | Nova 2 Lite | 
| SFT (Full Rank) | 4 | P5.48xlarge | 32 | Full model fine-tuning | Nova 2 Lite | 
| RFT on SageMaker Training Jobs (LoRA) | 2 | P5.48xlarge | 16 | Custom Reward Functions in your AWS Environment | Nova 2 Lite | 
| RFT on SageMaker Training Jobs (Full Rank) | 4 | P5.48xlarge | 32 | 32K context length | Nova 2 Lite | 
| RFT on SageMaker HyperPod | 8 | P5.48xlarge | 64 | Default 8192 context length | Nova 2 Lite | 
| CPT | 4 | P5.48xlarge | 16 | Processes approximately 400M tokens per instance per day | Nova 2 Lite | 

To optimize your Amazon Nova model customization workflows on SageMaker HyperPod, follow these recommended best practices for efficient training, resource management, and successful model deployment.

## Best Practices for Amazon Nova customization


### Overview


This section provides an overview of customization techniques and helps you choose the best approach for your needs and available data.

#### Two stages of LLM training


Large language model training consists of two major stages: pre-training and post-training. During pre-training, the model processes tokens of raw text and optimizes for next-token prediction. This process creates a pattern completer that absorbs syntax, semantics, facts, and reasoning patterns from web and curated text. However, the pre-trained model doesn't understand instructions, user goals, or context-appropriate behavior. It continues text in whatever style fits its training distribution. A pre-trained model autocompletes rather than follows directions, produces inconsistent formatting, and can mirror undesirable biases or unsafe content from the training data. Pre-training builds general competence, not task usefulness.

Post-training transforms the pattern completer into a useful assistant. You run multiple rounds of Supervised Fine-Tuning (SFT) to teach the model to follow instructions, adhere to schemas and policies, call tools, and produce reliable outputs by imitating high-quality demonstrations. This alignment teaches the model to respond to prompts as tasks rather than text to continue. You then apply Reinforcement Fine-Tuning (RFT) to optimize behavior using measurable feedback (such as verifiers or an LLM-as-a-judge), balancing trade-offs like accuracy versus brevity, safety versus coverage, or multi-step reasoning under constraints. In practice, you alternate SFT and RFT in cycles to shape the pre-trained model into a reliable, policy-aligned system that performs complex tasks consistently.

### Choose the right customization approach


This section covers the post-training customization strategies: RFT and SFT.

#### Reinforcement fine-tuning (RFT)


Reinforcement fine-tuning improves model performance through feedback signals—measurable scores or rewards that indicate response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards. This approach works well for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications that require nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that you can programmatically evaluate. For example, answering complex legal questions is an ideal use case for RFT because you want to teach the model how to reason better to answer questions more accurately.

##### How it works


In reinforcement fine-tuning, you start from an instruction-tuned baseline and treat each prompt like a small tournament. For a given input, you sample a handful of candidate answers from the model, score each one with the reward function, then rank them within that group. The update step nudges the model to make higher-scoring candidates more likely next time and lower-scoring ones less likely, while a stay-close-to-baseline constraint keeps behavior from drifting or becoming verbose or exploitative. You repeat this loop over many prompts, refreshing hard cases, tightening verifiers or judge rubrics when you see exploits, and continuously tracking task metrics.
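The within-group ranking step described above can be sketched in a few lines. This is an illustrative sketch of the idea only, not the Nova RFT implementation; the reward values and function names are hypothetical.

```python
# Sketch of group-relative ranking for RFT: candidates scored above the group
# average are made more likely, those below less likely. Illustrative only;
# the actual update also applies a stay-close-to-baseline constraint.

def group_advantages(rewards):
    """Score each candidate relative to the mean of its prompt's group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled answers for one prompt, scored by a reward function or judge.
rewards = [0.9, 0.2, 0.6, 0.3]
advantages = group_advantages(rewards)  # approximately [0.4, -0.3, 0.1, -0.2]
```

Candidates with positive advantage are reinforced in the update step. Because the baseline is the group mean, the signal depends only on relative quality within the sampled group, which is why graded or partial-credit rewards work well here.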

##### When to use RFT


Tasks that benefit most from RFT share several traits. They have measurable success signals even when a single correct output is hard to specify. They admit partial credit or graded quality so you can rank better versus worse answers within a prompt or using a reward function. They involve multiple objectives that must be balanced (such as accuracy with brevity, clarity, safety, or cost). They require adherence to explicit constraints that you can programmatically check. They operate in tool-mediated or environment-based settings where outcomes are observable (success or failure, latency, resource use). They occur in low-label regimes where collecting gold targets is expensive but automated or rubric-based feedback is plentiful. RFT works best when you can turn quality into a reliable scalar or ranking and want the model to preferentially amplify higher-scoring behaviors without needing exhaustive labeled targets.

**Consider other methods when:**
+ You have plentiful, reliable labeled input-output pairs – Use SFT
+ The main gap is knowledge or jargon – Use retrieval-augmented generation (RAG)
+ Your reward signal is noisy or unreliable and you can't fix it with better rubrics or checkers – Stabilize that first before RFT

##### When not to use RFT


Avoid RFT in these situations:
+ You can cheaply produce reliable labeled input-output pairs (SFT is simpler, cheaper, and more stable)
+ The gap is knowledge or jargon rather than behavior (use RAG)
+ Your reward signal is noisy, sparse, easy to game, or expensive or slow to compute (fix the evaluator first)
+ Baseline performance is near-zero (bootstrap with SFT before optimizing preferences)
+ The task has deterministic schemas, strict formatting, or a single correct answer (SFT or rule-based validation works better)
+ Tight latency or cost budgets can't absorb the extra sampling or exploration RFT requires
+ Safety or policy constraints aren't crisply specified and enforceable in the reward

If you can point to "the right answer," use SFT. If you need new knowledge, use RAG. Use RFT only after you have a solid baseline and a robust, fast, hard-to-exploit reward function.

#### Supervised fine-tuning (SFT)


Supervised fine-tuning trains the LLM on a dataset of human-labeled input-output pairs for your task. You provide examples of prompts (questions, instructions, and so on) with the correct or desired responses, and continue training the model on these examples. The model adjusts its weights to minimize a supervised loss (typically cross-entropy between its predictions and the target output tokens). This is the same training used in most supervised machine learning tasks, applied to specialize an LLM.

SFT changes behavior, not knowledge. It doesn't teach the model new facts or jargon it didn't see in pre-training. It teaches the model how to answer, not what to know. If you need new domain knowledge (such as internal terminology), use retrieval-augmented generation (RAG) to provide that context at inference time. SFT then adds the desired instruction-following behavior on top.

##### How it works


SFT optimizes the LLM by minimizing the average cross-entropy loss on response tokens, treating prompt tokens as context and masking them from the loss. The model internalizes your target style, structure, and decision rules, learning to generate the correct completion for each prompt. For example, to classify documents into custom categories, you fine-tune the model with prompts (the document text) and labeled completions (the category labels). You train on those pairs until the model outputs the right label for each prompt with high probability.
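The prompt-masked loss can be illustrated with a minimal sketch (plain Python, hypothetical numbers; real training computes this over batches of token logits):

```python
import math

def sft_loss(target_probs, is_response):
    """Average cross-entropy over response tokens only; prompt tokens
    (is_response=False) serve as context and are masked from the loss."""
    losses = [-math.log(p) for p, keep in zip(target_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Probability the model assigns to each target token of one training example.
probs = [0.10, 0.20, 0.90, 0.80]
mask = [False, False, True, True]  # first two tokens belong to the prompt
loss = sft_loss(probs, mask)  # only -log(0.90) and -log(0.80) contribute
```

Training drives this masked loss down, pushing the probability of the labeled completion tokens toward 1 for each prompt while ignoring how the prompt itself would be predicted.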

You can perform SFT with a few hundred examples and scale up to a few hundred thousand. SFT samples must be high quality and directly aligned with the desired model behavior.

##### When to use SFT


Use SFT when you have a well-defined task with clear desired outputs. If you can explicitly state "Given X input, the correct output is Y" and gather examples of such mappings, supervised fine-tuning is a good choice. SFT excels in these scenarios:
+ **Structured or complex classification tasks** – Classify internal documents or contracts into many custom categories. With SFT, the model learns these specific categories better than prompting alone.
+ **Question-answering or transformation tasks with known answers** – Fine-tune a model to answer questions from a company's knowledge base, or convert data between formats where each input has a correct response.
+ **Formatting and style consistency** – Train the model to always respond in a certain format or tone by fine-tuning on examples of the correct format or tone. For instance, training on prompt-response pairs that demonstrate a particular brand voice teaches the model to generate outputs with that style. Instruction-following behavior is often initially taught through SFT on curated examples of good assistant behavior.

SFT is the most direct way to teach an LLM a new skill or behavior when you can specify what the right behavior looks like. It uses the model's existing language understanding and focuses it on your task. Use SFT when you want the model to do a specific thing and you have or can create a dataset of examples.

Use SFT when you can assemble high-quality prompt and response pairs that closely mirror the behavior you want. It fits tasks with clear targets or deterministic formats such as schemas, function or tool calls, and structured answers where imitation is an appropriate training signal. The goal is behavior shaping: teaching the model to treat prompts as tasks, follow instructions, adopt tone and refusal policies, and produce consistent formatting. Plan for at least hundreds of demonstrations, with data quality, consistency, and deduplication mattering more than raw volume. For a straightforward, cost-efficient update, use parameter-efficient methods like Low-Rank Adaptation to train small adapters while leaving most of the backbone untouched.
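To see why parameter-efficient methods like Low-Rank Adaptation are cost-efficient, compare trainable parameter counts. A quick sketch; the layer size and rank below are hypothetical, and actual Nova adapter configurations may differ:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters: a LoRA adapter (two low-rank factors)
    versus the full weight matrix it stands in for."""
    adapter = rank * (d_in + d_out)  # A: rank x d_in, B: d_out x rank
    full = d_in * d_out
    return adapter, full

# A single 4096x4096 layer with a rank-16 adapter.
adapter, full = lora_params(4096, 4096, rank=16)  # 131072 vs 16777216 weights
```

Here the adapter trains under 1% of the layer's weights, which is why LoRA variants of a technique typically need fewer GPUs than their full-rank counterparts in the compute-requirements table.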

##### When not to use SFT


Don't use SFT when the gap is knowledge rather than behavior. It doesn't teach the model new facts, jargon, or recent events. In those cases, use retrieval-augmented generation to bring external knowledge at inference. Avoid SFT when you can measure quality but can't label a single right answer. Use reinforcement fine-tuning with verifiable rewards or an LLM-as-a-judge to optimize those rewards directly. If your needs or content change frequently, rely on retrieval and tool use rather than retraining the model.

**Topics**
+ [Compute requirements for Amazon Nova 2 models](#nova-hp-compute-2)
+ [Best Practices for Amazon Nova customization](#best-practices)
+ [Nova Forge SDK](nova-hp-forge-sdk.md)
+ [Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)](nova-hp-cluster.md)
+ [Amazon SageMaker HyperPod Essential Commands Guide](nova-hp-essential-commands-guide.md)
+ [Nova Forge access and setup](nova-forge-hp-access.md)
+ [Training for Amazon Nova models](nova-hp-training.md)
+ [Evaluating your trained model](nova-hp-evaluate.md)
+ [Monitoring HyperPod jobs with MLflow](nova-hp-mlflow.md)

# Nova Forge SDK


The Amazon Nova Forge SDK is a comprehensive Python SDK that provides a unified, programmatic interface for the complete Amazon Nova model customization lifecycle. The SDK simplifies model customization by offering a single, consistent API for training, evaluation, monitoring, deployment, and inference across the Amazon SageMaker and Amazon Bedrock platforms.

For more information, see [Nova Forge SDK](nova-forge-sdk.md).

# Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)
HP cluster setup

To customize a model on SageMaker HyperPod, you must first set up the necessary infrastructure. For details on setting up a SageMaker HyperPod EKS cluster with a restricted instance group (RIG), see the [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US), which provides a detailed walkthrough of the setup process.

# Amazon SageMaker HyperPod Essential Commands Guide
Essential Commands Guide

Amazon SageMaker HyperPod provides extensive command-line functionality for managing training workflows. This guide covers essential commands for common operations, from connecting to your cluster to monitoring job progress.

**Prerequisites**  
Before using these commands, ensure you have completed the following setup:
+ SageMaker HyperPod cluster with RIG created (typically in us-east-1)
+ Output Amazon S3 bucket created for training artifacts
+ IAM roles configured with appropriate permissions
+ Training data uploaded in correct JSONL format
+ FSx for Lustre sync completed (verify in cluster logs on first job)

**Topics**
+ [Installing Recipe CLI](#nova-hp-essential-commands-guide-install)
+ [Connecting to your cluster](#nova-hp-essential-commands-guide-connect)
+ [Starting a training job](#nova-hp-essential-commands-guide-start-job)
+ [Checking job status](#nova-hp-essential-commands-guide-status)
+ [Monitoring job logs](#nova-hp-essential-commands-guide-logs)
+ [Listing active jobs](#nova-hp-essential-commands-guide-list-jobs)
+ [Canceling a job](#nova-hp-essential-commands-guide-cancel-job)
+ [Running an evaluation job](#nova-hp-essential-commands-guide-evaluation)
+ [Common issues](#nova-hp-essential-commands-guide-troubleshooting)

## Installing Recipe CLI


Navigate to the root of your recipe repository before running the installation command.

**Use the SageMaker HyperPod recipes repository for non-Forge customization techniques; for Forge-based customization, use the Forge-specific recipe repository.**  
Run the following commands to install the SageMaker HyperPod CLI:

**Note**  
Make sure you aren't in an active conda/Anaconda/Miniconda environment or another virtual environment.  
If you are, exit the environment using:  
`conda deactivate` for conda/Anaconda/Miniconda environments
`deactivate` for Python virtual environments

If you are using a non-Forge customization technique, download the sagemaker-hyperpod-cli repository, which includes the recipes, as follows:

```
git clone -b release_v2 https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install -e .
cd ..
root_dir=$(pwd)
export PYTHONPATH=${root_dir}/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

If you are a **Forge subscriber**, download the recipes using the following process:

```
mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
pip install -e .

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

**Tip**  
To use a [new virtual environment](https://docs.python.org/3/library/venv.html) before running `pip install -e .`, run:  
`python -m venv nova_forge`
`source nova_forge/bin/activate`
Your command line will now display `(nova_forge)` at the beginning of your prompt.  
This ensures there are no competing dependencies when using the CLI.

**Purpose**: Why run `pip install -e .`?

This command installs the SageMaker HyperPod CLI in editable mode, allowing you to use updated recipes without reinstalling each time. It also enables you to add new recipes that the CLI can automatically pick up.

## Connecting to your cluster


Connect the SageMaker HyperPod CLI to your cluster before running any jobs:

```
export AWS_REGION=us-east-1 && hyperpod connect-cluster --cluster-name <your-cluster-name> --region us-east-1
```

**Important**  
This command creates a context file (`/tmp/hyperpod_context.json`) that subsequent commands require. If you see an error about this file not found, re-run the connect command.

**Pro tip**: You can further configure your cluster to always use the `kubeflow` namespace by adding the `--namespace kubeflow` argument to your command as follows:

```
export AWS_REGION=us-east-1 && \
hyperpod connect-cluster \
--cluster-name <your-cluster-name> \
--region us-east-1 \
--namespace kubeflow
```

This saves you the effort of adding the `-n kubeflow` in every command when interacting with your jobs.

## Starting a training job


**Note**  
If you are running PPO/RFT jobs, add label selector settings to `src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/cluster/k8s.yaml` so that all pods are scheduled on the restricted instance group nodes:

```
label_selector:
  required:
    sagemaker.amazonaws.com/instance-group-name:
      - <rig_group>
```

Launch a training job using a recipe with optional parameter overrides:

```
hyperpod start-job -n kubeflow \
  --recipe fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft \
  --override-parameters '{
    "instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest"
  }'
```

**Expected output**:

```
Final command: python3 <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=fine-tuning/nova/nova_micro_p5_gpu_sft cluster_type=k8s cluster=k8s base_results_dir=/local/home/<username>/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" container="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:HP-SFT-DATAMIX-latest"

Prepared output directory at /local/home/<username>/results/<job-name>/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/<username>/results/<job-name>/<job-name>_launch.sh
Running Helm script: /local/home/<username>/results/<job-name>/<job-name>_launch.sh

NAME: <job-name>
LAST DEPLOYED: Mon Sep 15 20:56:50 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/<your-cluster-name>"
}
```

## Checking job status


Monitor your running jobs using kubectl:

```
kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep <your-job-name>)
```

**Understanding pod statuses**  
The following table explains common pod statuses:


| Status | Description | 
| --- |--- |
| `Pending` | Pod accepted but not yet scheduled onto a node, or waiting for container images to be pulled | 
| `Running` | Pod bound to a node with at least one container running or starting | 
| `Succeeded` | All containers completed successfully and won't restart | 
| `Failed` | All containers terminated with at least one ending in failure | 
| `Unknown` | Pod state cannot be determined (usually due to node communication issues) | 
| `CrashLoopBackOff` | Container repeatedly failing; Kubernetes backing off from restart attempts | 
| `ImagePullBackOff` / `ErrImagePull` | Unable to pull container image from registry | 
| `OOMKilled` | Container terminated for exceeding memory limits | 
| `Completed` | Job or Pod finished successfully (batch job completion) | 

**Tip**  
Use the `-w` flag to watch pod status updates in real-time. Press `Ctrl+C` to stop watching.

## Monitoring job logs


You can view your logs in one of three ways:

**Using CloudWatch**  
Your logs are available under CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, navigate to the CloudWatch console in your account and search for your cluster name. For example, if your cluster is named `my-hyperpod-rig`, the log group has the prefix:
+ **Log group**: `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}`
+ Once you're in the log group, you can find your specific log stream using the node instance ID, such as `hyperpod-i-00b3d8a1bf25714e4`.
  + Here, `i-00b3d8a1bf25714e4` is the EC2 instance ID of the HyperPod node where your training job is running. Recall that the output of the earlier command `kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-cpt-run)` included a column called **NODE**.
  + In this case, the "master" node ran on `hyperpod-i-00b3d8a1bf25714e4`, so use that string to select the log stream. Select the one named `SagemakerHyperPodTrainingJob/rig-group/[NODE]`.
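The stream-selection rule above can be captured in a small helper. The `SagemakerHyperPodTrainingJob/rig-group/[NODE]` pattern is the one shown in this guide, not a published API; the function name is hypothetical.

```python
def log_stream_for_node(node, rig_group="rig-group"):
    """Build the CloudWatch log stream name for a given HyperPod NODE value
    (as reported by `kubectl get pods -o wide`)."""
    return f"SagemakerHyperPodTrainingJob/{rig_group}/{node}"

stream = log_stream_for_node("hyperpod-i-00b3d8a1bf25714e4")
# -> "SagemakerHyperPodTrainingJob/rig-group/hyperpod-i-00b3d8a1bf25714e4"
```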

**Using CloudWatch Insights**  
If you have your job name handy and don't wish to go through all the steps above, you can simply query all logs under `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}` to find the individual log stream.

CPT:

```
fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting CPT Job/
| sort @timestamp desc
| limit 100
```

For job completion, replace `Starting CPT Job` with `CPT Job completed`.

Then you can click through the results and pick the one that says "Epoch 0" since that will be your master node.

**Using the AWS CLI**  
You can also tail your logs using the AWS CLI. Before doing so, check your AWS CLI version using `aws --version`, then use the command below that matches your version for live log tracking in your terminal.

**For AWS CLI v1**:

```
aws logs get-log-events \
--log-group-name /aws/sagemaker/YourLogGroupName \
--log-stream-name YourLogStream \
--start-from-head | jq -r '.events[].message'
```

**For AWS CLI v2**:

```
aws logs tail /aws/sagemaker/YourLogGroupName \
  --log-stream-names YourLogStream \
  --since 10m \
  --follow
```

## Listing active jobs


View all jobs running in your cluster:

```
hyperpod list-jobs -n kubeflow
```

**Example output**:

```
{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}
```

## Canceling a job


Stop a running job at any time:

```
hyperpod cancel-job --job-name <job-name> -n kubeflow
```

**Finding your job name**  
**Option 1: From your recipe**

The job name is specified in your recipe's `run` block:

```
run:
  name: "my-test-run"                        # This is your job name
  model_type: "amazon.nova-micro-v1:0:128k"
  ...
```

**Option 2: From list-jobs command**

Use `hyperpod list-jobs -n kubeflow` and copy the `Name` field from the output.

## Running an evaluation job


Evaluate a trained model or base model using an evaluation recipe.

**Prerequisites**  
Before running evaluation jobs, ensure you have:
+ Checkpoint Amazon S3 URI from your training job's `manifest.json` file (for trained models)
+ Evaluation dataset uploaded to Amazon S3 in the correct format
+ Output Amazon S3 path for evaluation results

**Command**  
Run the following command to start an evaluation job:

```
hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'
```

**Parameter descriptions**:
+ `recipes.run.name`: Unique name for your evaluation job
+ `recipes.run.model_name_or_path`: Amazon S3 URI from `manifest.json` or base model path (e.g., `nova-micro/prod`)
+ `recipes.run.output_s3_path`: Amazon S3 location for evaluation results
+ `recipes.run.data_s3_path`: Amazon S3 location of your evaluation dataset

**Tips**:
+ **Model-specific recipes**: Each model size (micro, lite, pro) has its own evaluation recipe
+ **Base model evaluation**: Use base model paths (e.g., `nova-micro/prod`) instead of checkpoint URIs to evaluate base models

**Evaluation data format**  
**Input format (JSONL)**:

```
{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
```

**Output format**:

```
{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}
```

**Field descriptions**:
+ `prompt`: Formatted input sent to the model
+ `inference`: Model's generated response
+ `gold`: Expected correct answer from input dataset
+ `metadata`: Optional metadata passed through from input
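Given the input format above, a quick pre-flight check of your evaluation dataset can catch formatting problems before you submit a job. This is an illustrative sketch; the required/optional split is an assumption based on the field descriptions (`metadata` is documented as optional).

```python
import json

REQUIRED = {"query", "response"}   # assumed required: the prompt and gold answer
OPTIONAL = {"system", "metadata"}  # metadata is documented as optional

def validate_eval_line(line, line_no=0):
    """Parse one JSONL line and verify it matches the expected schema."""
    record = json.loads(line)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
    unknown = record.keys() - REQUIRED - OPTIONAL
    if unknown:
        raise ValueError(f"line {line_no}: unexpected fields {sorted(unknown)}")
    return record

sample = '{"system": "answer with no other words: ", "query": "2, 4, 8, ?", "response": "16"}'
record = validate_eval_line(sample)  # parses and passes the schema check
```

Running a check like this over every line of your dataset before uploading it to the `data_s3_path` location is cheaper than discovering a malformed record mid-job.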

## Common issues

+ `ModuleNotFoundError: No module named 'nemo_launcher'` – you may have to add `nemo_launcher` to your Python path, depending on where `hyperpod_cli` is installed. Sample command:

  ```
  export PYTHONPATH=<path_to_hyperpod_cli>/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
  ```
+ `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/hyperpod_current_context.json'` indicates that you haven't run the `hyperpod connect-cluster` command.
+ If you don't see your job scheduled, double-check that the SageMaker HyperPod CLI output includes the deployment section with the job name and other metadata. If not, re-install Helm by running:

  ```
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  rm -f ./get_helm.sh
  ```

# Nova Forge access and setup
Nova Forge access and setup

To set up Amazon Nova Forge for use with training jobs, you need to:
+ Subscribe to Amazon Nova Forge
+ Set up a cluster

**Topics**
+ [Subscribe to Amazon Nova Forge](nova-forge-subscribing.md)
+ [Set up infrastructure](nova-forge-hyperpod-setup.md)
+ [Responsible AI](nova-forge-responsible-ai.md)

# Subscribe to Amazon Nova Forge
Subscribe to Nova Forge

To access Amazon Nova Forge features, complete the following steps:

1. Verify administrator access to the AWS account.

1. Navigate to the SageMaker AI console and [request access to Amazon Nova Forge](nova-forge.md).

1. Wait for the Amazon Nova team to email a confirmation after the subscription request is approved.

1. Tag your execution role with the `forge-subscription` tag. This tag is required for accessing Amazon Nova Forge features and checkpoints. Add the following tag to your execution role:
   + Key: `forge-subscription`
   + Value: `true`

**Note**  
Standard Amazon Nova features remain available without a Forge subscription. Amazon Nova Forge is designed for building custom frontier models with control and flexibility across all model training phases.

# Set up infrastructure
Set up HyperPod infrastructure

Once your Amazon Nova Forge subscription is approved, set up the necessary infrastructure to use Forge-enabled features. For detailed instructions on creating an EKS cluster with a restricted instance group (RIG), follow the [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US) instructions.

# Responsible AI
Responsible AI

**Content moderation settings**: Amazon Nova Forge customers have access to Customizable Content Moderation Settings (CCMS) for Amazon Nova Lite 1.0 and Pro 1.0 models. CCMS allows you to adjust content moderation controls to align with your specific business requirements while maintaining essential responsible AI safeguards. To determine if your business use case qualifies for CCMS, contact your Amazon Web Services account manager.

Amazon Nova Forge provides a Responsible AI Toolkit that includes training data, evaluation benchmarks, and runtime controls to help you align your models with Amazon Nova's responsible AI guidelines.

**Training data**: The "RAI" category in data mixing contains cases and scenarios emphasizing responsible AI principles, safety considerations, and responsible technology deployment. Use these to align your models responsibly during continued pre-training.

**Evaluations**: Benchmark tasks are available to test your model's ability to detect and reject inappropriate, harmful, or incorrect content. Use these evaluations to measure the difference between base model performance and your custom model performance.

# Training for Amazon Nova models
Training

Training Amazon Nova models on SageMaker HyperPod supports multiple techniques including Continued Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Fine-Tuning (RFT). Each technique serves different customization needs and can be applied to different Amazon Nova model versions.

**Topics**
+ [Continued pre-training (CPT)](nova-cpt.md)
+ [Supervised fine-tuning (SFT)](nova-fine-tune.md)
+ [Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod](nova-hp-rft.md)

# Continued pre-training (CPT)
Continued pre-training (CPT)

Continued pre-training (CPT) is a training technique that extends the pre-training phase of a foundation model by exposing it to additional unlabeled text from specific domains or corpora. Unlike supervised fine-tuning, which requires labeled input-output pairs, CPT trains on raw documents to help the model acquire deeper knowledge of new domains, learn domain-specific terminology and writing patterns, and adapt to particular content types or subject areas.

This approach is particularly valuable when you have large volumes (tens of billions of tokens) of domain-specific text data, such as legal documents, medical literature, technical documentation, or proprietary business content, and you want the model to develop native fluency in that domain. Generally, after the CPT stage, the model needs to undergo additional instruction tuning stages to enable the model to use the newly acquired knowledge and complete useful tasks.

**Supported models**  
CPT is available for the following Amazon Nova models:
+ Nova 1.0 (Micro, Lite, Pro)
+ Nova 2.0 (Lite)

**When to use Nova 1.0 versus Nova 2.0**  
The Amazon Nova family of models offers multiple price-performance operating points to optimize between accuracy, speed, and cost.

Choose Nova 2.0 when you need the following:
+ Advanced reasoning capabilities for complex analytical tasks
+ Superior performance on coding, math, and scientific problem-solving
+ Longer context length support
+ Better multilingual performance

**Note**  
The larger model is not always better. Consider the cost-performance tradeoff and your specific business requirements when selecting between Nova 1.0 and Nova 2.0 models.

# CPT on Nova 2.0


Amazon Nova Lite 2.0 is a reasoning model trained on larger and more diverse datasets than Nova Lite 1.0. Despite being a larger model, Nova Lite 2.0 delivers faster inference than Nova Lite 1.0 while offering enhanced reasoning capabilities, longer context lengths, and improved multilingual performance.

CPT on Nova 2.0 allows you to extend these advanced capabilities with your domain-specific data, enabling the model to develop deep expertise in specialized areas while maintaining its superior reasoning and analytical abilities.

## Sample CPT recipe


The following is a sample recipe for CPT. You can find this recipe and others in the [recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/nova) repository.

```
# Note:
# This recipe can run on p5.48xlarge
# Run config
run:
  name: "my-cpt-run"                           # A descriptive name for your training job
  model_type: "amazon.nova-2-lite-v1:0:256k"   # Model variant specification, do not change
  model_name_or_path: "nova-lite-2/prod"        # Base model path, do not change
  replicas: 8                                   # Number of compute instances for training, allowed values are 4, 8, 16, 32
  data_s3_path: ""                              # Customer data paths
  validation_data_s3_path: ""                   # Customer validation data paths
  output_s3_path: ""                            # Output artifact path; job-specific configuration, not compatible with standard SageMaker Training Jobs
  mlflow_tracking_uri: ""                       # Required for MLFlow
  mlflow_experiment_name: "my-cpt-experiment"   # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-cpt-run"                 # Optional for MLFlow. Note: leave this field non-empty

## Training specific configs
training_config:
  task_type: cpt
  max_length: 8192                              # Maximum context window size (tokens)
  global_batch_size: 256                        # Global batch size, allowed values are 32, 64, 128, 256.

  trainer:
    max_steps: 10                               # The number of training steps to run total
    val_check_interval: 10                      # The number of steps between running validation. Integer count or float percentage
    limit_val_batches: 2                        # Batches of the validation set to use each trigger

  model:
    hidden_dropout: 0.0                         # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0                      # Dropout for attention weights, must be between 0.0 and 1.0

  optim:
    optimizer: adam
    lr: 1e-5                                    # Learning rate
    name: distributed_fused_adam                # Optimizer algorithm, do not change
    adam_w_mode: true                           # Enable AdamW mode
    eps: 1e-06                                  # Epsilon for numerical stability
    weight_decay: 0.0                           # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                             # Beta1 for Adam optimizer
    adam_beta2: 0.95                            # Beta2 for Adam optimizer
    sched:
      warmup_steps: 10                          # Learning rate warmup steps
      constant_steps: 0                         # Steps at constant learning rate
      min_lr: 1e-6                              # Minimum learning rate, must be lower than lr
```

## Data preparation for CPT on 2.0


**Data format requirements**  
Training and validation datasets must be JSONL files in which each line is a JSON object with a `text` field. Here is an example:

```
{"text": "AWS stands for Amazon Web Services"}
{"text": "Amazon SageMaker is a fully managed machine learning service"}
{"text": "Amazon Bedrock is a fully managed service for foundation models"}
```

Text entries should contain naturally flowing, high-quality content that represents the target domain.
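Before converting the data, it can help to lint the JSONL file itself. The following is a minimal sketch (the function name is illustrative) that flags lines that are not valid JSON or that lack a non-empty `text` field:

```python
import json

def validate_cpt_jsonl(path):
    """Return a list of (line_number, problem) pairs for a CPT JSONL file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, f"invalid JSON: {e}"))
                continue
            text = record.get("text")
            if not isinstance(text, str) or not text.strip():
                errors.append((i, 'missing or empty "text" field'))
    return errors
```

An empty result means every line is well-formed; otherwise the line numbers point directly at the offending records.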

Test that the data can be converted into [Arrow format](https://huggingface.co/docs/datasets/en/about_arrow). The following Python script can help; make sure you use `datasets` version 2.18.0 or later:

```
from datasets import load_dataset, load_from_disk
from pathlib import Path

input_path = Path("<Your jsonl file>")
output_path = Path("<Your output directory>")

dataset = load_dataset("json", data_files=str(input_path), split="train")
dataset.save_to_disk(str(output_path), max_shard_size="1GB")

try:
    test_dataset = load_from_disk(str(output_path))
    print(f"Dataset loaded successfully ✅! Contains {len(test_dataset)} samples")
except Exception as e:
    print(e)
```

It should print the same number of lines that were in the JSONL file.

When using data mixing, run the first job with `max_steps=2`. This warms up the cluster's data-access optimizations and validates that all of the data mixes are available.

**How to prepare data for CPT**  
Training data is the most important factor in the success of continued pre-training. While CPT data is often described as "unlabeled," the reality is far more nuanced. How data is structured, formatted, and presented determines whether the model will acquire the knowledge and skills required for the business use case.

### Preparing structured business datasets for CPT


This is a common challenge for companies and organizations building foundation models specialized in their domain. Most businesses possess rich repositories of structured data: product catalogs, user profiles, transaction logs, form submissions, API calls, and operational metadata. At first glance, this looks very different from the unstructured web text typically used in standard pre-training.

To effectively learn from structured business data, think carefully about downstream tasks and design the data presentation to force the model to learn the right predictive relationships.

To unlock the full potential of continuous pre-training, consider:
+ What tasks the model should perform at inference time
+ What information is present in the raw data
+ How to structure that data so the model learns to extract and manipulate the information correctly

Simply dumping structured data into training won't teach the model to reason about it. Actively shape the data presentation to guide what the model learns.

The following sections review literature demonstrating the importance of data augmentation and provide example augmentation strategies for structured business data, offering useful ideas on how to organize a business dataset for CPT.

**Structured data for CPT in the literature**  
CPT can pack domain facts into a model, but it often fails to make those facts retrievable and manipulable when inputs or tasks shift. Controlled experiments show that without diverse augmentation during pre-training, models memorize facts in brittle ways that remain hard to extract even after later instruction tuning, and the authors recommend injecting instruction-like signals early in training. For semi-structured data, randomized serialization and other augmentations reduce schema overfitting, which is why CPT should be interleaved with instruction-style tasks rather than run first with instruction fine-tuning (IFT) afterward. Finance-focused work further finds that jointly mixing CPT and instruction data at batch time improves generalization and reduces forgetting compared with the sequential recipe. The Qwen technical report converges on the same pattern by integrating high-quality instruction data into pre-training itself, which boosts in-context learning and preserves instruction following while acquiring new domain knowledge.

Data augmentation for semi-structured corpora is a key lever. Synthetic graph-aware CPT expands small domain sets into entity-linked corpora that explicitly teach relationships, and it compounds with retrieval at inference time. Joint CPT-plus-instruction mixing outperforms sequential pipelines in finance, and balancing domain data with general data lowers degradation on general skills. Very large-scale domain CPT can also retain broad ability and even allow trade-offs through model merging, yet it still points to instruction tuning as an essential next step, reinforcing the value of introducing instruction signals during CPT.

**Injecting diversity through randomization and shuffling**  
A general strategy that helps the model learn effectively from structured and semi-structured datasets is to shuffle the order of fields in the dataset, and even to randomly drop some keys.

Shuffling the fields forces the model to read what each value means instead of where it appears, and to learn the relationships among all the fields. For example, for a video game listed on the Amazon store, when "Title," "Platform," "Price," "Condition," and "Edition" arrive in different permutations, the model can't rely on "the third slot is platform"; it must bind labels to values and learn the bidirectional relationships among attributes: title ⇄ platform, platform ⇄ price, condition ⇄ price. It can then, for example, infer a likely platform from a game name and an observed price, or estimate a plausible price range given a title and platform.

Randomly dropping keys during serialization acts like feature dropout: it prevents co-adaptation on any one field and forces the model to recover missing information from the remaining evidence. If "Platform" is absent, the model must pick it up from the title string or compatibility text; if "Price" is hidden, it has to triangulate from platform, edition, and condition. This builds symmetry (A→B and B→A), robustness to messy real-world listings, and schema invariance when fields are missing, renamed, or reordered.

A shopping-style example makes this concrete. Serialize the same item multiple ways, for example "Title: 'Elden Ring' | Platform: PlayStation 5 | Condition: Used - Like New | Price: $134.99" and a permutation such as "Price: $134.99 | Title: 'Elden Ring' | Condition: Used - Like New | Platform: PlayStation 5", and on some passes drop "Platform" while leaving "Compatible with PS5" in the description. Train complementary objectives such as predicting the platform from {title, price} and predicting a price bucket from {title, platform}. Because the order and even the presence of keys vary, the only stable strategy is to learn the true relationships between attributes rather than memorize a template.
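The shuffle-and-dropout strategy can be sketched in a few lines of Python. The serializer below is illustrative: the field names, the separator, and the dropout rate are assumptions, not a prescribed format.

```python
import random

def serialize_record(record, key_dropout=0.2, seed=None):
    """Serialize a structured record with shuffled field order and random
    key dropout, so the model must bind labels to values instead of
    memorizing positions. Field names are illustrative."""
    rng = random.Random(seed)
    fields = list(record.items())
    rng.shuffle(fields)                           # randomize field order
    kept = [(k, v) for k, v in fields
            if rng.random() >= key_dropout]       # drop keys like feature dropout
    if not kept:                                  # never emit an empty sample
        kept = fields[:1]
    return " | ".join(f"{k}: {v}" for k, v in kept)

listing = {
    "Title": "'Elden Ring'",
    "Platform": "PlayStation 5",
    "Condition": "Used - Like New",
    "Price": "$134.99",
}
# Emit several augmented views of the same listing for the CPT corpus
views = [serialize_record(listing, seed=i) for i in range(4)]
```

Emitting several views per record is what teaches schema invariance; a single fixed serialization would let the model fall back on positional shortcuts.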

### The way data is presented matters


LLMs learn by predicting the next token from what they have already seen. So the order of fields and events shown during training decides what the model can learn. If the training format matches the real task, the loss lands on the exact decision tokens. If fields are tossed together without structure, the model learns shortcuts or memorizes popularity and then fails when asked to choose among options.

Show the situation first, then the options, then the decision. If the model should also learn about outcomes or explanations, put them after the decision.

### Packing samples for CPT


**What is packing?**  
Packing means filling each sequence window in the training data with multiple whole examples, so that the window is dense with real tokens rather than padding.

**Why it matters**  
During training a maximum context length is set, for example 8,192 tokens. Batches are shaped to [batch size × context length]. If a training example is shorter than the context length, the remaining positions are padded. Padding still runs through attention and MLP kernels even if loss is masked, so compute is paid for tokens that carry no learning signal.
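To make the cost concrete, here is a small sketch (the sample lengths are assumed for illustration) that computes the fraction of compute spent on padding when short samples each occupy their own 8,192-token window:

```python
# Fraction of compute spent on padding when short samples are not packed.
# The sample lengths below are illustrative; the context length matches
# the sample CPT recipe's max_length.
context_length = 8192
sample_lengths = [512, 1024, 2048, 800, 300]         # tokens per unpacked example

real_tokens = sum(sample_lengths)                    # tokens carrying signal
total_tokens = context_length * len(sample_lengths)  # one window per sample
padding_fraction = 1 - real_tokens / total_tokens    # share of wasted compute
```

Under these assumed lengths, roughly 89% of the window positions are padding, which is exactly the waste that packing recovers.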

**How to do packing?**  
To pack multiple samples, concatenate training samples with a ` [DOC] ` separator in between (note the space before and after `[DOC]`) such that the total length of each packed sequence stays under the desired context length.

An example packed document would look like this:

```
{"text": "training sample 1 [DOC] training sample 2 [DOC] training sample 3"}
```
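A greedy packing routine along these lines might look as follows. This is a sketch: the whitespace token counter is a stand-in for your actual tokenizer, and a sample longer than the context length still gets its own window rather than being truncated.

```python
def pack_samples(texts, max_length, count_tokens=lambda t: len(t.split())):
    """Greedily pack whole samples into windows joined by " [DOC] " so each
    window stays within max_length tokens. count_tokens is a stand-in;
    use your tokenizer's count in practice."""
    sep_tokens = count_tokens("[DOC]")
    packed, current, current_len = [], [], 0
    for text in texts:
        n = count_tokens(text)
        # Joining to an existing window also costs the separator tokens
        extra = n if not current else n + sep_tokens
        if current and current_len + extra > max_length:
            packed.append(" [DOC] ".join(current))  # flush the full window
            current, current_len = [], 0
            extra = n
        current.append(text)
        current_len += extra
    if current:
        packed.append(" [DOC] ".join(current))
    return packed
```

Greedy first-fit packing is simple and keeps samples in order; fancier bin-packing squeezes out a little more density but is rarely worth the complexity for CPT corpora.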

### CPT Tuning Parameters


The parameters that are available for fine-tuning with CPT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The only available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod` or the S3 path of a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and AWS Region.
+ **mlflow_tracking_uri**: The ARN of the MLFlow app to use for MLFlow logging.
+ **mlflow_experiment_name**: The MLFlow experiment name.
+ **mlflow_run_name**: The MLFlow run name.

**Training Configuration**  

+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 8192 tokens for CPT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the max_length parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward or backward pass across all devices and workers.

  This value multiplies the per-device batch size and number of devices. It affects the stability of training and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.

**Trainer Settings**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` samples.

**Model Settings**  

+ **hidden_dropout**: The probability of dropping hidden state outputs. Increase this value (by approximately 0.0-0.2) to reduce overfitting on smaller datasets. Valid values are between 0 and 1, inclusive.
+ **attention_dropout**: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0 and 1, inclusive.

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0 and 1, inclusive.
+ **name**: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
+ **weight_decay**: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
+ **warmup_steps**: The number of steps over which the learning rate is gradually increased. This improves training stability. Valid values are between 1 and 20, inclusive.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0 and 1, inclusive, but must be less than the learning rate.

# Supervised fine-tuning (SFT)
Supervised fine-tuning (SFT)

The SFT training process consists of two main stages:
+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. This configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (epochs, learning rate, batch size)
  + Optional components (distributed training parameters, etc)

## Nova Model Comparison and Selection


Amazon Nova 2.0 is a model trained on a larger and more diverse dataset than Amazon Nova 1.0. Key improvements include:
+ **Enhanced reasoning abilities** with explicit reasoning mode support
+ **Broader multilingual performance** across additional languages
+ **Improved performance on complex tasks** including coding and tool use
+ **Extended context handling** with better accuracy and stability at longer context lengths

## When to Use Nova 1.0 vs. Nova 2.0


Choose Amazon Nova 2.0 when:
+ Superior performance with advanced reasoning capabilities is needed
+ Multilingual support or complex task handling is required
+ Better results on coding, tool calling, or analytical tasks are needed

# SFT on Nova 2.0


Amazon Nova Lite 2.0 brings enhanced capabilities for supervised fine-tuning, including advanced reasoning mode, improved multimodal understanding, and extended context handling. SFT on Nova 2.0 enables you to adapt these powerful capabilities to your specific use cases while maintaining the model's superior performance on complex tasks.

Key features of SFT on Nova 2.0 include:
+ **Reasoning mode support**: Train models to generate explicit reasoning traces before final answers for enhanced analytical capabilities.
+ **Advanced multimodal training**: Fine-tune on document understanding (PDF), video understanding, and image-based tasks with improved accuracy.
+ **Tool calling capabilities**: Train models to effectively use external tools and function calling for complex workflows.
+ **Extended context support**: Leverage longer context windows with better stability and accuracy for document-intensive applications.

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

**Topics**
+ [Reasoning Mode Selection (Nova 2.0 Only)](#nova-sft-2-reasoning-mode)
+ [Tool calling data format](#nova-sft-2-tool-calling)
+ [Document understanding data format](#nova-sft-2-document-understanding)
+ [Video Understanding for SFT](#nova-sft-2-video-understanding)
+ [Data Upload Instructions](#nova-sft-2-data-upload)
+ [Creating a Fine-Tuning Job](#nova-sft-2-creating-job)
+ [SFT Tuning Parameters](#nova-sft-2-tuning-parameters)
+ [Hyperparameter Guidance](#nova-sft-2-hyperparameters)

## Sample SFT recipe


Below is a sample recipe for SFT. You can find this recipe and others in the [recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova) repository.

```
run:
  name: my-full-rank-sft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://my-bucket-name/train.jsonl  # Training data path; not compatible with standard SageMaker Training Jobs
  replicas: 4                                     # Number of compute instances for training, allowed values are 4, 8, 16, 32
  output_s3_path: s3://my-bucket-name/outputs/    # Output artifact path (HyperPod job-specific; not compatible with standard SageMaker Training Jobs)
  mlflow_tracking_uri: ""                         # Required for MLFlow
  mlflow_experiment_name: "my-full-rank-sft-experiment"  # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-full-rank-sft-run"         # Optional for MLFlow. Note: leave this field non-empty

training_config:
  max_steps: 100                    # Maximum training steps. Minimum is 4.
  save_steps: ${oc.select:training_config.max_steps}  # Number of training steps between checkpoint saves
  save_top_k: 5                     # Keep top K best checkpoints. Minimum is 1.
  max_length: 32768                 # Sequence length (options: 8192, 16384, 32768 [default], 65536)
  global_batch_size: 32             # Global batch size (options: 32, 64, 128)
  reasoning_enabled: true           # If data has reasoningContent, set to true; otherwise False

  lr_scheduler:
    warmup_steps: 15                # Learning rate warmup steps. Recommend 15% of max_steps
    min_lr: 1e-6                    # Minimum learning rate, must be between 0.0 and 1.0

  optim_config:                     # Optimizer settings
    lr: 1e-5                        # Learning rate, must be between 0.0 and 1.0
    weight_decay: 0.0               # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                  # Exponential decay rate for first-moment estimates
    adam_beta2: 0.95                 # Exponential decay rate for second-moment estimates

  peft:                             # Parameter-efficient fine-tuning (LoRA)
    peft_scheme: "null"             # Disable LoRA for PEFT
```

## Reasoning Mode Selection (Nova 2.0 Only)


Amazon Nova 2.0 supports reasoning mode for enhanced analytical capabilities:
+ **Reasoning Mode (enabled)**:
  + Set `reasoning_enabled: true` in the training configuration
  + Model trains to generate reasoning traces before final answers
  + Improves performance on complex reasoning tasks
+ **Non-Reasoning Mode (disabled)**:
  + Set `reasoning_enabled: false` or omit the parameter (default)
  + Standard SFT without explicit reasoning
  + Suitable for tasks that don't benefit from step-by-step reasoning

**Note**  
When reasoning is enabled, it operates at high reasoning effort. There is no low reasoning option for SFT.
Multimodal reasoning content is not supported for SFT. Reasoning mode applies to text-only inputs.

### Using reasoning mode with non-reasoning datasets


Training Amazon Nova on a non-reasoning dataset with `reasoning_enabled: true` is permitted. However, doing so may cause the model to lose its reasoning capabilities, as Amazon Nova primarily learns to generate the responses presented in the data without applying reasoning.

If you train Amazon Nova on a non-reasoning dataset but still want to use reasoning during inference:

1. Disable reasoning during training (`reasoning_enabled: false`)

1. Enable reasoning later during inference

While this approach allows reasoning at inference time, it does not guarantee improved performance compared to inference without reasoning.

**Best practice:** Enable reasoning for both training and inference when using reasoning datasets, and disable it for both when using non-reasoning datasets.

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Tool calling data format


SFT supports training models to use tools (function calling). Below is a sample input format for tool calling:

**Sample input:**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are an expert in composing function calls."
    }
  ],
  "toolConfig": {
    "tools": [
      {
        "toolSpec": {
          "name": "getItemCost",
          "description": "Retrieve the cost of an item from the catalog",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "item_name": {
                  "type": "string",
                  "description": "The name of the item to retrieve cost for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to retrieve cost for"
                }
              },
              "required": [
                "item_id"
              ]
            }
          }
        }
      },
      {
        "toolSpec": {
          "name": "getItemAvailability",
          "description": "Retrieve whether an item is available in a given location",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "zipcode": {
                  "type": "string",
                  "description": "The zipcode of the location to check in"
                },
                "quantity": {
                  "type": "integer",
                  "description": "The number of items to check availability for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to check availability for"
                }
              },
              "required": [
                "item_id", "zipcode"
              ]
            }
          }
        }
      }
    ]
  },
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "I need to check whether there are twenty pieces of the following item available. Here is the item ASIN on Amazon: id-123. Please check for the zipcode 94086"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "reasoningContent": {
            "reasoningText": {
              "text": "The user wants to check how many pieces of the item with ASIN id-123 are available in the zipcode 94086"
            }
          }
        },
        {
          "toolUse": {
            "toolUseId": "getItemAvailability_0",
            "name": "getItemAvailability",
            "input": {
              "zipcode": "94086",
              "quantity": 20,
              "item_id": "id-123"
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "toolResult": {
            "toolUseId": "getItemAvailability_0",
            "content": [
              {
                "text": "[{\"name\": \"getItemAvailability\", \"results\": {\"availability\": true}}]"
              }
            ]
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Yes, there are twenty pieces of item id-123 available at 94086. Would you like to place an order or know the total cost?"
        }
      ]
    }
  ]
}
```

Important considerations for tool calling data:
+ ToolUse must appear in assistant turns only
+ ToolResult must appear in user turns only
+ ToolResult should be text or JSON only; other modalities are not currently supported for Amazon Nova models
+ The inputSchema within the toolSpec must be a valid JSON Schema object
+ Each ToolResult must reference a valid toolUseId from a preceding assistant ToolUse, with each toolUseId used exactly once per conversation
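The placement rules above can be checked mechanically before training. The following is an illustrative validator (the function name and error messages are assumptions) for a `bedrock-conversation-2024` sample:

```python
def check_tool_turns(sample):
    """Validate toolUse/toolResult placement in a conversation sample:
    toolUse only in assistant turns, toolResult only in user turns, and
    each toolResult must reference a previously seen, unused toolUseId."""
    problems, open_ids = [], set()
    for i, msg in enumerate(sample.get("messages", [])):
        for block in msg.get("content", []):
            if "toolUse" in block:
                if msg["role"] != "assistant":
                    problems.append(f"message {i}: toolUse outside assistant turn")
                open_ids.add(block["toolUse"]["toolUseId"])
            if "toolResult" in block:
                if msg["role"] != "user":
                    problems.append(f"message {i}: toolResult outside user turn")
                tid = block["toolResult"]["toolUseId"]
                if tid not in open_ids:
                    problems.append(
                        f"message {i}: toolResult references unknown or reused toolUseId {tid}")
                open_ids.discard(tid)  # each toolUseId may be answered once
    return problems
```

Running this over every line of the training JSONL before submitting a job catches malformed conversations early, rather than partway through training.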

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Document understanding data format


SFT supports training models on document understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "document": {
            "format": "pdf",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/documents/customer_service_debugging.pdf",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the document to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for document understanding:
+ Only PDF files are supported
+ Maximum document size is 10 MB
+ A sample can contain documents and text, but cannot mix documents with other modalities (such as images or video)

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Video Understanding for SFT


SFT supports fine-tuning models for video understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "video": {
            "format": "mp4",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/videos/customer_service_debugging.mp4",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the video to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for video understanding:
+ Videos can be a maximum of 50 MB
+ Videos can be up to 15 minutes long
+ Only one video is allowed per sample; multiple videos in the same sample are not supported
+ A sample can contain video and text, but cannot mix video with other modalities (such as images or documents)

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Data Upload Instructions


Upload training and validation datasets to an S3 bucket. Specify these locations in the recipe's `run` block:

```
## Run config
run:
  ...
  data_s3_path: "s3://<bucket-name>/<training-directory>/<training-file>.jsonl"
```

**Note**: Replace `<bucket-name>`, `<training-directory>`, and `<training-file>` with your actual S3 path.

**Note**: Validation datasets are not currently supported for SFT with Amazon Nova 2.0. If a validation dataset is provided, it will be ignored.
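Before uploading, it can help to sanity-check each line of the JSONL file against the conversation schema shown in the samples earlier in this guide. The following Python sketch is illustrative only — the checked keys are inferred from the sample inputs above, not from an official validator:

```python
import json

def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if OK)."""
    problems = []
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if sample.get("schemaVersion") != "bedrock-conversation-2024":
        problems.append("missing or unexpected schemaVersion")
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty array")
    else:
        for i, msg in enumerate(messages):
            if msg.get("role") not in ("system", "user", "assistant"):
                problems.append(f"message {i}: unknown role")
            if not isinstance(msg.get("content"), list):
                problems.append(f"message {i}: content must be an array")
    return problems

def validate_sft_file(path: str) -> dict:
    """Validate every non-blank line; return {line_number: problems}."""
    bad = {}
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            if line.strip():
                issues = validate_sft_line(line)
                if issues:
                    bad[n] = issues
    return bad
```

Running the file-level check before upload catches malformed lines early, before the training job fails at data-loading time.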

## Creating a Fine-Tuning Job


Define the base model using the `model_type` and `model_name_or_path` fields in the `run` block:

```
## Run config
run:
  ...
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  ...
```

## SFT Tuning Parameters


The parameters that are available for tuning with SFT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod` or the S3 path of a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model that you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and Region.
+ **mlflow_tracking_uri**: The ARN of the MLflow App to use for MLflow logging.
+ **mlflow_experiment_name**: The MLflow experiment name.
+ **mlflow_run_name**: The MLflow run name.

**Training Configuration**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` samples.
+ **save_steps**: The frequency (in steps) at which to save model checkpoints during training.
+ **save_top_k**: The maximum number of best checkpoints to retain based on validation metrics.
+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32768 tokens for SFT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward and backward pass across all devices and workers.

  This value is the product of the per-device batch size and the number of devices. It affects training stability and throughput. We recommend that you start with a batch size that fits comfortably within memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.
+ **reasoning_enabled**: Boolean flag that enables reasoning capabilities during training.

**Learning Rate Scheduler**  

+ **warmup_steps**: The number of steps over which to gradually increase the learning rate. This improves training stability.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0 and 1, inclusive, and must be less than the learning rate (`lr`).

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0 and 1, inclusive.
+ **weight_decay**: The L2 regularization strength. Higher values (0.01 to 0.1) increase regularization.
+ **adam_beta1**: The exponential decay rate for the first moment estimates in the Adam optimizer. The default is 0.9.
+ **adam_beta2**: The exponential decay rate for the second moment estimates in the Adam optimizer. The default is 0.95.

**PEFT Configuration**  

+ **peft_scheme**: The parameter-efficient fine-tuning scheme to use. Options are `'null'` for full-rank fine-tuning or `lora` for LoRA-based fine-tuning.

**LoRA Tuning (when peft_scheme is 'lora')**  

+ **alpha**: The LoRA scaling parameter. Controls the magnitude of the low-rank adaptation. Typical values range from 8 to 128.
+ **lora_plus_lr_ratio**: The learning rate ratio for LoRA+ optimization. This multiplier adjusts the learning rate specifically for LoRA parameters.

## Hyperparameter Guidance


Use the following recommended hyperparameters based on the training approach:

**Full Rank Training**
+ **Epochs**: 1
+ **Learning rate (lr)**: 1e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**LoRA (Low-Rank Adaptation)**
+ **Epochs**: 2
+ **Learning rate (lr)**: 5e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**Note**: Adjust these values based on dataset size and validation performance. Monitor training metrics to prevent overfitting.

# Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod
Reinforcement Fine-Tuning (RFT)

Reinforcement Fine-Tuning (RFT) is a machine learning technique that improves model performance through feedback signals—measurable scores or rewards indicating response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards.

This approach is particularly effective for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications requiring nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that can be programmatically evaluated.

**When to use RFT**  
Use RFT when you can define clear, measurable success criteria but struggle to provide exact correct outputs for training. It's ideal for tasks where quality is subjective or multifaceted—such as creative writing, code optimization, or complex reasoning—where multiple valid solutions exist but some are clearly better than others.

RFT works best when you have the following:
+ A reliable reward function that can evaluate model outputs programmatically
+ Need to align model behavior with specific preferences or constraints
+ Situations where traditional supervised fine-tuning falls short because collecting high-quality labeled examples is expensive or impractical

Consider RFT for applications requiring iterative improvement, personalization, or adherence to complex business rules that can be encoded as reward signals.

**What RFT is best suited for**  
RFT excels in domains where output quality can be objectively measured but optimal responses are difficult to define upfront:
+ **Mathematical problem-solving**: Verifiable correctness with multiple solution paths
+ **Code generation and optimization**: Testable execution results and performance metrics
+ **Scientific reasoning tasks**: Logical consistency and factual accuracy
+ **Structured data analysis**: Programmatically verifiable outputs
+ **Multi-step reasoning**: Tasks requiring step-by-step logical progression
+ **Tool usage and API calls**: Success measurable by execution results
+ **Complex workflows**: Adherence to specific constraints and business rules

RFT works exceptionally well when you need to balance multiple competing objectives like accuracy, efficiency, and style.

**When to use reasoning mode for RFT training**  
Amazon Nova 2.0 supports reasoning mode during RFT training. The following modes are available:
+ **none**: No reasoning (omit the `reasoning_effort` field)
+ **low**: Minimal reasoning overhead
+ **high**: Maximum reasoning capability (default when `reasoning_effort` is specified)

**Note**  
There is no medium option for RFT. If the `reasoning_effort` field is absent from your configuration, reasoning is disabled.

Use high reasoning for the following:
+ Complex analytical tasks
+ Mathematical problem-solving
+ Multi-step logical deduction
+ Tasks where step-by-step thinking adds value

Use none (omit `reasoning_effort`) or low reasoning for the following:
+ Simple factual queries
+ Direct classifications
+ Speed and cost optimization
+ Straightforward question-answering

**Important**  
Higher reasoning modes increase training time, training cost, inference latency, and inference cost, but they also increase model capability for complex reasoning tasks.

**Supported models**  
RFT on SageMaker HyperPod supports Amazon Nova Lite 2.0 (`amazon.nova-2-lite-v1:0:256k`).

**Major steps**  
The RFT process involves four key phases:
+ **Implementing an evaluator**: Create a reward function to programmatically score model responses based on your quality criteria.
+ **Uploading prompts**: Prepare and upload training data in the specified conversational format with reference data for evaluation.
+ **Starting a job**: Launch the reinforcement fine-tuning process with your configured parameters.
+ **Monitoring**: Track training progress through metrics dashboards to ensure the model learns effectively.

Each step builds on the previous one, with the evaluator serving as the foundation that guides the entire training process by providing consistent feedback signals.
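As an illustration of the evaluator phase, the following Python sketch shows one possible shape for a Lambda-backed reward function. The event and response field names (`model_response`, `reference_answer`, `reward`) are assumptions made for illustration; the actual contract is defined by the RFT service, so adapt the handler to the documented interface:

```python
def score(model_response: str, reference_answer: dict) -> float:
    """Return a reward in [0, 1]: 1.0 if the reference solution appears in the response."""
    expected = str(reference_answer.get("solution", "")).strip().lower()
    return 1.0 if expected and expected in model_response.strip().lower() else 0.0

def lambda_handler(event, context):
    # Hypothetical event shape: {"model_response": ..., "reference_answer": ...}
    try:
        reward = score(event["model_response"], event["reference_answer"])
    except Exception:
        # Never let an unhandled exception fail the training rollout.
        reward = 0.0
    return {"reward": reward}
```

A real evaluator would typically use task-specific logic (numeric comparison, test execution, rubric scoring) in place of the substring check, but the overall shape — parse the event, score, return a numeric reward, never raise — stays the same.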

**Topics**
+ [RFT on Nova 2.0](nova-hp-rft-nova2.md)

# RFT on Nova 2.0


RFT training data follows the OpenAI conversational format. Each training example is a JSON object containing messages, reference answers, and optional tool definitions. This section provides guidance on preparing effective training data for RFT on Nova 2.0.

**Topics**
+ [Data format and structure](#nova-hp-rft-data-format)
+ [Field descriptions](#nova-hp-rft-field-descriptions)
+ [Hyperparameter guidance](#nova-hp-rft-monitoring-hyperparams)
+ [Additional properties](#nova-hp-rft-additional-properties)
+ [Dataset size recommendations](#nova-hp-rft-dataset-size)
+ [Characteristics of effective training data](#nova-hp-rft-effective-data)
+ [Monitoring RFT training](nova-hp-rft-monitoring.md)

## Data format and structure


Each training example is a JSON object containing the following:
+ **messages**: An array of conversational turns using system, user, and optionally assistant roles
+ **reference_answer**: Expected output or evaluation criteria for reward calculation
+ **tools** (optional): Array of function definitions available to the model
+ **id** (optional): Unique identifier for tracking and deduplication

Each example should be on a single line in your JSONL file, with one JSON object per line.

### Example 1: Chemistry problem


The following example shows a chemistry problem with reference answer containing ground truth values:

```
{  
  "id": "chem-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful chemistry assistant"  
    },  
    {  
      "role": "user",  
      "content": "Predict hydrogen bond donors and acceptors for this SMILES: CCN(CC)CCC(=O)c1sc(N)nc1C"  
    }  
  ],  
  "reference_answer": {  
    "donor_bond_counts": 2,  
    "acceptor_bond_counts": 4,  
    "explanation": "Calculated using Lipinski's rule of five: N-H groups (2 donors), N and O atoms with lone pairs (4 acceptors)"  
  }  
}
```

**Note**  
The `reference_answer` contains ground truth values calculated using domain-specific rules. Your reward function compares the model's predicted values against these reference values to calculate a reward score.

### Example 2: Math problem


The following example shows a math problem with solution steps:

```
{  
  "id": "math-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  }  
}
```

### Example 3: Tool usage


The following example shows tool usage with expected behavior:

```
{  
  "id": "tool-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful game master assistant"  
    },  
    {  
      "role": "user",  
      "content": "Generate a strength stat for a warrior character. Apply a +2 racial bonus modifier."  
    }  
  ],  
  "tools": [  
    {  
      "type": "function",  
      "function": {  
        "name": "StatRollAPI",  
        "description": "Generates character stats by rolling 4d6, dropping the lowest die result, and applying a modifier.",  
        "parameters": {  
          "type": "object",  
          "properties": {  
            "modifier": {  
              "description": "An integer representing the modifier to apply to the total of the stat roll.",  
              "type": "integer"  
            }  
          },  
          "required": ["modifier"]  
        }  
      }  
    }  
  ],  
  "reference_answer": {  
    "tool_called": "StatRollAPI",  
    "tool_parameters": {  
      "modifier": 2  
    },  
    "expected_behavior": "Call StatRollAPI with modifier=2 and return the calculated stat value"  
  }  
}
```

## Field descriptions



| Field | Description | Additional notes | Required | 
| --- |--- |--- |--- |
| id | Unique identifier for this RFT example | String (for example, "sample-001"). Useful for tracking and deduplication. | No | 
| messages | Ordered list of chat messages that define the prompt and context | Array of objects. Model sees them in order. Typically starts with a system message, then user. | Yes | 
| messages[].role | Who is speaking in the message | Common values: "system", "user" (sometimes "assistant" in other contexts) | No | 
| messages[].content | The text content of the message | Plain string. For system it's instructions, for user it's the task or input. | No | 
| tools | Tool specifications available to the model during this example | Array. Each item defines a tool's interface and metadata. Types may include "function" or "internal". | No | 
| reference_answer | The expected model output for this example | String or object depending on the task. Used as the target for evaluation or training. | No | 

**Note**  
Any additional custom fields (for example, `task_id`, `difficulty_level`, `context_data`) are not validated and are passed to your reward function as metadata.

## Hyperparameter guidance


Use the following recommended hyperparameters based on your training approach:

**General:**
+ Epochs: 1
+ Learning rate (lr): 1e-7
+ Number of generations: 8
+ Max new tokens: 8192
+ Batch size: 256

**LoRA (Low-Rank Adaptation):**
+ LoRA Rank: 32

**Note**  
Adjust these values based on your dataset size and validation performance. Monitor training metrics to prevent overfitting.

## Additional properties


The `"additionalProperties": true` setting allows you to include custom fields beyond the core schema requirements, providing flexibility to add any data that your reward function needs for proper evaluation.

### Common additional fields


You can include the following types of additional fields:

**Metadata:**
+ `task_id`: Unique identifier for tracking
+ `difficulty_level`: Problem complexity indicator
+ `domain`: Subject area or category
+ `expected_reasoning_steps`: Number of steps in the solution

**Evaluation criteria:**
+ `evaluation_criteria`: Specific grading rubrics
+ `custom_scoring_weights`: Relative importance of different aspects
+ `context_data`: Background information for the problem
+ `external_references`: Links to relevant documentation or resources
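Because these extra fields reach your reward function as metadata, you can use them to shape scoring. The following Python sketch is hypothetical — the difficulty weights and field name are illustrative choices, not a service-defined scheme:

```python
# Illustrative weights keyed on a custom "difficulty_level" metadata field.
DIFFICULTY_WEIGHTS = {"easy": 0.5, "medium": 1.0, "hard": 1.5}

def weighted_reward(base_reward: float, sample: dict) -> float:
    """Scale a base reward by the sample's difficulty, clamped to [0, 1]."""
    weight = DIFFICULTY_WEIGHTS.get(sample.get("difficulty_level"), 1.0)
    return min(1.0, base_reward * weight)
```

For example, `weighted_reward(0.8, {"difficulty_level": "hard"})` boosts the score for harder samples (clamped at 1.0), while unknown or missing difficulty levels fall back to an unweighted reward.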

### Example with additional properties


The following example includes custom metadata fields:

```
{  
  "id": "algebra_001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  },  
  "task_id": "algebra_001",  
  "difficulty_level": "easy",  
  "domain": "algebra",  
  "expected_reasoning_steps": 3  
}
```

## Dataset size recommendations


### Starting point


Begin with the following minimum dataset sizes:
+ Minimum 100 training examples
+ Minimum 100 evaluation examples

Prioritize high-quality input data and a reliable reward function that executes consistently on model responses.

### Evaluation-first approach


Before investing in large-scale RFT training, evaluate your model's baseline performance:
+ **High performance (greater than 95% reward)**: RFT may be unnecessary—your model already performs well
+ **Very poor performance (0% reward)**: Switch to SFT first to establish basic capabilities
+ **Moderate performance**: RFT is likely appropriate

This evaluation-first approach ensures your reward function is bug-free and determines if RFT is the right method for your use case. Starting small allows you to get comfortable with the RFT workflow, identify and fix issues early, validate your approach before scaling up, and test reward function reliability. Once validated, you can expand to larger datasets to further improve performance.
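The baseline gate described above can be expressed as a small helper that maps the base model's average reward over the evaluation set to a recommendation. This is a sketch of the guidance in this section; the function name and return strings are illustrative:

```python
def recommend_method(rewards: list[float]) -> str:
    """Map baseline evaluation rewards to a customization recommendation."""
    if not rewards:
        raise ValueError("no evaluation rewards provided")
    avg = sum(rewards) / len(rewards)
    if avg > 0.95:
        return "no fine-tuning needed"  # model already performs well
    if avg == 0.0:
        return "SFT"  # establish basic capabilities first
    return "RFT"
```

Running this over at least 100 evaluation examples (per the minimums above) also exercises your reward function end to end before any training spend.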

## Characteristics of effective training data


### Clarity and consistency


Good RFT examples require clear, unambiguous input data that enables accurate reward calculation across different model outputs. Avoid noise in your data, including:
+ Inconsistent formatting
+ Contradictory labels or instructions
+ Ambiguous prompts
+ Conflicting reference answers

Any ambiguity will mislead the training process and cause the model to learn unintended behaviors.

### Diversity


Your dataset should capture the full diversity of production use cases to ensure robust real-world performance. Include:
+ Various problem types and difficulty levels
+ Different input formats and edge cases
+ Representative samples from all expected scenarios

This diversity helps prevent overfitting and ensures the model handles unfamiliar inputs gracefully.

### Reward function considerations


Design your reward function for efficient training:
+ Execute within seconds (not minutes)
+ Parallelize effectively with Lambda
+ Return consistent, reliable scores
+ Handle different types of model outputs gracefully

Fast, scalable reward functions enable rapid iteration and cost-effective experimentation at scale.

# Monitoring RFT training


Monitor key metrics during training to ensure effective learning and identify potential issues early.

**Topics**
+ [Key metrics to track](#nova-hp-rft-monitoring-metrics)
+ [Evaluation after RFT](#nova-hp-rft-monitoring-evaluation)
+ [Using fine-tuned models](#nova-hp-rft-monitoring-checkpoints)
+ [Limitations and best practices](#nova-hp-rft-monitoring-limitations)
+ [Troubleshooting](#nova-hp-rft-monitoring-troubleshooting)

## Key metrics to track


Monitor the following metrics using MLflow during training:

**Reward metrics:**
+ **Average reward score**: Overall quality of model responses (should increase over time)
+ **Reward distribution**: Percentage of responses receiving high, medium, and low rewards
+ **Training vs. validation rewards**: Compare to detect overfitting

**Training metrics:**
+ **Policy updates**: Number of successful weight updates
+ **Rollout completion rate**: Percentage of samples successfully evaluated

**Concerning patterns:**
+ Rewards plateauing (indicates poor learning)
+ Validation rewards dropping while training rewards increase (overfitting)
+ Reward variance increasing significantly over time (instability)
+ High percentage of reward function errors (implementation issues)

**When to stop training:**
+ Target performance metrics are achieved
+ Rewards plateau and no longer improve
+ Validation performance degrades (overfitting detected)
+ Maximum training budget is reached

## Evaluation after RFT


After training completes, evaluate your fine-tuned model to assess performance improvements:
+ **Run RFT evaluation job**: Use the checkpoint from your RFT training as the model
+ **Compare to baseline**: Evaluate both base model and fine-tuned model on the same test set
+ **Analyze metrics**: Review task-specific metrics (accuracy, reward scores, etc.)
+ **Conduct qualitative review**: Manually inspect sample outputs for quality

For detailed evaluation procedures, see the Evaluation section.

## Using fine-tuned models


**Accessing checkpoints:**

After training completes, locate your checkpoint:

1. Navigate to your `output_path` in S3

1. Download and extract `output.tar.gz`

1. Open `manifest.json`

1. Copy the `checkpoint_s3_bucket` value
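Steps 2 through 4 can be scripted once the archive is downloaded from your `output_path`. The following Python sketch assumes `manifest.json` sits at the top level of `output.tar.gz`; adjust the path if your archive layout differs:

```python
import json
import tarfile

def checkpoint_path_from_archive(archive_path: str, extract_dir: str) -> str:
    """Extract output.tar.gz and return the checkpoint_s3_bucket value from manifest.json."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    with open(f"{extract_dir}/manifest.json") as f:
        manifest = json.load(f)
    return manifest["checkpoint_s3_bucket"]
```

The returned S3 path is what you paste into `model_name_or_path` for inference or further training.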

**Deploying for inference:**

Use the checkpoint S3 path for inference or further training:

```
run:
    model_type: amazon.nova-2-lite-v1:0:256k
    model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

For deployment and inference instructions, refer to the Inference section.

## Limitations and best practices


**Current limitations:**

**Beta restrictions:**
+ You must create a new RIG group for RFT. This limitation will be resolved by GA.
+ Instance type requirements: Only P5 instances are supported (minimum 8x p5.48xlarge). Support for smaller instance types is coming soon (ETA: mid-January 2025).

**Functional limitations:**
+ 15-minute Lambda timeout: Reward functions must complete within 15 minutes
+ Single-turn only: Multi-turn conversations not supported
+ Validation datasets: Not supported during training. Use separate evaluation jobs to assess training progress.

**Training considerations:**
+ Low reward scenarios: RFT may struggle when less than 5% of examples receive positive rewards; consider SFT first
+ Data requirements: Needs sufficient diversity to learn effectively
+ Computational cost: More expensive than supervised fine-tuning

**Nova Forge removes some of these limitations:**
+ Supports multi-turn conversations
+ Allows reward functions exceeding 15-minute timeouts
+ Provides advanced algorithms and tuning options
+ Designed for complex enterprise use cases, specifically tuned to build frontier models

**Best practices:**

**Start small and scale:**
+ Begin with minimal datasets (100-200 examples) and few training epochs
+ Validate your approach before scaling up
+ Gradually increase dataset size and training steps based on results

**Baseline with SFT first:**
+ If reward scores are consistently low (e.g., always 0), perform SFT before RFT
+ RFT requires reasonable baseline performance to improve effectively

**Design efficient reward functions:**
+ Execute in seconds, not minutes
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement proper error handling
+ Test thoroughly before training
+ Leverage Lambda's parallel scaling capabilities

**Monitor training actively:**
+ Track average reward scores over time
+ Watch reward distribution across samples
+ Compare training vs. validation rewards
+ Look for concerning patterns (plateaus, overfitting, instability)

**Iterate based on results:**
+ If rewards don't improve after several iterations, adjust reward function design
+ Increase dataset diversity to provide clearer learning signals
+ Consider switching to SFT if rewards remain near zero
+ Experiment with different hyperparameters (learning rate, batch size)

**Optimize data quality:**
+ Ensure diverse, representative examples
+ Include edge cases and difficult samples
+ Verify reward function correctly scores all example types
+ Remove or fix samples that confuse the reward function

## Troubleshooting


**Reward function errors:**

Symptoms: High error rate in reward function calls during training


| Issue | Symptoms | Resolution | 
| --- |--- |--- |
| Lambda timeout | Frequent timeouts after 15 minutes | Optimize function performance; consider Nova Forge for complex evaluations | 
| Insufficient concurrency | Lambda throttling errors | Increase `lambda_concurrency_limit` or request a quota increase | 
| Invalid return format | Training fails with format errors | Verify return structure matches required interface format | 
| Unhandled exceptions | Intermittent errors | Add comprehensive error handling and logging | 
| External API failures | Inconsistent scoring | Implement retry logic and fallback strategies | 
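The retry-with-fallback resolution for external API failures can be sketched as a small wrapper inside the reward function. The backoff schedule and fallback reward below are illustrative choices, not service requirements:

```python
import time

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5,
                    fallback: float = 0.0) -> float:
    """Call fn(); retry with exponential backoff on failure, else return fallback."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return fallback
```

Keep total retry time well under the 15-minute Lambda timeout, and log which samples fell back so you can spot systematic scoring gaps.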

**Poor training performance:**

Symptoms: Rewards not improving or plateauing at low values

Resolutions:
+ **Verify reward function correctness**: Test with known good/bad examples
+ **Check baseline performance**: Evaluate base model; if near-zero accuracy, do SFT first
+ **Increase data diversity**: Add more varied examples covering different scenarios
+ **Adjust hyperparameters**: Try different learning rates or batch sizes
+ **Review reward signal quality**: Ensure rewards differentiate between good and bad responses

**Overfitting:**

Symptoms: Training rewards increase while validation rewards decrease

Resolutions:
+ **Reduce training steps**: Stop training earlier
+ **Increase dataset size**: Add more training examples
+ **Add regularization**: Adjust `weight_decay` or `entropy_coeff`
+ **Increase data diversity**: Ensure training set represents full distribution

# Evaluating your trained model
Evaluation

An evaluation recipe is a YAML configuration file that defines how your Amazon Nova model evaluation job is executed. With this recipe, you can assess the performance of a base or trained model against common benchmarks or your own custom datasets. Metrics can be stored in Amazon S3 or TensorBoard. The evaluation provides quantitative metrics that help you assess model performance across various tasks to determine if further customization is needed.

Model evaluation is an offline process, where models are tested against fixed benchmarks with predefined answers. They are not assessed in real time or against live user interactions. For real-time evaluations, you can evaluate the model after it is deployed to Amazon Bedrock by calling the Amazon Bedrock runtime APIs.

**Important**  
The evaluation container only supports checkpoints produced by the same training platform. Checkpoints created with SageMaker HyperPod can only be evaluated using the SageMaker HyperPod evaluation workflow, and checkpoints created with SageMaker training jobs can only be evaluated using the SageMaker training jobs evaluation workflow. Attempting to evaluate a checkpoint from a different platform will result in failure.

**Topics**
+ [Available benchmark tasks](customize-fine-tune-evaluate-available-tasks.md)
+ [Understanding the recipe parameters](customize-fine-tune-evaluate-understand-modify.md)
+ [Evaluation recipe examples](customize-fine-tune-evaluate-recipe-examples.md)
+ [Starting an evaluation job](customize-fine-tune-evaluate-start-job.md)
+ [Accessing and analyzing evaluation results](customize-fine-tune-evaluate-access-results.md)
+ [RFT evaluation](nova-hp-evaluate-rft.md)

# Available benchmark tasks


A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code packages, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

The following industry-standard benchmarks are supported. You can specify them in the `eval_task` parameter:


| Benchmark | Modality | Description | Metrics | Strategy | Subtask Available | 
| --- |--- |--- |--- |--- |--- |
| mmlu | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs_cot | Yes | 
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs_cot | No | 
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| IFEval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 
| gen_qa | Text | Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen_qa | No | 
| llm_judge | Text | LLM-as-a-Judge Preference Comparison – Uses an Amazon Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No | 
| humaneval | Text | HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models. | pass@1 | zs | No | 
| mm_llm_judge | Multi-modal (image) | Behaves the same as the text-based `llm_judge` above, but also supports image inference. | all | judge | No | 
| rubric_llm_judge | Text | Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/), which only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions. | all | judge | No | 
| aime_2024 | Text | AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving. | exact_match | zs_cot | No | 
| calendar_scheduling | Text | Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people. | exact_match | fs | No | 

The following `mmlu` subtasks are available:

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

The following `bbh` subtasks are available:

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

The following `math` subtasks are available:

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```

# Understanding the recipe parameters


**Run configuration**  
The following is a general run configuration and an explanation of the parameters involved.

```
run:
  name: eval_job_name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod
  replicas: 1
  data_s3_path: ""
  output_s3_path: s3://output_path
  mlflow_tracking_uri: ""
  mlflow_experiment_name : ""
  mlflow_run_name : ""
```
+ `name`: (Required) A descriptive name for your evaluation job. This helps identify your job in the AWS console.
+ `model_type`: (Required) Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
  + `amazon.nova-micro-v1:0:128k`
  + `amazon.nova-lite-v1:0:300k`
  + `amazon.nova-pro-v1:0:300k`
  + `amazon.nova-2-lite-v1:0:256k`
+ `model_name_or_path`: (Required) The path to the base model or S3 path for the post-trained checkpoint. Options include:
  + `nova-micro/prod`
  + `nova-lite/prod`
  + `nova-pro/prod`
  + `nova-lite-2/prod`
  + (S3 path for the post-trained checkpoint) `s3://<escrow bucket>/<job id>/outputs/checkpoints`
+ `replicas`: (Required) The number of compute instances to use for distributed training. You must set this value to 1 because multi-node is not supported.
+ `data_s3_path`: (Required) The S3 path to the input dataset. Leave this parameter empty unless you are using the *bring your own dataset* or *LLM as a judge* recipe.
+ `output_s3_path`: (Required) The S3 path to store output evaluation artifacts. Note that the output S3 bucket must be created by the same account that is creating the job.
+ `mlflow_tracking_uri`: (Optional) The MLflow tracking server ARN used to track MLflow runs and experiments. Ensure that your SageMaker AI execution role has permission to access the tracking server.

**Evaluation configuration**  
The following is a model evaluation configuration and an explanation of the parameters involved.

```
evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: mathematics
  metric: accuracy
```
+ `task`: (Required) Specifies the evaluation benchmark or task to use.

  Supported task list:
  + mmlu
  + mmlu_pro
  + bbh
  + gpqa
  + math
  + strong_reject
  + gen_qa
  + ifeval
  + llm_judge
  + humaneval
  + mm_llm_judge
  + rubric_llm_judge
  + aime_2024
  + calendar_scheduling
+ `strategy`: (Required) Defines the evaluation approach:
  + zs_cot: Zero-shot Chain-of-Thought, a prompting approach that encourages large language models to reason step by step without requiring explicit examples.
  + zs: Zero-shot, an approach that solves the task without any prior training examples.
  + gen_qa: A strategy specific to bring your own dataset recipes.
  + judge: A strategy specific to Amazon Nova LLM as a Judge and mm_llm_judge.
+ `subtask`: (Optional) Specifies a subtask for certain evaluation tasks. Remove this field from your recipe if your task does not have any subtasks.
+ `metric`: (Required) The evaluation metric to use.
  + accuracy: Percentage of correct answers
  + exact_match: (For the `math` benchmark) Returns the rate at which the predicted strings exactly match their references.
  + deflection: (For the `strong_reject` benchmark) Returns the relative deflection from the base model and the difference in significance metrics.
  + pass@1: (For the `humaneval` benchmark) Measures the percentage of cases where the model's highest-confidence prediction matches the correct answer.
  + `all`: Returns the following metrics:
    + For the `gen_qa` and bring your own dataset benchmarks, returns the following metrics:
      + `rouge1`: Measures the overlap of unigrams (single words) between generated and reference text.
      + `rouge2`: Measures the overlap of bigrams (two consecutive words) between generated and reference text.
      + `rougeL`: Measures the longest common subsequence between texts, allowing for gaps in the matching.
      + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
      + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
      + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
      + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
      + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.
    + For the `llm_judge` and `mm_llm_judge` (bring your own dataset) benchmarks, returns the following metrics:
      + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
      + `a_scores_stderr`: Standard error of `response_A scores` across pairwise judgements.
      + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
      + `b_scores_stderr`: Standard error of `response_B scores` across pairwise judgements.
      + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
      + `ties_stderr`: Standard error of ties across pairwise judgements.
      + `inference_error`: Count of judgements that could not be properly evaluated.
      + `inference_error_stderr`: Standard error of inference errors across judgements.
      + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
      + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
      + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using the Bradley-Terry probability.
      + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
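
To make the relationship between the pairwise counts and `winrate` concrete, here is a minimal sketch of a two-player Bradley-Terry estimate. It assumes ties are split evenly between the two responses; the service's exact computation (including the bootstrap sampling behind `lower_rate`) may differ.

```python
def bradley_terry_winrate(a_scores: float, b_scores: float, ties: float = 0.0) -> float:
    """Estimate P(response_B is preferred over response_A) from pairwise counts.

    In a two-player Bradley-Terry model, each response's strength is
    proportional to its win count; here ties are split evenly between
    the two responses. Illustrative only, not the service's formula.
    """
    b_wins = b_scores + ties / 2.0
    a_wins = a_scores + ties / 2.0
    total = a_wins + b_wins
    if total == 0:
        return 0.5  # no informative judgements; treat as a coin flip
    return b_wins / total

# 30 wins for response_B, 15 for response_A, 5 ties
winrate = bradley_terry_winrate(a_scores=15, b_scores=30, ties=5)  # 0.65
```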

**Inference configuration**  
The following is an inference configuration and an explanation of the parameters involved. All parameters are optional.

```
inference:
  max_new_tokens: 200
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: The maximum number of tokens to generate. This must be an integer.
+ `top_k`: The number of highest probability tokens to consider. This must be an integer.
+ `top_p`: The cumulative probability threshold for token sampling. This must be a float between 0.0 and 1.0, inclusive.
+ `temperature`: Randomness in token selection. Larger values introduce more randomness. Use 0 to make the results deterministic. This value must be a float with a minimum value of 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are `null` (the default if not set; disables reasoning), `low`, or `high`.

Note that for `humaneval`, we recommend the following inference configuration:

```
inference:
  top_k: 1
  max_new_tokens: 1600
  temperature: 0.0
```

**MLFlow configuration**  
The following is an MLFlow configuration and an explanation of the parameters involved. All parameters are optional.

```
run:
  mlflow_tracking_uri: ""
  mlflow_experiment_name: ""
  mlflow_run_name: ""
```
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SageMaker HyperPod)
+ `mlflow_experiment_name`: (Optional) Name of the experiment to group related ML runs together
+ `mlflow_run_name`: (Optional) Custom name for a specific training run within an experiment

# Evaluation recipe examples


Amazon Nova provides several types of evaluation recipes, which are available in the SageMaker HyperPod recipes GitHub repository.

## General text benchmark recipes


These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format `xxx_general_text_benchmark_eval.yaml`.

## Bring your own dataset benchmark recipes


These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format `xxx_bring_your_own_dataset_eval.yaml`. 

The following are the bring your own dataset requirements:
+ File format requirements
  + You must include a single `gen_qa.jsonl` file containing evaluation examples.
  + Your dataset must be uploaded to an S3 location where SageMaker training job can access it.
  + The file must follow the required schema format for a general Q&A dataset.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `query`: (Required) String containing the question or instruction that needs an answer
  + `response`: (Required) String containing the expected model output
  + `system`: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
  + `metadata`: (Optional) String containing metadata associated with the entry for tagging purposes.

Here is a bring your own data set example entry

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```

To use your custom dataset, modify your evaluation recipe with the following required fields; do not change any of the content:

```
evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Context length limit: For each sample in the dataset, the context length (including the system and query prompts) should be less than 3.5k.
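
Before uploading, you can sanity-check a `gen_qa.jsonl` file against the schema and limits above. The sketch below is illustrative: `check_gen_qa` is a hypothetical helper, and the character count is only a crude stand-in for the 3.5k context limit.

```python
import json

REQUIRED = {"query", "response"}
OPTIONAL = {"system", "metadata"}

def check_gen_qa(path: str, max_context_chars: int = 3500) -> list:
    """Return (line_number, problem) tuples for a gen_qa.jsonl file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            missing = REQUIRED - record.keys()
            if missing:
                problems.append((i, f"missing required fields: {sorted(missing)}"))
            unknown = record.keys() - REQUIRED - OPTIONAL
            if unknown:
                problems.append((i, f"unknown fields: {sorted(unknown)}"))
            # Crude character-based proxy for the context limit (system + query).
            if len(record.get("system", "")) + len(record.get("query", "")) > max_context_chars:
                problems.append((i, "system + query exceeds the context budget"))
    return problems
```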

## Nova LLM as a Judge benchmark recipes


Amazon Nova LLM as a Judge is a model evaluation feature that enables customers to compare the quality of responses from one model to a baseline model response on a custom dataset. It takes in a dataset with prompts, baseline responses, and challenger responses, and uses an Amazon Nova Judge model to provide a winrate metric based on [Bradley-Terry probability](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) with pairwise comparisons.

The recipes are provided in the format `xxx_llm_judge_eval.yaml`. 

The following are the LLM as a Judge requirements:
+ File format requirements
  + Include a single `llm_judge.jsonl` file containing evaluation examples. The file name must be `llm_judge.jsonl`.
  + Your dataset must be uploaded to an S3 location that the [SageMaker HyperPod RIG](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html) can access.
  + The file must follow the required schema format for the `llm_judge.jsonl` dataset.
  + The input dataset should ensure all records are under 12k context length.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `prompt`: (Required) A string containing the prompt for the generated response.
  + `response_A`: (Required) A string containing the baseline response.
  + `response_B`: (Required) A string containing the alternative response to be compared with the baseline response.

Here is an LLM as a judge example entry

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, modify your evaluation recipe with the following required fields; don't change any of the content:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Amazon Nova Judge models are the same across all model family specifications (that is, Lite, Micro, and Pro).
+ Custom judge models are not supported at this time.
+ Context length limit: For each sample in the dataset, the context length (including the system and query prompts) should be less than 7k.
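
A quick pre-flight check for the context limit above can be sketched as follows. This is illustrative: `oversized_judge_records` is a hypothetical helper, and characters are only a rough proxy for context length.

```python
import json

def oversized_judge_records(path: str, max_chars: int = 7000) -> list:
    """Line numbers of llm_judge.jsonl records whose combined text exceeds
    a crude character budget (a stand-in for the documented 7k limit)."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            total = sum(len(record.get(k, "")) for k in ("prompt", "response_A", "response_B"))
            if total > max_chars:
                flagged.append(i)
    return flagged
```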

## Nova LLM as a Judge for multi-modal (image) benchmark recipes


Nova LLM Judge for multi-modal (image), Amazon Nova MM LLM Judge for short, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses an Amazon Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ Single `mm_llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `mm_llm_judge.jsonl`.
+ You must upload your dataset to an S3 location where SageMaker training jobs can access it.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ The input dataset should ensure all records are under 12k context length, excluding the `images` attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `images`: Array containing a list of objects with data attributes (values are Base64-encoded image strings).

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "how many animals in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one aminal and the second has one animal"
}
```
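
One way to produce records in this shape is to Base64-encode each image and assemble the JSON line programmatically. The sketch below is illustrative (`make_mm_record` is a hypothetical helper; adjust the `data:image/jpeg;Base64,` prefix to match your image type):

```python
import base64
import json

def make_mm_record(prompt: str, image_bytes_list, response_a: str, response_b: str) -> str:
    """Build one single-line mm_llm_judge.jsonl record.

    Each image is embedded as a Base64 string with the data-URI style
    prefix used in the example above.
    """
    images = [
        {"data": "data:image/jpeg;Base64," + base64.b64encode(b).decode("ascii")}
        for b in image_bytes_list
    ]
    return json.dumps({
        "prompt": prompt,
        "images": images,
        "response_A": response_a,
        "response_B": response_b,
    })

# Example usage with raw JPEG bytes read from disk:
# with open("cat.jpg", "rb") as f:
#     line = make_mm_record("what is in the image?", [f.read()], "a dog.", "a cat.")
```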

To use your custom dataset, modify your evaluation recipe with the following required fields; don't change any of the content:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models only support image reference.
+ Nova MM Judge models are the same across Amazon Nova Lite specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URI is not supported.
+ The input dataset should ensure all records are under 12k context length, excluding the `images` attribute.

## Rubric Based Judge


Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:
+ **Dynamic criteria generation**: Automatically creates relevant evaluation dimensions based on the input prompt
+ **Weighted scoring**: Assigns importance weights to each criterion to reflect their relative significance
+ **Granular assessment**: Provides detailed scores on a binary (true/false) or scale (1-5) basis for each criterion
+ **Quality metrics**: Calculates continuous quality scores (0-1 scale) that quantify the magnitude of differences between responses

Example criterion generated by the model:

```
price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3
```

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.

**Topics**
+ [Recipe configuration](#nova-hp-evaluate-rubric-judge-recipe)
+ [Input dataset format](#nova-hp-evaluate-rubric-judge-input)
+ [Evaluation output](#nova-hp-evaluate-rubric-judge-output)
+ [Reasoning model support](#nova-hp-evaluate-rubric-judge-reasoning)

### Recipe configuration


**Rubric Judge recipe**  
Enable Rubric Judge by setting `task: rubric_llm_judge` in your recipe:

```
run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Original LLM as a Judge recipe (for comparison)**  
The original judge model uses `task: llm_judge`:

```
run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

### Input dataset format


The input dataset format is identical to the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/):

**Required fields:**
+ `prompt`: String containing the input prompt and instructions
+ `response_A`: String containing the baseline model output
+ `response_B`: String containing the customized model output

**Example dataset (JSONL format):**

```
{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}
```

**Format requirements:**
+ Each entry must be a single-line JSON object
+ Separate entries with newlines
+ Follow the exact field naming as shown in examples

### Evaluation output


**Output structure**  
Rubric Judge produces enhanced evaluation metrics compared to the original judge model:

```
{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
```

**New metrics in Rubric Judge**  
The following six metrics are unique to Rubric Judge and provide granular quality assessment:


| Metric | Description | 
| --- |--- |
| weighted_score_A | Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| weighted_score_A_stderr | Standard error of the mean for weighted_score_A, indicating statistical uncertainty | 
| weighted_score_B | Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| weighted_score_B_stderr | Standard error of the mean for weighted_score_B, indicating statistical uncertainty | 
| score_margin | Difference between weighted scores (calculated as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality | 
| score_margin_stderr | Standard error of the mean for score_margin, indicating uncertainty in the quality difference measurement | 

**Understanding weighted score metrics**  
**Purpose**: Weighted scores provide continuous quality measurements that complement binary preference verdicts, enabling deeper insights into model performance.

**Key differences from original judge**:
+ **Original judge**: Only outputs discrete preferences (A>B, B>A, A=B)
+ **Rubric Judge**: Outputs both preferences AND continuous quality scores (0-1 scale) based on custom criteria

**Interpreting `score_margin`**:
+ `score_margin = -0.128`: `response_B` scored 12.8 percentage points higher than `response_A`
+ `|score_margin| < 0.1`: Narrow quality difference (close decision)
+ `|score_margin| > 0.2`: Clear quality difference (confident decision)

**Use cases**:
+ **Model improvement**: Identify specific areas where your model underperforms
+ **Quality quantification**: Measure the magnitude of performance gaps, not just win/loss ratios
+ **Confidence assessment**: Distinguish between close decisions and clear quality differences

**Important**  
Final verdicts are still based on the judge model's explicit preference labels to preserve holistic reasoning and ensure proper position bias mitigation through forward/backward evaluation. Weighted scores serve as observability tools, not as replacements for the primary verdict.

**Calculation methodology**  
Weighted scores are computed through the following process:
+ **Extract criterion data**: Parse the judge's YAML output to extract criterion scores and weights
+ **Normalize scores**:
  + Scale-type criteria (1-5): Normalize to 0-1 by calculating `(score - 1) / 4`
  + Binary criteria (true/false): Convert to 1.0/0.0
+ **Apply weights**: Multiply each normalized score by its criterion weight
+ **Aggregate**: Sum all weighted scores for each response
+ **Calculate margin**: Compute `score_margin = weighted_score_A - weighted_score_B`

**Example**: If response_A has a weighted sum of 0.65 and response_B has 0.78, the `score_margin` would be -0.13, indicating response_B is 13 percentage points higher in quality across all weighted criteria.
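
The methodology above can be sketched in a few lines. This is an illustrative reconstruction, not the service's implementation; the `criteria` field layout is assumed for the example.

```python
def weighted_scores(criteria: dict) -> tuple:
    """Compute weighted quality scores and score_margin for one judgement.

    `criteria` maps criterion name -> {"type": "scale" | "binary",
    "weight": float, "score_A": ..., "score_B": ...}. Layout is assumed;
    the judge's actual YAML output may differ.
    """
    def normalize(ctype, score):
        if ctype == "scale":          # 1-5 scale -> 0-1 via (score - 1) / 4
            return (score - 1) / 4
        return 1.0 if score else 0.0  # binary true/false -> 1.0 / 0.0

    score_a = sum(c["weight"] * normalize(c["type"], c["score_A"]) for c in criteria.values())
    score_b = sum(c["weight"] * normalize(c["type"], c["score_B"]) for c in criteria.values())
    return score_a, score_b, score_a - score_b

# Example: one scale-type criterion and one binary criterion (hypothetical names).
example = {
    "price_validation": {"type": "scale", "weight": 0.3, "score_A": 3, "score_B": 5},
    "handles_errors":   {"type": "binary", "weight": 0.7, "score_A": True, "score_B": False},
}
a, b, margin = weighted_scores(example)  # a = 0.85, b = 0.30, margin = 0.55
```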

### Reasoning model support


Reasoning model support enables evaluation with reasoning-capable Amazon Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**:
+ amazon.nova-2-lite-v1:0:256k

**Recipe configuration**  
Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker training jobs; optional for SageMaker HyperPod jobs
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod jobs (not compatible with SageMaker training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: zs_cot                                       # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/high, or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Using the reasoning\_effort parameter**  
The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

**Prerequisites**:
+ **Model compatibility**: Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling**: Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`

**Available options**:


| Option | Behavior | Token Limit | Use Case | 
| --- |--- |--- |--- |
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 

**When to enable reasoning**  
**Use reasoning mode (`low`, `medium`, or `high`) for**:
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (omit parameter) for**:
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

**Troubleshooting**  
**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# Starting an evaluation job


The following commands show how to install the dependencies, connect to your SageMaker HyperPod cluster, and submit an evaluation job:

```
# Install Dependencies (Helm - https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

# Install the SageMaker HyperPod CLI
git clone --recurse-submodules https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
git checkout release_v2
pip install .

# Verify the installation
hyperpod --help

# Connect to a SageMaker HyperPod Cluster
hyperpod connect-cluster --cluster-name cluster-name


# Submit the Job using the recipe for eval
# Namespace by default should be kubeflow
# "recipes.run.data_s3_path" is needed only for gen_qa and llm_judge, and must be
# the full S3 path to the input file, including the filename
hyperpod start-job [--namespace namespace] --recipe evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval --override-parameters \
'{
    "instance_type": "p5d.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest",
    "recipes.run.name": "custom-run-name",
    "recipes.run.model_type": "model-type",
    "recipes.run.model_name_or_path": "model-name-or-finetuned-checkpoint-s3-uri",
    "recipes.run.data_s3_path": "input-data-s3-path"
}'

# List jobs
hyperpod list-jobs [--namespace namespace] [--all-namespaces]

# Getting Job details
hyperpod get-job --job-name job-name [--namespace namespace] [--verbose]

# Listing Pods
hyperpod list-pods --job-name job-name --namespace namespace

# Cancel Job
hyperpod cancel-job --job-name job-name [--namespace namespace]
```

You can also view the job status through the Amazon EKS cluster console.

# Accessing and analyzing evaluation results


After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Based on the `output_s3_path` (such as `s3://output_path/`) defined in the recipe, the output structure is the following:

```
job_name/
├── eval-result/
│    └── results_[timestamp].json
│    └── inference_output.jsonl (only present for gen_qa)
│    └── details/
│        └── model/
│            └── execution-date-time/
│                └──details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```

Metrics results are stored in the specified S3 output location `s3://output_path/job_name/eval-result/results_[timestamp].json`.

TensorBoard results are stored in the S3 path `s3://output_path/job_name/tensorboard-results/eval/events.out.tfevents.[timestamp]`.

All inference outputs, except for `llm_judge` and `strong_reject`, are stored in the S3 path `s3://output_path/job_name/eval-result/details/model/execution-date-time/details_task_name_#_datetime.parquet`.

For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
+ prompt - The final prompt submitted to the model
+ inference - The raw inference output from the model
+ gold - The target response from the input dataset
+ metadata - The metadata string from the input dataset if provided
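Each line of `inference_output.jsonl` can be parsed with the standard `json` module. The sample record below is hypothetical and only mirrors the fields listed above; real files come from the `eval-result/` prefix in your output path:

```python
# Parsing gen_qa inference output records line by line.
import json

sample_lines = [
    '{"prompt": "What is 2+2?", "inference": "4", "gold": "4", "metadata": "arithmetic"}',
]

records = [json.loads(line) for line in sample_lines]
for rec in records:
    exact_match = rec["inference"].strip() == rec["gold"].strip()
    print(f'{rec["prompt"]!r} -> {rec["inference"]!r} (exact match: {exact_match})')
```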

To visualize your evaluation metrics in TensorBoard, complete the following steps:

1. Navigate to SageMaker AI TensorBoard.

1. Select **S3 folders**.

1. Add your S3 folder path, for example `s3://output_path/job_name/tensorboard-results/eval`.

1. Wait for synchronization to complete.

The time series, scalars, and text visualizations are available.

We recommend the following best practices:
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

You can find SageMaker HyperPod job error logs in the CloudWatch log group `/aws/sagemaker/Clusters/cluster-id`.

## Log Probability Output Format


When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
"Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
"Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
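Since the values are log probabilities (conventionally natural log), they can be converted back to probabilities with `exp`. A small sketch using the sample entry above:

```python
# Converting token-level log probabilities into probabilities and picking
# the most likely candidate at this position.
import math

top_logprobs = {
    "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
    "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"},
}

for token_key, info in top_logprobs.items():
    prob = math.exp(info["logprob_value"])
    print(f"{info['decoded_value']!r}: logprob={info['logprob_value']}, prob={prob:.4f}")

# The candidate with the highest log probability is the most likely token.
best = max(top_logprobs.values(), key=lambda i: i["logprob_value"])
print("most likely:", best["decoded_value"])
```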

# RFT evaluation


**Note**  
Evaluation via remote reward functions in your own AWS environment is only available if you are an Amazon Nova Forge customer.

**Important**  
The `rl_env` configuration field is used exclusively for evaluation, not for training. During training, you configure reward functions using `reward_lambda_arn` (single-turn) or BYOO infrastructure with `rollout.delegate: true` (multi-turn).

**What is RFT Evaluation?**  
RFT Evaluation allows you to assess your model's performance using custom reward functions before, during, or after reinforcement learning training. Unlike standard evaluations that use pre-defined metrics, RFT Evaluation lets you define your own success criteria through a Lambda function that scores model outputs based on your specific requirements.

**Why Evaluate with RFT?**  
Evaluation is crucial to determine whether the RL fine-tuning process has:
+ Improved model alignment with your specific use case and human values
+ Maintained or improved model capabilities on key tasks
+ Avoided unintended side effects such as reduced factuality, increased verbosity, or degraded performance on other tasks
+ Met your custom success criteria as defined by your reward function

**When to Use RFT Evaluation**  
Use RFT Evaluation in these scenarios:
+ Before RFT Training: Establish baseline metrics on your evaluation dataset
+ During RFT Training: Monitor training progress with intermediate checkpoints
+ After RFT Training: Validate that the final model meets your requirements
+ Comparing Models: Evaluate multiple model versions using consistent reward criteria

**Note**  
Use RFT Evaluation when you need custom, domain-specific metrics. For general-purpose evaluation (accuracy, perplexity, BLEU), use standard evaluation methods.

**Topics**
+ [

## Data format requirements
](#nova-hp-evaluate-rft-data-format)
+ [

## Preparing your evaluation recipe
](#nova-hp-evaluate-rft-recipe)
+ [

## Preset reward functions
](#nova-hp-evaluate-rft-preset)
+ [

## Creating your reward function
](#nova-hp-evaluate-rft-create-function)
+ [

## IAM permissions
](#nova-hp-evaluate-rft-iam)
+ [

## Executing the evaluation job
](#nova-hp-evaluate-rft-execution)
+ [

## Understanding evaluation results
](#nova-hp-evaluate-rft-results)

## Data format requirements


**Input data structure**  
RFT evaluation input data must follow the OpenAI Reinforcement Fine-Tuning format. Each example is a JSON object containing:
+ `messages`: Array of conversational turns with `system` and `user` roles
+ Optional additional metadata, such as `reference_answer`

**Data format example**  
The following example shows the required format:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Solve for x. Return only JSON like {\"x\": <number>}. Equation: 2x + 5 = 13"
        }
      ]
    }
  ],
  "reference_answer": {
    "x": 4
  }
}
```

**Current limitations**  
The following limitations apply to RFT evaluation:
+ Text only: No multimodal inputs (images, audio, video) are supported
+ Single-turn conversations: Only supports single user message (no multi-turn dialogues)
+ JSON format: Input data must be in JSONL format (one JSON object per line)
+ Model outputs: Evaluation is performed on generated completions from the specified model
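The constraints above (valid JSON per line, a single user turn, text-only content) can be checked before upload. The following is a minimal validator sketch whose field checks mirror the format in this section:

```python
# Checking one JSONL line against the RFT input constraints listed above.
import json

def validate_rft_line(line: str) -> list:
    """Return a list of problems found (an empty list means the line is valid)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    if sum(1 for m in messages if m.get("role") == "user") != 1:
        problems.append("exactly one user message is required (single-turn only)")
    for message in messages:
        for part in message.get("content", []):
            if isinstance(part, dict) and part.get("type") != "text":
                problems.append(f"unsupported content type: {part.get('type')}")
    return problems

line = ('{"messages": [{"role": "user", "content": '
        '[{"type": "text", "text": "Solve 2x + 5 = 13"}]}], '
        '"reference_answer": {"x": 4}}')
print(validate_rft_line(line))  # []
```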

## Preparing your evaluation recipe


**Sample recipe configuration**  
The following example shows a complete RFT evaluation recipe:

```
run:
  name: nova-lite-rft-eval-job
  model_type: amazon.nova-lite-v1:0:300k
  model_name_or_path: s3://escrow_bucket/model_location    # [MODIFIABLE] S3 path to your model or model identifier
  replicas: 1                                             # [MODIFIABLE] For SageMaker Training jobs only; fixed for  SageMaker HyperPod  jobs
  data_s3_path: ""                                        # [REQUIRED FOR HYPERPOD] Leave empty for SageMaker Training jobs
  output_s3_path: ""                                      # [REQUIRED] Output artifact S3 path for evaluation results

evaluation:
  task: rft_eval                                          # [FIXED] Do not modify
  strategy: rft_eval                                      # [FIXED] Do not modify
  metric: all                                             # [FIXED] Do not modify

# Inference Configuration
inference:
  max_new_tokens: 8196                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                               # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                              # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                          # [MODIFIABLE] Sampling temperature (0 = deterministic)
  top_logprobs: 0

# Evaluation Environment Configuration (NOT used in training)
rl_env:
  reward_lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<reward-function-name>
```

## Preset reward functions


Two preset reward functions (`prime_code` and `prime_math`) are available as a Lambda layer for easy integration with your RFT Lambda functions.

**Overview**  
These preset functions provide out-of-the-box evaluation capabilities for:
+ **prime\_code**: Code generation and correctness evaluation
+ **prime\_math**: Mathematical reasoning and problem-solving evaluation

**Quick setup**  
To use preset reward functions:

1. Download the Lambda layer from the [nova-custom-eval-sdk releases](https://github.com/aws/nova-custom-eval-sdk/releases)

1. Publish Lambda layer using AWS CLI:

   ```
   aws lambda publish-layer-version \
       --layer-name preset-function-layer \
       --description "Preset reward function layer with dependencies" \
       --zip-file fileb://universal_reward_layer.zip \
       --compatible-runtimes python3.9 python3.10 python3.11 python3.12 \
       --compatible-architectures x86_64 arm64
   ```

1. Add the layer to your Lambda function in the AWS Console. Select the `preset-function-layer` custom layer, and also add the `AWSSDKPandas-Python312` layer for NumPy dependencies.

1. Import and use in your Lambda code:

   ```
   from prime_code import compute_score  # For code evaluation
   from prime_math import compute_score  # For math evaluation
   ```

**prime\_code function**  
**Purpose**: Evaluates Python code generation tasks by executing code against test cases and measuring correctness.

**Example input dataset format from evaluation**:

```
{"messages":[{"role":"user","content":"Write a function that returns the sum of two numbers."}],"reference_answer":{"inputs":["3\n5","10\n-2","0\n0"],"outputs":["8","8","0"]}}
{"messages":[{"role":"user","content":"Write a function to check if a number is even."}],"reference_answer":{"inputs":["4","7","0","-2"],"outputs":["True","False","True","True"]}}
```

**Key features**:
+ Automatic code extraction from markdown code blocks
+ Function detection and call-based testing
+ Test case execution with timeout protection
+ Syntax validation and compilation checks
+ Detailed error reporting with tracebacks

**prime\_math function**  
**Purpose**: Evaluates mathematical reasoning and problem-solving capabilities with symbolic math support.

**Input format**:

```
{"messages":[{"role":"user","content":"What is the derivative of x^2 + 3x?"}],"reference_answer":"2*x + 3"}
```

**Key features**:
+ Symbolic math evaluation using SymPy
+ Multiple answer formats (LaTeX, plain text, symbolic)
+ Mathematical equivalence checking
+ Expression normalization and simplification

**Best practices**  
Follow these best practices when using preset reward functions:
+ Use proper data types in test cases (integers vs strings, booleans vs "True")
+ Provide clear function signatures in code problems
+ Include edge cases in test inputs (zero, negative numbers, empty inputs)
+ Format math expressions consistently in reference answers
+ Test your reward function with sample data before deployment
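Putting the data-type practice into code, one `prime_code`-style record can be serialized as follows. The field names follow the dataset examples above; the problem text is illustrative:

```python
# Building one prime_code-style JSONL record with string test inputs/outputs,
# matching the dataset examples shown above.
import json

record = {
    "messages": [
        {"role": "user",
         "content": "Write a function that returns the sum of two numbers."}
    ],
    "reference_answer": {
        "inputs": ["3\n5", "10\n-2", "0\n0"],   # stdin-style inputs, one per test
        "outputs": ["8", "8", "0"],             # expected stdout, as strings
    },
}

line = json.dumps(record)
print(line)
```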

## Creating your reward function


**Lambda ARN**  
Your Lambda function name must contain `SageMaker` (or `sagemaker`/`Sagemaker`), so the ARN matches the following format:

```
"arn:aws:lambda:*:*:function:*SageMaker*"
```

If the Lambda does not have this naming scheme, the job will fail with this error:

```
[ERROR] Unexpected error: lambda_arn must contain one of: ['SageMaker', 'sagemaker', 'Sagemaker'] when running on SMHP platform (Key: lambda_arn)
```

**Lambda function structure**  
Your Lambda function receives batches of model outputs and returns reward scores. Below is a sample implementation:

```
from typing import List, Any
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class MetricResult:
    """Individual metric result."""
    name: str
    value: float
    type: str


@dataclass
class RewardOutput:
    """Reward service output."""
    id: str
    aggregate_reward_score: float
    metrics_list: List[MetricResult]


def lambda_handler(event, context):
    """ Main lambda handler """
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """ Core grader function """
    scores: List[RewardOutput] = []

    for sample in samples:
        print("Sample: ", json.dumps(sample, indent=2))

        # Extract components
        idx = sample.get("id", "no id")
        if not idx or idx == "no id":
            print(f"ID is None/empty for sample: {sample}")

        ground_truth = sample.get("reference_answer")

        if "messages" not in sample:
            print(f"Messages is None/empty for id: {idx}")
            continue

        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            continue

        # Get model's response (last turn is assistant turn)
        last_message = sample["messages"][-1]

        if last_message["role"] != "nova_assistant":
            print(f"Last message is not from assistant for id: {idx}")
            continue

        if "content" not in last_message:
            print(f"Completion text is empty for id: {idx}")
            continue

        model_text = last_message["content"]

        # --- Actual scoring logic (lexical overlap) ---
        ground_truth_text = _extract_ground_truth_text(ground_truth)

        # Calculate main score and individual metrics
        overlap_score = _lexical_overlap_score(model_text, ground_truth_text)

        # Create two separate metrics as in the first implementation
        accuracy_score = overlap_score  # Use overlap as accuracy
        fluency_score = _calculate_fluency(model_text)  # New function for fluency

        # Create individual metrics
        metrics_list = [
            MetricResult(name="accuracy", value=accuracy_score, type="Metric"),
            MetricResult(name="fluency", value=fluency_score, type="Reward")
        ]

        ro = RewardOutput(
            id=idx,
            aggregate_reward_score=overlap_score,
            metrics_list=metrics_list
        )

        print(f"Response for id: {idx} is {ro}")
        scores.append(ro)

    # Convert to dict format
    result = []
    for score in scores:
        result.append({
            "id": score.id,
            "aggregate_reward_score": score.aggregate_reward_score,
            "metrics_list": [asdict(metric) for metric in score.metrics_list]
        })

    return result


def _extract_ground_truth_text(ground_truth: Any) -> str:
    """
    Turn the `ground_truth` field into a plain string.
    """
    if isinstance(ground_truth, str):
        return ground_truth

    if isinstance(ground_truth, dict):
        # Common patterns: { "explanation": "...", "answer": "..." }
        if "explanation" in ground_truth and isinstance(ground_truth["explanation"], str):
            return ground_truth["explanation"]
        if "answer" in ground_truth and isinstance(ground_truth["answer"], str):
            return ground_truth["answer"]
        # Fallback: stringify the whole dict
        return json.dumps(ground_truth, ensure_ascii=False)

    # Fallback: stringify anything else
    return str(ground_truth)


def _tokenize(text: str) -> List[str]:
    # Very simple tokenizer: lowercase + alphanumeric word chunks
    return re.findall(r"\w+", text.lower())


def _lexical_overlap_score(model_text: str, ground_truth_text: str) -> float:
    """
    Simple lexical overlap score in [0, 1]:
      score = |tokens(model) ∩ tokens(gt)| / |tokens(gt)|
    """
    gt_tokens = _tokenize(ground_truth_text)
    model_tokens = _tokenize(model_text)

    if not gt_tokens:
        return 0.0

    gt_set = set(gt_tokens)
    model_set = set(model_tokens)
    common = gt_set & model_set

    return len(common) / len(gt_set)


def _calculate_fluency(text: str) -> float:
    """
    Calculate a simple fluency score based on:
    - Average word length
    - Text length
    - Sentence structure

    Returns a score between 0 and 1.
    """
    # Simple implementation - could be enhanced with more sophisticated NLP
    words = _tokenize(text)

    if not words:
        return 0.0

    # Average word length normalized to [0,1] range
    # Assumption: average English word is ~5 chars, so normalize around that
    avg_word_len = sum(len(word) for word in words) / len(words)
    word_len_score = min(avg_word_len / 10, 1.0)

    # Text length score - favor reasonable length responses
    ideal_length = 100  # words
    length_score = min(len(words) / ideal_length, 1.0)

    # Simple sentence structure check (periods, question marks, etc.)
    sentence_count = len(re.findall(r'[.!?]+', text)) + 1
    sentence_ratio = min(sentence_count / (len(words) / 15), 1.0)

    # Combine scores
    fluency_score = (word_len_score + length_score + sentence_ratio) / 3

    return fluency_score
```

**Lambda request format**  
Your Lambda function receives data in this format:

```
[
  {
    "id": "sample-001",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Do you have a dedicated security team?"
          }
        ]
      },
      {
        "role": "nova_assistant",
        "content": [
          {
            "type": "text",
            "text": "As an AI developed by Company, I don't have a dedicated security team in the traditional sense. However, the development and deployment of AI systems like me involve extensive security measures, including data encryption, user privacy protection, and other safeguards to ensure safe and responsible use."
          }
        ]
      }
    ],
    "reference_answer": {
      "compliant": "No",
      "explanation": "As an AI developed by Company, I do not have a traditional security team. However, the deployment involves stringent safety measures, such as encryption and privacy safeguards."
    }
  }
]
```

**Note**  
The message structure includes the nested `content` array, matching the input data format. The last message with role `nova_assistant` contains the model's generated response.

**Lambda response format**  
Your Lambda function must return data in this format:

```
[
  {
    "id": "sample-001",
    "aggregate_reward_score": 0.75,
    "metrics_list": [
      {
        "name": "accuracy",
        "value": 0.85,
        "type": "Metric"
      },
      {
        "name": "fluency",
        "value": 0.90,
        "type": "Reward"
      }
    ]
  }
]
```

**Response fields**:
+ `id`: Must match the input sample ID
+ `aggregate_reward_score`: Overall score (typically 0.0 to 1.0)
+ `metrics_list`: Array of individual metrics with:
  + `name`: Metric identifier (e.g., "accuracy", "fluency")
  + `value`: Metric score (typically 0.0 to 1.0)
  + `type`: Either "Metric" (for reporting) or "Reward" (used in training)
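Before deploying, it can help to check that each response entry your Lambda produces matches this contract. A minimal validation sketch:

```python
# Checking one response entry against the Lambda response contract above.
def validate_reward_output(entry: dict) -> bool:
    """Return True if the entry has the required fields with valid types."""
    if not isinstance(entry.get("id"), str):
        return False
    if not isinstance(entry.get("aggregate_reward_score"), (int, float)):
        return False
    metrics = entry.get("metrics_list")
    if not isinstance(metrics, list):
        return False
    return all(
        isinstance(m.get("name"), str)
        and isinstance(m.get("value"), (int, float))
        and m.get("type") in ("Metric", "Reward")
        for m in metrics
    )

entry = {
    "id": "sample-001",
    "aggregate_reward_score": 0.75,
    "metrics_list": [
        {"name": "accuracy", "value": 0.85, "type": "Metric"},
        {"name": "fluency", "value": 0.90, "type": "Reward"},
    ],
}
print(validate_reward_output(entry))  # True
```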

## IAM permissions


**Required permissions**  
Your SageMaker AI execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker AI execution role:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"
    }
  ]
}
```

**Lambda execution role**  
Your Lambda function's execution role needs basic Lambda execution permissions:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

**Additional permissions**: If your Lambda function accesses other AWS services (e.g., Amazon S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Executing the evaluation job


1. **Prepare your data**
   + Format your evaluation data according to the data format requirements
   + Upload your JSONL file to Amazon S3: `s3://your-bucket/eval-data/eval_data.jsonl`

1. **Configure your recipe**

   Update the sample recipe with your configuration:
   + Set `model_name_or_path` to your model location
   + Set `lambda_arn` to your reward function ARN
   + Set `output_s3_path` to your desired output location
   + Adjust `inference` parameters as needed

   Save the recipe as `rft_eval_recipe.yaml`

1. **Run the evaluation**

   Execute the evaluation job using the provided notebook: [Nova model evaluation notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook)

1. **Monitor progress**

   Monitor your evaluation job through:
   + SageMaker AI Console: Check job status and logs
   + CloudWatch Logs: View detailed execution logs
   + Lambda Logs: Debug reward function issues

## Understanding evaluation results


**Output format**  
The evaluation job outputs results to your specified Amazon S3 location in JSONL format. Each line contains the evaluation results for one sample:

```
{
  "id": "sample-001",
  "aggregate_reward_score": 0.75,
  "metrics_list": [
    {
      "name": "accuracy",
      "value": 0.85,
      "type": "Metric"
    },
    {
      "name": "fluency",
      "value": 0.90,
      "type": "Reward"
    }
  ]
}
```

**Note**  
The RFT Evaluation Job Output is identical to the Lambda Response format. The evaluation service passes through your Lambda function's response without modification, ensuring consistency between your reward calculations and the final results.

**Interpreting results**  
**Aggregate Reward Score**:
+ Range: Typically 0.0 (worst) to 1.0 (best), but depends on your implementation
+ Purpose: Single number summarizing overall performance
+ Usage: Compare models, track improvement over training

**Individual Metrics**:
+ Metric Type: Informational metrics for analysis
+ Reward Type: Metrics used during RFT training
+ Interpretation: Higher values generally indicate better performance (unless you design inverse metrics)

**Performance benchmarks**  
What constitutes "good" performance depends on your use case:


| Score Range | Interpretation | Action | 
| --- |--- |--- |
| 0.8 - 1.0 | Excellent | Model ready for deployment | 
| 0.6 - 0.8 | Good | Minor improvements may be beneficial | 
| 0.4 - 0.6 | Fair | Significant improvement needed | 
| 0.0 - 0.4 | Poor | Review training data and reward function | 

**Important**  
These are general guidelines. Define your own thresholds based on business requirements, baseline model performance, domain-specific constraints, and cost-benefit analysis of further training.
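To compare a run against thresholds like these, the per-sample output JSONL can be aggregated into mean scores. The two records below are hypothetical samples in the output format described above:

```python
# Aggregating per-sample RFT evaluation results into mean scores.
import json
from collections import defaultdict

lines = [
    '{"id": "s1", "aggregate_reward_score": 0.8, "metrics_list": [{"name": "accuracy", "value": 0.9, "type": "Metric"}]}',
    '{"id": "s2", "aggregate_reward_score": 0.6, "metrics_list": [{"name": "accuracy", "value": 0.7, "type": "Metric"}]}',
]

totals, metric_totals = [], defaultdict(list)
for line in lines:
    rec = json.loads(line)
    totals.append(rec["aggregate_reward_score"])
    for m in rec["metrics_list"]:
        metric_totals[m["name"]].append(m["value"])

mean_score = sum(totals) / len(totals)
print(f"mean aggregate_reward_score: {mean_score:.2f}")
for name, values in metric_totals.items():
    print(f"mean {name}: {sum(values) / len(values):.2f}")
```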

# Monitoring HyperPod jobs with MLflow

You can use MLflow to track and monitor your training jobs on SageMaker HyperPod. Follow these steps to set up MLflow and connect it to your training recipes.

***Create the MLflow App***

Example AWS CLI command

```
aws sagemaker-mlflow create-mlflow-app \
    --name <app-name> \
    --artifact-store-uri <s3-bucket-name> \
    --role-arn <role-arn> \
    --region <region-name>
```

Example output

```
{
    "Arn": "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ"
}
```

***Generate pre-signed URL***

Example AWS CLI command

```
aws sagemaker-mlflow create-presigned-mlflow-app-url \
    --arn <app-arn> \
    --region <region-name> \
    --output text
```

Example output

```
https://app-LGZEOZ2UY4NZ.mlflow.sagemaker.us-east-1.app.aws/auth?authToken=eyJhbGciOiJIUzI1NiJ9.eyJhdXRoVG9rZW5JZCI6IkxETVBPUyIsImZhc0NyZWRlbnRpYWxzIjoiQWdWNGhDM1VvZ0VYSUVsT2lZOVlLNmxjRHVxWm1BMnNhZ3JDWEd3aFpOSmdXbzBBWHdBQkFCVmhkM010WTNKNWNIUnZMWEIxWW14cFl5MXJaWGtBUkVFd09IQmtVbU5IUzJJMU1VTnVaVEl3UVhkUE5uVm9Ra2xHTkZsNVRqTTNXRVJuTTNsalowNHhRVFZvZERneVdrMWlkRlZXVWpGTWMyWlRUV1JQWmpSS2R6MDlBQUVBQjJGM2N5MXJiWE1BUzJGeWJqcGhkM002YTIxek9uVnpMV1ZoYzNRdE1Ub3pNVFF4TkRZek1EWTBPREk2YTJWNUx6Y3dOMkpoTmpjeExUUXpZamd0TkRFeU5DMWhaVFUzTFRrMFlqTXdZbUptT1RJNU13QzRBUUlCQUhnQjRVMDBTK3ErVE51d1gydlFlaGtxQnVneWQ3YnNrb0pWdWQ2NmZjVENVd0ZzRTV4VHRGVllHUXdxUWZoeXE2RkJBQUFBZmpCOEJna3Foa2lHOXcwQkJ3YWdiekJ0QWdFQU1HZ0dDU3FHU0liM0RRRUhBVEFlQmdsZ2hrZ0JaUU1FQVM0d0VRUU1yOEh4MXhwczFBbmEzL1JKQWdFUWdEdTI0K1M5c2VOUUNFV0hJRXJwdmYxa25MZTJteitlT29pTEZYNTJaeHZsY3AyZHFQL09tY3RJajFqTWFuRjMxZkJyY004MmpTWFVmUHRhTWdJQUFCQUE3L1pGT05DRi8rWnVPOVlCVnhoaVppSEFSLy8zR1I0TmR3QWVxcDdneHNkd2lwTDJsVWdhU3ZGNVRCbW9uMUJnLy8vLy93QUFBQUVBQUFBQUFBQUFBQUFBQUFFQUFBUTdBMHN6dUhGbEs1NHdZbmZmWEFlYkhlNmN5OWpYOGV3T2x1NWhzUWhGWFllRXNVaENaQlBXdlQrVWp5WFY0ZHZRNE8xVDJmNGdTRUFOMmtGSUx0YitQa0tmM0ZUQkJxUFNUQWZ3S1oyeHN6a1lDZXdwRlNpalFVTGtxemhXbXBVcmVDakJCOHNGT3hQL2hjK0JQalY3bUhOL29qcnVOejFhUHhjNSt6bHFuak9CMHljYy8zL2JuSHA3NVFjRE8xd2NMbFJBdU5KZ2RMNUJMOWw1YVVPM0FFMlhBYVF3YWY1bkpwTmZidHowWUtGaWZHMm94SDJSNUxWSjNkbG40aGVRbVk4OTZhdXdsellQV253N2lTTDkvTWNidDAzdVZGN0JpUnRwYmZMN09JQm8wZlpYSS9wK1pUNWVUS2wzM2tQajBIU3F6NisvamliY0FXMWV4VTE4N1QwNHpicTNRcFhYMkhqcDEvQnFnMVdabkZoaEwrekZIaUV0Qjd4U1RaZkZsS2xRUUhNK0ZkTDNkOHIyRWhCMjFya2FBUElIQVBFUk5Pd1lnNmFzM2pVaFRwZWtuZVhxSDl3QzAyWU15R0djaTVzUEx6ejh3ZTExZVduanVTai9DZVJpZFQ1akNRcjdGMUdKWjBVREZFbnpNakFuL3Y3ajA5c2FMczZnemlCc2FLQXZZOWpib0JEYkdKdGZ0N2JjVjl4eUp4amptaW56TGtoVG5pV2dxV3g5MFZPUHlWNWpGZVk1QTFrMmw3bDArUjZRTFNleHg4d1FrK0FqVGJuLzFsczNHUTBndUtESmZKTWVGUVczVEVrdkp5VlpjOC9xUlpIODhybEpKOW1FSVdOd1BMU21yY1l6TmZwVTlVOGdoUDBPUWZvQ3FvcW1WaUhEYldaT294bGpmb295cS8yTDFKNGM3NTJUaVpFd1hnaG9haFBYdGFjRnA2NTVUYjY5eGxTN25FaXZjTTlzUjdTT3RE
MEMrVHIyd0cxNEJ3Zm9NZTdKOFhQeVRtcmQ0QmNKOEdOYnVZTHNRNU9DcFlsV3pVNCtEcStEWUI4WHk1UWFzaDF0dzJ6dGVjVVQyc0hsZmwzUVlrQ0d3Z1hWam5Ia2hKVitFRDIrR3Fpc3BkYjRSTC83RytCRzRHTWNaUE02Q3VtTFJkMnZLbnozN3dUWkxwNzdZNTdMQlJySm9Tak9idWdNUWdhOElLNnpWL2VtcFlSbXJsVjZ5VjZ6S1h5aXFKWFk3TTBXd3dSRzd5Q0xYUFRtTGt3WGE5cXF4NkcvZDY1RS83V3RWMVUrNFIxMlZIUmVUMVJmeWw2SnBmL2FXWFVCbFQ2ampUR0M5TU1uTk5OVTQwZHRCUTArZ001S1d2WGhvMmdmbnhVcU1OdnFHblRFTWdZMG5ZL1FaM0RWNFozWUNqdkFOVWVsS1NCdkxFbnY4SEx0WU9uajIrTkRValZOV1h5T1c4WFowMFFWeXU0ZU5LaUpLQ1hJbnI1N3RrWHE3WXl3b0lZV0hKeHQwWis2MFNQMjBZZktYYlhHK1luZ3F6NjFqMkhIM1RQUmt6dW5rMkxLbzFnK1ZDZnhVWFByeFFmNUVyTm9aT2RFUHhjaklKZ1FxRzJ2eWJjbFRNZ0M5ZXc1QURVcE9KL1RrNCt2dkhJMDNjM1g0UXcrT3lmZHFUUzJWb3N4Y0hJdG5iSkZmdXliZi9lRlZWRlM2L3lURkRRckhtQ1RZYlB3VXlRNWZpR20zWkRhNDBQUTY1RGJSKzZSbzl0S3c0eWFlaXdDVzYwZzFiNkNjNUhnQm5GclMyYytFbkNEUFcrVXRXTEF1azlISXZ6QnR3MytuMjdRb1cvSWZmamJucjVCSXk3MDZRTVR4SzhuMHQ3WUZuMTBGTjVEWHZiZzBvTnZuUFFVYld1TjhFbE11NUdpenZxamJmeVZRWXdBSERCcDkzTENsUUJuTUdVQ01GWkNHUGRPazJ2ZzJoUmtxcWQ3SmtDaEpiTmszSVlyanBPL0h2Z2NZQ2RjK2daM3lGRjMyTllBMVRYN1FXUkJYZ0l4QU5xU21ZTHMyeU9uekRFenBtMUtnL0tvYmNqRTJvSDJkZHcxNnFqT0hRSkhkVWRhVzlZL0NQYTRTbWxpN2pPbGdRPT0iLCJjaXBoZXJUZXh0IjoiQVFJQkFIZ0I0VTAwUytxK1ROdXdYMnZRZWhrcUJ1Z3lkN2Jza29KVnVkNjZmY1RDVXdHeDExRlBFUG5xU1ZFbE5YVUNrQnRBQUFBQW9qQ0Jud1lKS29aSWh2Y05BUWNHb0lHUk1JR09BZ0VBTUlHSUJna3Foa2lHOXcwQkJ3RXdIZ1lKWUlaSUFXVURCQUV1TUJFRURHemdQNnJFSWNEb2dWSTl1d0lCRUlCYitXekkvbVpuZkdkTnNYV0VCM3Y4NDF1SVJUNjBLcmt2OTY2Q1JCYmdsdXo1N1lMTnZUTkk4MEdkVXdpYVA5NlZwK0VhL3R6aGgxbTl5dzhjcWdCYU1pOVQrTVQxdzdmZW5xaXFpUnRRMmhvN0tlS2NkMmNmK3YvOHVnPT0iLCJzdWIiOiJhcm46YXdzOnNhZ2VtYWtlcjp1cy1lYXN0LTE6MDYwNzk1OTE1MzUzOm1sZmxvdy1hcHAvYXBwLUxHWkVPWjJVWTROWiIsImlhdCI6MTc2NDM2NDYxNSwiZXhwIjoxNzY0MzY0OTE1fQ.HNvZOfqft4m7pUS52MlDwoi1BA8Vsj3cOfa_CvlT4uw
```

***Open presigned URL and view the app***

Click 

```
https://app-LGZEOZ2UY4NZ.mlflow.sagemaker.us-east-1.app.aws/auth?authToken=eyJhbGciOiJIUzI1NiJ9...
```

**View**

![Example Amazon Nova image.](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/screenshot-nova-model-1.png)


***Pass the MLflow app ARN to the `run` block of your SageMaker HyperPod recipe***

**Recipe**

```
run:
    mlflow_tracking_uri: arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ
```
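In practice, `mlflow_tracking_uri` appears alongside the other keys of the recipe's `run` block. The sketch below shows one plausible shape; the surrounding keys (`name`, `model_type`, `replicas`) are illustrative assumptions, not field names confirmed by this guide, so match them to your actual recipe template:

```yaml
run:
  name: "my-nova-sft-run"            # hypothetical job name
  model_type: "amazon.nova-2-lite"   # hypothetical model identifier
  replicas: 4                        # hypothetical instance count
  # ARN of the MLflow app in your account and Region
  mlflow_tracking_uri: arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ
```

Once the ARN is set, training metrics from the job are logged to that MLflow app and can be viewed through its tracking UI, as in the screenshots above and below.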

**View**

![Example Amazon Nova image.](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/screenshot-nova-model-2.png)
