

# Customizing Amazon Nova models on SageMaker AI

You can customize Amazon Nova models, including the enhanced Amazon Nova 2.0 models, through [recipes](nova-model-recipes.md#nova-model-get-recipes) and train them on SageMaker. These recipes support techniques such as supervised fine-tuning (SFT) and Reinforcement Fine-Tuning (RFT), with both full-rank and low-rank adaptation (LoRA) options.

The end-to-end customization workflow involves stages like model training, model evaluation, and deployment for inference. This model customization approach on SageMaker provides greater flexibility and control to fine-tune the supported Amazon Nova models, optimize hyperparameters with precision, and implement techniques such as LoRA parameter-efficient fine-tuning (PEFT), full-rank SFT, RFT, and continued pre-training (CPT).

## Customization approaches


SageMaker offers two approaches for customizing Amazon Nova models:

**UI-based experience** – Customize Amazon Nova models through a simple, guided interface. This approach provides an end-to-end workflow including training, evaluation, and deployment without writing code. The UI-based experience is ideal for rapid experimentation, proof-of-concept development, and users who prefer a visual workflow.

**Code-based experience** – Use the SageMaker Python SDK, the Nova Forge SDK, and training recipes to customize models programmatically. This approach offers greater flexibility, allowing you to configure advanced hyperparameters, integrate with CI/CD pipelines, and automate training workflows. The code-based experience is recommended for production workloads, complex customization requirements, and teams with established MLOps practices.



| Approach | Best for | Key benefits | 
| --- | --- | --- | 
| UI-based | Experimentation, prototyping, quick iterations | Simple setup, guided workflow, no coding required | 
| Code-based | Production, automation, advanced configurations | Full flexibility, pipeline integration, version control | 

## Customization platforms


AWS offers three platforms for customizing Amazon Nova models, each designed for different use cases and requirements:

**Amazon Bedrock** – Provides the easiest and fastest path to model customization with minimal setup. Bedrock handles all infrastructure management automatically, allowing you to focus on your data and use case. This platform is ideal when you need the quickest time-to-value and prefer a fully managed experience.

**SageMaker training jobs** – Provides a fully managed environment for customizing Amazon Nova models where you don't need to create or maintain any clusters. The service automatically handles all infrastructure provisioning, scaling, and resource management, allowing you to focus on configuring your training parameters and submitting your job. This platform offers a balance between ease of use and flexibility, supporting techniques like parameter-efficient fine-tuning (PEFT), full-rank fine-tuning, and reinforcement fine-tuning (RFT).

**SageMaker HyperPod** – Offers a specialized environment for large-scale distributed training, in which you create and manage EKS clusters with restricted instance groups (RIGs). This platform gives you maximum flexibility in configuring your training environment with specialized GPU instances and integrated Amazon FSx for Lustre storage, making it particularly well-suited for advanced distributed training scenarios, ongoing model development, and enterprise-scale customization workloads.



| Platform | Complexity | Flexibility | Best for | 
| --- | --- | --- | --- | 
| Amazon Bedrock | Lowest | Standard | Fastest customization, minimal setup | 
| SageMaker training jobs | Medium | High | Balanced flexibility and ease of use | 
| SageMaker HyperPod | Highest | Maximum | Large distributed training, enterprise workloads | 

**Note**  
If you provide a KMS key to your Amazon Nova model customization training job for encryption in the Amazon-owned output S3 bucket:  
+ You must provide the same KMS key when calling subsequent iterative training jobs, or when calling the Amazon Bedrock `CreateCustomModel` API with the encrypted model.
+ The identity calling the `CreateTrainingJob` API (rather than the execution role) must have the `kms:CreateGrant`, `kms:RetireGrant`, `kms:Encrypt`, and `kms:GenerateDataKey` permissions, as defined in the KMS key policy.
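A key policy statement granting the calling identity these permissions might look like the following sketch (the statement ID and principal ARN are placeholders; scope them to your actual caller):

```
{
    "Sid": "AllowNovaCustomizationCaller",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/TrainingJobCaller"
    },
    "Action": [
        "kms:CreateGrant",
        "kms:RetireGrant",
        "kms:Encrypt",
        "kms:GenerateDataKey"
    ],
    "Resource": "*"
}
```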

# General prerequisites

The customization process involves several key stages, including model training, evaluation, and deployment for inference, each requiring specific resources and configurations. Before beginning your Amazon Nova model customization on SageMaker, ensure you have the following general prerequisites.
+ An AWS account. If you don't have an AWS account, follow [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html#sign-up-for-aws) to sign up for one.
+ The AWS CLI [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
+ Access to the [base Amazon Nova model customization recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/).
+ Familiarity with [YAML](https://yaml.org/) configuration files.
+ Familiarity with how to run a notebook in your environment.
+ Familiarity with how to create AWS resources like [Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) and [IAM roles with appropriate permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html).
+ Familiarity with how to [train a model with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).
+ Familiarity with [SageMaker HyperPod with EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html).
+ Familiarity with the [SageMaker HyperPod CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-run-jobs-access-nodes.html).
+ Familiarity with [Amazon Nova foundation models](https://docs.aws.amazon.com/nova/latest/userguide/customization.html).
+ Familiarity with [available Amazon Nova models and algorithms for customization](nova-model-recipes.md#nova-model-algorithm).
+ Familiarity with [Amazon Bedrock inference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-how.html).

# Amazon Nova recipes

You can get an Amazon Nova recipe from the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes) repository. An Amazon Nova recipe is a [YAML](https://yaml.org/) configuration file that provides details to SageMaker on how to run your model customization job. It provides the base model name, sets training hyperparameters, defines optimization settings, and includes any additional options required to fine-tune or train the model successfully.
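As an illustration of that shape only (the field names below are hypothetical; consult the actual recipe files in the repository for the exact schema), a recipe might look like:

```
# Illustrative sketch, not a real recipe -- see the repository for actual files
run:
  name: my-nova-sft-job
  model_name_or_path: nova-lite-2
training_config:
  max_epochs: 2
  lr: 1.0e-5
  peft:
    peft_scheme: lora
```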

You can also access Amazon Nova recipes through SageMaker AI Monarch by navigating to the model hub, selecting AWS, and browsing Amazon Nova models to find their associated recipes. SageMaker AI Monarch provides sample notebooks for each recipe, which include all the necessary steps to modify the recipes and run customization jobs using SageMaker training jobs or SageMaker HyperPod environments.

To access the recipes pages in Amazon SageMaker Studio, the execution role must have the following permissions.


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::*model-customization-recipes*"
            ]
        }
    ]
}
```


To execute the sample notebooks on SageMaker training jobs, use one of the following SageMaker distribution image versions: `2.7.1+`, `2.8.0+`, `3.2.1+`, `3.3.0+`. This applies to notebooks accessed through SageMaker AI Monarch.

**Topics**
+ [Getting Amazon Nova recipes](#nova-model-get-recipes)
+ [Getting Amazon Nova Forge recipes](#nova-model-get-forge-recipes)
+ [Available models and algorithms](#nova-model-algorithm)
+ [Amazon Nova 2.0 Lite](#nova-model-recipes-reference-novalite2)

## Getting Amazon Nova recipes

To get a base Amazon Nova recipe, clone the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes) repository by running the following command.

```
git clone https://github.com/aws/sagemaker-hyperpod-recipes.git
```

The base recipes are available in the [recipes_collection/recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/) directory of the cloned repository.

```
cd sagemaker-hyperpod-recipes/recipes_collection/recipes/
```

The Amazon Nova customization recipes are in the following folders.



| Recipe type | Folder | 
| --- | --- | 
| SFT (Full-rank and PEFT) | [fine-tuning/nova](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova) | 
| Evaluation | [evaluation/nova](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/evaluation/nova) | 
| CPT | [training/nova](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training) | 

## Getting Amazon Nova Forge recipes

To access specialized Amazon Nova Forge recipes, [set up your access to Amazon Nova Forge](nova-forge-access.md) and then follow [this workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US/02-smhp-rig/03-nova-forge) to set up the Forge version of the CLI. For SageMaker training jobs, the AWS console and Nova Forge SDK automatically access Forge recipes.

## Available models and algorithms


The following table summarizes the availability of customization for Amazon Nova models and supported algorithms using SageMaker.


| Model name | Model ID | Fine-tuning | Notes | 
| --- |--- |--- |--- |
| Amazon Nova Lite 2.0 | amazon.nova-2-lite-v1:0:256k | Yes | SFT – accepts text and/or image, or text and/or video, as input and produces text output; a single job can't combine text, image, and video in the same run. RFT – accepts text and image as input for single-turn conversations and produces text output, with improved reward optimization capabilities. CPT is also supported. | 

## Amazon Nova 2.0 Lite


The following table provides detailed reference information for the Amazon Nova 2.0 Lite recipes.



| Model | Category/Sub-category | Technique | Recipe Name | Image URI (Training Jobs) | Image URI (SageMaker HyperPod) | Compute Instance | 
| --- | --- | --- | --- | --- | --- | --- | 
| Nova 2.0 Lite | Training | Continued Pre Training | nova\_lite\_2\_0\_p5x8\_gpu\_pretrain.yaml | N/A | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-CPT-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Fine Tuning | Parameter Efficient Fine Tuning | nova\_lite\_2\_0\_p5\_gpu\_lora\_sft.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-SFT-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Fine Tuning | Full-Rank Fine Tuning | nova\_lite\_2\_0\_p5\_gpu\_sft.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-SFT-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Fine Tuning | Parameter Efficient RFT | nova\_lite\_v2\_smtj\_p5\_p5en\_gpu\_lora\_rft.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-RFT-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-RFT-TRAIN-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Fine Tuning | Full Rank RFT | nova\_lite\_v2\_smtj\_p5en\_gpu\_rft.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-RFT-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-RFT-TRAIN-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Evaluation | Bring Your Own Data | nova\_lite\_2\_0\_p5\_48xl\_gpu\_bring\_your\_own\_dataset\_eval.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Evaluation | General Text Benchmark | nova\_lite\_2\_0\_p5\_48xl\_gpu\_general\_text\_benchmark\_eval.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Evaluation | RFT Evaluation | nova\_lite\_2\_0\_p5\_48xl\_gpu\_rft\_eval.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 
| Nova 2.0 Lite | Evaluation | LLM Rubric Based Judge | nova\_lite\_2\_0\_p5\_48xl\_gpu\_rubric\_llm\_judge\_eval.yaml | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest | ml.p5.48xlarge, ml.p5en.48xlarge | 

# Amazon Nova customization on SageMaker Training Jobs

SageMaker Training Jobs is an environment that enables you to train machine learning models at scale. It automatically provisions and scales compute resources, loads training data from sources like Amazon S3, executes your training code, and stores the resulting model artifacts.

The purpose of training is to customize the base Amazon Nova model using your proprietary data. The training process typically involves preparing your data, choosing a [recipe](nova-model-recipes.md), modifying configuration parameters in YAML files, and submitting a training job. The training process outputs a trained model checkpoint in a service-managed Amazon S3 bucket, and you can use this checkpoint location for evaluation jobs. Artifacts in the service-managed bucket are encrypted with SageMaker AI-managed KMS keys; service-managed Amazon S3 buckets don't currently support data encryption using customer-managed KMS keys.

## Best Practices for Amazon Nova customization


### Overview


This section provides an overview of customization techniques and helps you choose the best approach for your needs and available data.

#### Two stages of LLM training


Large language model training consists of two major stages: pre-training and post-training. During pre-training, the model processes tokens of raw text and optimizes for next-token prediction. This process creates a pattern completer that absorbs syntax, semantics, facts, and reasoning patterns from web and curated text. However, the pre-trained model doesn't understand instructions, user goals, or context-appropriate behavior. It continues text in whatever style fits its training distribution. A pre-trained model autocompletes rather than follows directions, produces inconsistent formatting, and can mirror undesirable biases or unsafe content from the training data. Pre-training builds general competence, not task usefulness.

Post-training transforms the pattern completer into a useful assistant. You run multiple rounds of Supervised Fine-Tuning (SFT) to teach the model to follow instructions, adhere to schemas and policies, call tools, and produce reliable outputs by imitating high-quality demonstrations. This alignment teaches the model to respond to prompts as tasks rather than text to continue. You then apply Reinforcement Fine-Tuning (RFT) to optimize behavior using measurable feedback (such as verifiers or an LLM-as-a-judge), balancing trade-offs like accuracy versus brevity, safety versus coverage, or multi-step reasoning under constraints. In practice, you alternate SFT and RFT in cycles to shape the pre-trained model into a reliable, policy-aligned system that performs complex tasks consistently.

### Choose the right customization approach


This section covers the post-training customization strategies: RFT and SFT.

#### Reinforcement fine-tuning (RFT)


Reinforcement fine-tuning improves model performance through feedback signals—measurable scores or rewards that indicate response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards. This approach works well for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications that require nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that you can programmatically evaluate. For example, answering complex legal questions is an ideal use case for RFT because you want to teach the model how to reason better to answer questions more accurately.
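As a minimal sketch of what such a programmatic reward might look like (the checks, weights, and function name below are illustrative only, not part of any Nova API), a reward function maps a candidate response to a scalar score:

```
# Hypothetical reward function for RFT; checks and weights are illustrative.
import re

def reward(response: str, expected_citation: str) -> float:
    """Score a candidate answer in [0, 1]; higher is better."""
    score = 0.0
    if expected_citation in response:          # cites the controlling authority
        score += 0.6
    if len(response.split()) <= 150:           # rewards brevity
        score += 0.2
    if re.search(r"\bConclusion:", response):  # follows the required format
        score += 0.2
    return score
```

Because the score decomposes into independent checks, you can tighten or reweight individual criteria as you observe how the model exploits the reward.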

##### How it works


In reinforcement fine-tuning, you start from an instruction-tuned baseline and treat each prompt like a small tournament. For a given input, you sample a handful of candidate answers from the model, score each one with the reward function, then rank them within that group. The update step nudges the model to make higher-scoring candidates more likely next time and lower-scoring ones less likely, while a stay-close-to-baseline constraint keeps behavior from drifting or becoming verbose or exploitative. You repeat this loop over many prompts, refreshing hard cases, tightening verifiers or judge rubrics when you see exploits, and continuously tracking task metrics.
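The within-group ranking step above can be sketched as group-relative scoring, where each candidate's reward is compared against the group average (a simplified illustration; the actual update mechanics are handled by the training recipe):

```
# Simplified illustration of per-prompt candidate ranking in RFT.
def group_advantages(scores):
    """Center each candidate's reward on the group mean: above-average
    answers get positive weight, below-average ones negative weight."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

# Rewards for four sampled candidate answers to one prompt
scores = [0.9, 0.4, 0.2, 0.7]
advantages = group_advantages(scores)
# The update makes positive-advantage candidates more likely, negative less.
```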

##### When to use RFT


Tasks that benefit most from RFT share several traits. They have measurable success signals even when a single correct output is hard to specify. They admit partial credit or graded quality so you can rank better versus worse answers within a prompt or using a reward function. They involve multiple objectives that must be balanced (such as accuracy with brevity, clarity, safety, or cost). They require adherence to explicit constraints that you can programmatically check. They operate in tool-mediated or environment-based settings where outcomes are observable (success or failure, latency, resource use). They occur in low-label regimes where collecting gold targets is expensive but automated or rubric-based feedback is plentiful. RFT works best when you can turn quality into a reliable scalar or ranking and want the model to preferentially amplify higher-scoring behaviors without needing exhaustive labeled targets.

**Consider other methods when:**
+ You have plentiful, reliable labeled input-output pairs – Use SFT
+ The main gap is knowledge or jargon – Use retrieval-augmented generation (RAG)
+ Your reward signal is noisy or unreliable and you can't fix it with better rubrics or checkers – Stabilize that first before RFT

##### When not to use RFT


Avoid RFT in these situations:
+ You can cheaply produce reliable labeled input-output pairs (SFT is simpler, cheaper, and more stable)
+ The gap is knowledge or jargon rather than behavior (use RAG)
+ Your reward signal is noisy, sparse, easy to game, or expensive or slow to compute (fix the evaluator first)
+ Baseline performance is near-zero (bootstrap with SFT before optimizing preferences)
+ The task has deterministic schemas, strict formatting, or a single correct answer (SFT or rule-based validation works better)
+ Tight latency or cost budgets can't absorb the extra sampling or exploration RFT requires
+ Safety or policy constraints aren't crisply specified and enforceable in the reward

If you can point to "the right answer," use SFT. If you need new knowledge, use RAG. Use RFT only after you have a solid baseline and a robust, fast, hard-to-exploit reward function.

#### Supervised fine-tuning (SFT)


Supervised fine-tuning trains the LLM on a dataset of human-labeled input-output pairs for your task. You provide examples of prompts (questions, instructions, and so on) with the correct or desired responses, and continue training the model on these examples. The model adjusts its weights to minimize a supervised loss (typically cross-entropy between its predictions and the target output tokens). This is the same training used in most supervised machine learning tasks, applied to specialize an LLM.

SFT changes behavior, not knowledge. It doesn't teach the model new facts or jargon it didn't see in pre-training. It teaches the model how to answer, not what to know. If you need new domain knowledge (such as internal terminology), use retrieval-augmented generation (RAG) to provide that context at inference time. SFT then adds the desired instruction-following behavior on top.

##### How it works


SFT optimizes the LLM by minimizing the average cross-entropy loss on response tokens, treating prompt tokens as context and masking them from the loss. The model internalizes your target style, structure, and decision rules, learning to generate the correct completion for each prompt. For example, to classify documents into custom categories, you fine-tune the model with prompts (the document text) and labeled completions (the category labels). You train on those pairs until the model outputs the right label for each prompt with high probability.
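The loss masking described above can be illustrated with a toy computation (plain Python, with hypothetical per-token probabilities):

```
# Toy illustration of prompt masking in the SFT loss (hypothetical numbers).
import math

def masked_cross_entropy(token_probs, loss_mask):
    """Average negative log-likelihood over response tokens only.
    token_probs: model probability assigned to each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens (excluded)."""
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# Two prompt tokens (masked out) followed by two response tokens (scored)
probs = [0.10, 0.20, 0.90, 0.80]
mask = [0, 0, 1, 1]
loss = masked_cross_entropy(probs, mask)
```

The low probabilities on the prompt tokens contribute nothing to the loss; only the model's confidence on the response tokens is optimized.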

You can perform SFT with a few hundred examples and scale up to a few hundred thousand. SFT samples must be high quality and directly aligned with the desired model behavior.

##### When to use SFT


Use SFT when you have a well-defined task with clear desired outputs. If you can explicitly state "Given X input, the correct output is Y" and gather examples of such mappings, supervised fine-tuning is a good choice. SFT excels in these scenarios:
+ **Structured or complex classification tasks** – Classify internal documents or contracts into many custom categories. With SFT, the model learns these specific categories better than prompting alone.
+ **Question-answering or transformation tasks with known answers** – Fine-tune a model to answer questions from a company's knowledge base, or convert data between formats where each input has a correct response.
+ **Formatting and style consistency** – Train the model to always respond in a certain format or tone by fine-tuning on examples of the correct format or tone. For instance, training on prompt-response pairs that demonstrate a particular brand voice teaches the model to generate outputs with that style. Instruction-following behavior is often initially taught through SFT on curated examples of good assistant behavior.

SFT is the most direct way to teach an LLM a new skill or behavior when you can specify what the right behavior looks like. It uses the model's existing language understanding and focuses it on your task. Use SFT when you want the model to do a specific thing and you have or can create a dataset of examples.

Use SFT when you can assemble high-quality prompt and response pairs that closely mirror the behavior you want. It fits tasks with clear targets or deterministic formats such as schemas, function or tool calls, and structured answers where imitation is an appropriate training signal. The goal is behavior shaping: teaching the model to treat prompts as tasks, follow instructions, adopt tone and refusal policies, and produce consistent formatting. Plan for at least hundreds of demonstrations, with data quality, consistency, and deduplication mattering more than raw volume. For a straightforward, cost-efficient update, use parameter-efficient methods like Low-Rank Adaptation to train small adapters while leaving most of the backbone untouched.
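For illustration, such prompt and response pairs are commonly stored as JSON Lines; the field names below are generic placeholders, and the exact schema required by your chosen recipe may differ:

```
# Illustrative only: generic prompt/response pairs in JSONL form. The exact
# field names and schema required by your chosen recipe may differ.
import json

pairs = [
    {"input": "Classify this contract clause: ...", "output": "Indemnification"},
    {"input": "Classify this contract clause: ...", "output": "Termination"},
]

with open("sft_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```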

##### When not to use SFT


Don't use SFT when the gap is knowledge rather than behavior. It doesn't teach the model new facts, jargon, or recent events. In those cases, use retrieval-augmented generation to bring external knowledge at inference. Avoid SFT when you can measure quality but can't label a single right answer. Use reinforcement fine-tuning with verifiable rewards or an LLM-as-a-judge to optimize those rewards directly. If your needs or content change frequently, rely on retrieval and tool use rather than retraining the model.

**Topics**
+ [Best Practices for Amazon Nova customization](#best-practices)
+ [Nova Forge SDK](nova-forge-sdk.md)
+ [Training for Amazon Nova models](smtj-training.md)
+ [Monitoring Progress Across Iterations](nova-model-monitor.md)
+ [Evaluating your SageMaker AI-trained model](nova-model-evaluation.md)
+ [Iterative training](smtj-iterative-training.md)

# Nova Forge SDK

The Nova Forge SDK is a comprehensive Python SDK for customizing Amazon Nova models. The SDK provides a unified interface for training, evaluation, monitoring, deployment, and inference of Amazon Nova models across different platforms including SageMaker AI and Amazon Bedrock. Whether you're adapting models to domain-specific tasks or optimizing performance for your use case, this SDK provides everything you need in one unified interface.

## Benefits

+ One SDK for the entire model customization lifecycle—from data preparation to deployment and monitoring.
+ Support for multiple training methods including continued pre-training (CPT), supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement fine-tuning (RFT), both single-turn and multi-turn, with both LoRA and full-rank approaches.
+ Built-in support for SageMaker Training Jobs, SageMaker HyperPod, and Amazon Bedrock, with automatic resource management.
+ No more finding the right recipes or container URI for your training techniques.
+ Bring your own training recipes or use the SDK's intelligent defaults with parameter overrides.
+ The SDK validates your configuration against supported model and instance combinations, preventing errors before training starts.
+ Integrated Amazon CloudWatch monitoring enables you to track training progress in real-time.
+ Integrated MLflow support to track training experiments with SageMaker AI MLflow tracking servers.

## Requirements


The SDK requires at least Python 3.12.

## Installation


To install the SDK, run the following command.

```
pip install amzn-nova-forge
```

## Supported Models and Techniques


The SDK supports the following models and techniques within the Amazon Nova family:



| Method | Supported Models | 
| --- | --- | 
| Continued Pre-training | [All Nova Models](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-recipes.html#nova-model-recipes-reference) (SMHP only) | 
| Supervised Fine-tuning LoRA | [All Nova Models](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-recipes.html#nova-model-recipes-reference) | 
| Supervised Fine-tuning Full-Rank | [All Nova Models](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-recipes.html#nova-model-recipes-reference) (SMHP and SMTJ only) | 
| Direct Preference Optimization LoRA | Nova 1.0 models (SMHP and SMTJ only) | 
| Direct Preference Optimization Full-Rank | Nova 1.0 models (SMHP and SMTJ only) | 
| Reinforcement Fine-tuning LoRA | Nova Lite 2.0 | 
| Reinforcement Fine-tuning Full-Rank | Nova Lite 2.0 (SMHP and SMTJ only) | 
| Multi-turn Reinforcement Fine-tuning LoRA | Nova Lite 2.0 (SMHP Only) | 
| Multi-turn Reinforcement Fine-tuning Full-Rank | Nova Lite 2.0 (SMHP Only) | 

## Getting Started


**Topics**
+ [1. Prepare Your Data](#nova-forge-sdk-prepare-data)
+ [2. Configure Your Infrastructure](#nova-forge-sdk-configure-infrastructure)
+ [3. Train](#nova-forge-sdk-train)
+ [4. Monitor](#nova-forge-sdk-monitor)
+ [5. Evaluate](#nova-forge-sdk-evaluate)
+ [6. Deploy](#nova-forge-sdk-deploy)

### 1. Prepare Your Data


Load your dataset from local files or S3, and let the SDK handle the transformation to the correct format for your chosen training method. Or, provide formatted data and get started immediately.

```
from amzn_nova_forge.dataset.dataset_loader import JSONLDatasetLoader
from amzn_nova_forge.model.model_enums import Model, TrainingMethod, TransformMethod

loader = JSONLDatasetLoader()
loader.load("s3://your-bucket/training-data.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "input", "answer": "output"},
)
```

### 2. Configure Your Infrastructure


Choose your compute resources—the SDK validates configurations and ensures optimal setup.

```
from amzn_nova_forge.manager.runtime_manager import BedrockRuntimeManager, SMTJRuntimeManager, SMHPRuntimeManager

# Bedrock
runtime = BedrockRuntimeManager(
    execution_role="arn:aws:iam::123456789012:role/ExampleRole"
)

# SageMaker Training Jobs
runtime = SMTJRuntimeManager(
    instance_type="ml.p5.48xlarge",
    instance_count=4
)

# SageMaker HyperPod
runtime = SMHPRuntimeManager(
    instance_type="ml.p5.48xlarge",
    instance_count=4,
    cluster_name="my-hyperpod-cluster",
    namespace="kubeflow"
)
```

### 3. Train


Start training with just a few lines of code.

```
from amzn_nova_forge.model import NovaModelCustomizer
from amzn_nova_forge.model.model_enums import Model, TrainingMethod

customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.SFT_LORA,
    infra=runtime,
    data_s3_path="s3://your-bucket/prepared-data.jsonl"
)

result = customizer.train(job_name="my-training-job")
```

### 4. Monitor


Track your training progress directly from the SDK.

```
from amzn_nova_forge.monitor.log_monitor import CloudWatchLogMonitor

# Monitor training logs
customizer.get_logs()

# Or monitor directly via CloudWatchLogMonitor
monitor = CloudWatchLogMonitor.from_job_result(result)
monitor.show_logs(limit=10)

# Check job status
result.get_job_status() # InProgress, Completed, Failed
```

### 5. Evaluate


Evaluate model performance with a variety of [built-in benchmarks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-benchmark), or design your own evaluations.

```
from amzn_nova_forge.recipe_config.eval_config import EvaluationTask

# Evaluate on benchmark tasks
eval_result = customizer.evaluate(
    job_name="model-eval",
    eval_task=EvaluationTask.MMLU,
    model_path=result.model_artifacts.checkpoint_s3_path
)
```

### 6. Deploy


Deploy your customized model to production with built-in support for Amazon Bedrock or SageMaker.

```
from amzn_nova_forge.model.model_enums import DeployPlatform

# Bedrock provisioned throughput
deployment = customizer.deploy(
    model_artifact_path=result.model_artifacts.checkpoint_s3_path,
    deploy_platform=DeployPlatform.BEDROCK_PT,
    pt_units=10
)

# Bedrock On-Demand (LoRA-based customization only)
deployment = customizer.deploy(
    model_artifact_path=result.model_artifacts.checkpoint_s3_path,
    deploy_platform=DeployPlatform.BEDROCK_OD
)

# SageMaker Real-time Inference
deployment = customizer.deploy(
    model_artifact_path=result.model_artifacts.checkpoint_s3_path,
    deploy_platform=DeployPlatform.SAGEMAKER,
    unit_count=10,
    sagemaker_instance_type="ml.p5.48xlarge",
    sagemaker_environment_variables={
        "CONTEXT_LENGTH": "12000",
        "MAX_CONCURRENCY": "16",
    }
)
```

## Key Capabilities


### On-the-Fly Recipe Creation


The SDK eliminates the need to search for the appropriate recipe or container URI for a specific technique.

### Intelligent Data Processing


The SDK automatically transforms your data into the correct format for training. Whether you're working with JSON, JSONL, or CSV files, the data loader handles the conversion seamlessly. The data loader supports text as well as multimodal data (images and videos).
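As a rough illustration of the kind of transformation the data loader performs, the following is a minimal sketch (not the SDK's actual implementation) that converts CSV question/answer pairs into the `bedrock-conversation-2024` JSONL format used throughout this guide:

```python
import csv
import io
import json

def csv_to_converse_jsonl(csv_text, system_prompt="You are a helpful assistant."):
    """Convert CSV rows with 'question' and 'answer' columns into
    bedrock-conversation-2024 JSONL records."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            "schemaVersion": "bedrock-conversation-2024",
            "system": [{"text": system_prompt}],
            "messages": [
                {"role": "user", "content": [{"text": row["question"]}]},
                {"role": "assistant", "content": [{"text": row["answer"]}]},
            ],
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

sample = "question,answer\nWhat country is next to Australia?,New Zealand\n"
print(csv_to_converse_jsonl(sample))
```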

### Enterprise Infrastructure Support


The SDK works with both SageMaker Training Jobs and SageMaker HyperPod, automatically managing:
+ Instance type validation
+ Recipe validation
+ Dataset validation
+ Job orchestration and monitoring

The SDK also supports serverless SageMaker Training Jobs and Bedrock customization.

### Comprehensive evaluation


Evaluate your customized models against [standard benchmarks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-evaluate.html) including:
+ MMLU (Massive Multitask Language Understanding)
+ BBH (Big-Bench Hard, advanced reasoning tasks)
+ GPQA (Graduate-Level Google-Proof Q&A)

Either use the benchmark defaults, or modify them to fit your needs:
+ BYOM (Bring Your Own Metric)
+ BYOD (Bring Your Own Dataset)

### Production Deployment


Deploy your models to Amazon Bedrock or SageMaker AI with options for:
+ **Bedrock Provisioned Throughput** - Dedicated capacity for consistent performance
+ **Bedrock On-Demand (only applicable to LoRA based customization)** - Pay-per-use pricing
+ **SageMaker AI Real-time Inference** - Dedicated capacity for consistent performance

### Batch Inference


Run large-scale inference jobs efficiently:
+ Process thousands of requests in parallel
+ Automatic result aggregation
+ Cost-effective batch processing

### Nova Forge


For Nova Forge subscribers, the SDK supports data mixing recipes.

## Learn More


Ready to start customizing Nova models with the Nova Forge SDK? Check out our GitHub repository for detailed guides, API references, and additional examples: [https://github.com/aws/nova-forge-sdk](https://github.com/aws/nova-forge-sdk)

# Training for Amazon Nova models
Training

Training Amazon Nova models on SageMaker Training Jobs supports Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each technique serves different customization needs and can be applied to different Amazon Nova model versions.

**Topics**
+ [Fine-tune Nova 2.0](nova-fine-tune-2.md)
+ [Reinforcement Fine-Tuning (RFT) with Amazon Nova models](nova-reinforcement-fine-tuning.md)

# Fine-tune Nova 2.0
Supervised fine-tuning (SFT)

## Prerequisites

Before you start a training job, make sure you have the following:
+ Amazon S3 buckets to store the input data and output of your training jobs. You can use one bucket for both or a separate bucket for each type of data. Make sure your buckets are in the same AWS Region where you create all the other resources for training. For more information, see [Creating a general purpose bucket](https://docs.aws.amazon.com//AmazonS3/latest/userguide/create-bucket-overview.html).
+ An IAM role with permissions to run a training job. Make sure you attach an IAM policy with `AmazonSageMakerFullAccess`. For more information, see [How to use SageMaker AI execution roles](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-roles.html).
+ Base Amazon Nova recipes. For more information, see [Getting Amazon Nova recipes](nova-model-recipes.md#nova-model-get-recipes).

## What is SFT?


Supervised fine-tuning (SFT) trains a language model using labeled input-output pairs. The model learns from demonstration examples consisting of prompts and responses, refining its capabilities to align with specific tasks, instructions, or desired behaviors.

## Data preparation


### Overview


Nova 2.0 SFT data uses the same Converse API format as Nova 1.0, with the addition of optional reasoning content fields. For complete format specifications, see:
+ Reasoning content: [ReasoningContentBlock](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ReasoningContentBlock.html)
+ Converse API schema: [Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-call.html)
+ Dataset constraints: [Dataset constraints](https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html)

### Supported features

+ **Input types** – Text, image, or video in user content blocks
+ **Assistant content** – Text-only responses and reasoning content
+ **Dataset composition** – Must be homogeneous. Choose one of:
  + Text-only turns
  + Text + image turns
  + Text + video turns (supports document understanding)

**Important**  
You cannot mix images and videos within the same dataset or across different turns.
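A quick homogeneity check can catch this before you submit a job. The following sketch (a local helper, not part of any SDK) classifies each sample of a Converse-format JSONL dataset and fails on mixed modalities:

```python
import json

def dataset_modality(jsonl_text):
    """Verify a Converse-format dataset is homogeneous (text-only,
    text+image, or text+video) and return its modality."""
    modalities = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Collect which media kinds appear in this sample's content blocks
        kinds = {
            key
            for message in record["messages"]
            for block in message["content"]
            for key in ("image", "video")
            if key in block
        }
        if len(kinds) > 1:
            raise ValueError("sample mixes images and videos")
        modalities.add(next(iter(kinds), "text"))
    if len(modalities) > 1:
        raise ValueError(f"dataset mixes modalities: {sorted(modalities)}")
    return modalities.pop()
```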

### Current limitations

+ **Multimodal reasoning content** – Although the Converse format supports image-based reasoning content, Nova 2.0 SFT supports only text-based reasoning content in the `reasoningText` field.
+ **Validation sets** – You cannot provide a validation dataset for SFT with Nova 2.0. If you provide a validation dataset, it is ignored during training. This limitation applies to both UI-based and programmatic job submissions.

### Supported media formats

+ **Images** – PNG, JPEG, GIF
+ **Videos** – MOV, MKV, MP4

### Data format examples


------
#### [ Text-only (Nova 1.0 compatible) ]

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "system": [  
    {  
      "text": "You are a digital assistant with a friendly personality"  
    }  
  ],  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "text": "What country is right next to Australia?"  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "text": "The closest country is New Zealand"  
        }  
      ]  
    }  
  ]  
}
```

------
#### [ Text with reasoning (Nova 2.0) ]

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "system": [  
    {  
      "text": "You are a digital assistant with a friendly personality"  
    }  
  ],  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "text": "What country is right next to Australia?"  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "reasoningContent": {  
            "reasoningText": {  
              "text": "I need to use my world knowledge of geography to answer this question"  
            }  
          }  
        },  
        {  
          "text": "The closest country to Australia is New Zealand, located to the southeast across the Tasman Sea."  
        }  
      ]  
    }  
  ]  
}
```

------
#### [ Image + text input ]

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "system": [  
    {  
      "text": "You are a helpful assistant."  
    }  
  ],  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "image": {  
            "format": "jpeg",  
            "source": {  
              "s3Location": {  
                "uri": "s3://your-bucket/your-path/your-image.jpg",  
                "bucketOwner": "your-aws-account-id"  
              }  
            }  
          }  
        },  
        {  
          "text": "Which country is highlighted in the image?"  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "reasoningContent": {  
            "reasoningText": {  
              "text": "I will determine the highlighted country by examining its location on the map and using my geographical knowledge"  
            }  
          }  
        },  
        {  
          "text": "The highlighted country is New Zealand"  
        }  
      ]  
    }  
  ]  
}
```

------
#### [ Video + text input ]

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "system": [  
    {  
      "text": "You are a helpful assistant."  
    }  
  ],  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "video": {  
            "format": "mp4",  
            "source": {  
              "s3Location": {  
                "uri": "s3://your-bucket/your-path/your-video.mp4",  
                "bucketOwner": "your-aws-account-id"  
              }  
            }  
          }  
        },  
        {  
          "text": "What is shown in this video?"  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "reasoningContent": {  
            "reasoningText": {  
              "text": "I will analyze the video content to identify key elements"  
            }  
          }  
        },  
        {  
          "text": "The video shows a map with New Zealand highlighted"  
        }  
      ]  
    }  
  ]  
}
```

------

## Tool calling


Nova 2.0 SFT supports training models on tool calling patterns, enabling your model to learn when and how to invoke external tools or functions.

### Data format for tool calling


Tool calling training data includes a `toolConfig` section that defines available tools, along with conversation turns that demonstrate tool usage patterns.

**Sample input**

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "system": [  
    {  
      "text": "You are an expert in composing function calls."  
    }  
  ],  
  "toolConfig": {  
    "tools": [  
      {  
        "toolSpec": {  
          "name": "getItemCost",  
          "description": "Retrieve the cost of an item from the catalog",  
          "inputSchema": {  
            "json": {  
              "type": "object",  
              "properties": {  
                "item_name": {  
                  "type": "string",  
                  "description": "The name of the item to retrieve cost for"  
                },  
                "item_id": {  
                  "type": "string",  
                  "description": "The ASIN of item to retrieve cost for"  
                }  
              },  
              "required": [  
                "item_id"  
              ]  
            }  
          }  
        }  
      },  
      {  
        "toolSpec": {  
          "name": "getItemAvailability",  
          "description": "Retrieve whether an item is available in a given location",  
          "inputSchema": {  
            "json": {  
              "type": "object",  
              "properties": {  
                "zipcode": {  
                  "type": "string",  
                  "description": "The zipcode of the location to check in"  
                },  
                "quantity": {  
                  "type": "integer",  
                  "description": "The number of items to check availability for"  
                },  
                "item_id": {  
                  "type": "string",  
                  "description": "The ASIN of item to check availability for"  
                }  
              },  
              "required": [  
                "item_id", "zipcode"  
              ]  
            }  
          }  
        }  
      }  
    ]  
  },  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "text": "I need to check whether there are twenty pieces of the following item available. Here is the item ASIN on Amazon: id-123. Please check for the zipcode 94086"  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "reasoningContent": {  
            "reasoningText": {  
              "text": "The user wants to check how many pieces of the item with ASIN id-123 are available in the zipcode 94086"  
            }  
          }  
        },  
        {  
          "toolUse": {  
            "toolUseId": "getItemAvailability_0",  
            "name": "getItemAvailability",  
            "input": {  
              "zipcode": "94086",  
              "quantity": 20,  
              "item_id": "id-123"  
            }  
          }  
        }  
      ]  
    },  
    {  
      "role": "user",  
      "content": [  
        {  
          "toolResult": {  
            "toolUseId": "getItemAvailability_0",  
            "content": [  
              {  
                "text": "[{\"name\": \"getItemAvailability\", \"results\": {\"availability\": true}}]"  
              }  
            ]  
          }  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {  
          "text": "Yes, there are twenty pieces of item id-123 available at 94086. Would you like to place an order or know the total cost?"  
        }  
      ]  
    }  
  ]  
}
```

### Tool calling requirements


When creating tool calling training data, follow these requirements:


| Requirement | Description | 
| --- | --- | 
| ToolUse placement | ToolUse must appear in assistant turns only | 
| ToolResult placement | ToolResult must appear in user turns only | 
| ToolResult format | ToolResult should be text or JSON only. Other modalities are not supported for Nova models | 
| inputSchema format | The inputSchema within the toolSpec must be a valid JSON Schema object | 
| toolUseId matching | Each ToolResult must reference a valid toolUseId from a preceding assistant ToolUse, with each toolUseId used exactly once per conversation | 
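These placement and pairing rules can be checked mechanically. The following sketch (a local helper, not part of any SDK) validates one training record against the table above:

```python
def validate_tool_turns(record):
    """Check ToolUse/ToolResult placement and toolUseId pairing
    for one tool-calling training record."""
    pending = set()  # toolUseIds emitted by the assistant, not yet answered
    used = set()     # toolUseIds already consumed by a ToolResult
    for message in record["messages"]:
        for block in message["content"]:
            if "toolUse" in block:
                assert message["role"] == "assistant", "toolUse must be in an assistant turn"
                pending.add(block["toolUse"]["toolUseId"])
            if "toolResult" in block:
                assert message["role"] == "user", "toolResult must be in a user turn"
                tool_use_id = block["toolResult"]["toolUseId"]
                assert tool_use_id in pending, "toolResult references unknown toolUseId"
                assert tool_use_id not in used, "toolUseId used more than once"
                used.add(tool_use_id)
    return True
```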

### Important notes

+ Ensure your tool definitions are consistent across all training samples
+ The model learns tool invocation patterns from the demonstrations you provide
+ Include diverse examples of when to use each tool and when not to use tools

## Document understanding


Nova 2.0 SFT supports training on document-based tasks, enabling your model to learn how to analyze and respond to questions about PDF documents.

### Data format for document understanding


Document understanding training data includes document references in the user content blocks, with the model learning to extract and reason over document content.

**Sample input**

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"  
        },  
        {  
          "document": {  
            "format": "pdf",  
            "source": {  
              "s3Location": {  
                "uri": "s3://my-bucket-name/path/to/documents/customer_service_debugging.pdf",  
                "bucketOwner": "123456789012"  
              }  
            }  
          }  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {
          "reasoningContent": {  
            "reasoningText": {  
              "text": "I need to find the relevant section in the document to answer the question."  
            }  
          }
        },
        {  
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"  
        }   
      ]
    }  
  ]  
}
```

### Document understanding limitations



| Limitation | Details | 
| --- | --- | 
| Supported format | PDF files only | 
| Maximum document size | 10 MB | 
| Modality mixing | A sample can have documents and text, but cannot have documents mixed with other modalities (images, videos) | 

### Best practices for document understanding

+ Ensure documents are clearly formatted and text is extractable
+ Provide diverse examples covering different document types and question formats
+ Include reasoning content to help the model learn document analysis patterns

## Video understanding


Nova 2.0 SFT supports training on video-based tasks, enabling your model to learn how to analyze and respond to questions about video content.

### Data format for video understanding


Video understanding training data includes video references in the user content blocks, with the model learning to extract information and reason over video content.

**Sample input**

```
{  
  "schemaVersion": "bedrock-conversation-2024",  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"  
        },  
        {  
          "video": {  
            "format": "mp4",  
            "source": {  
              "s3Location": {  
                "uri": "s3://my-bucket-name/path/to/videos/customer_service_debugging.mp4",  
                "bucketOwner": "123456789012"  
              }  
            }  
          }  
        }  
      ]  
    },  
    {  
      "role": "assistant",  
      "content": [  
        {
          "reasoningContent": {  
            "reasoningText": {  
              "text": "I need to find the relevant section in the video to answer the question."  
            }  
          }
        },
        {  
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"  
        }   
      ]  
    }  
  ]  
}
```

### Video understanding limitations



| Limitation | Details | 
| --- | --- | 
| Maximum video size | 50 MB | 
| Maximum video duration | 15 minutes | 
| Videos per sample | Only one video is allowed per sample. Multiple videos in the same sample are not supported | 
| Modality mixing | A sample can have video and text, but cannot have video combined with other modalities (images, documents) | 

### Supported video formats

+ MOV
+ MKV
+ MP4
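Before uploading, you can screen samples against these constraints locally. The following is a sketch (a local helper, not part of any SDK); in practice the size would come from S3 object metadata, and duration checking would require probing the file, so it is omitted here:

```python
MAX_VIDEO_BYTES = 50 * 1024 * 1024      # 50 MB limit from the table above
ALLOWED_FORMATS = {"mov", "mkv", "mp4"}

def check_video_sample(record, video_size_bytes):
    """Apply the documented video constraints to one training record."""
    videos = [
        block["video"]
        for message in record["messages"]
        for block in message["content"]
        if "video" in block
    ]
    assert len(videos) <= 1, "only one video is allowed per sample"
    for video in videos:
        assert video["format"] in ALLOWED_FORMATS, f"unsupported format: {video['format']}"
        assert video_size_bytes <= MAX_VIDEO_BYTES, "video exceeds 50 MB"
        # Video cannot be combined with other modalities in the same sample
        assert not any(
            "image" in block or "document" in block
            for message in record["messages"]
            for block in message["content"]
        ), "video cannot be combined with images or documents"
    return True
```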

### Best practices for video understanding

+ Keep videos concise and focused on the content relevant to your task
+ Ensure video quality is sufficient for the model to extract meaningful information
+ Provide clear questions that reference specific aspects of the video content
+ Include diverse examples covering different video types and question formats

## Reasoning vs non-reasoning modes


### Understanding reasoning content


Reasoning content (also called chain-of-thought) captures the model's intermediate thinking steps before generating a final answer. In the `assistant` turn, use the `reasoningContent` field to include these reasoning traces.

**How loss is calculated**
+ **With reasoning content** – Training loss includes both reasoning tokens and final output tokens
+ **Without reasoning content** – Training loss is calculated only on the final output tokens

You can include `reasoningContent` across multiple assistant turns in multi-turn conversations.

**Formatting guidelines**
+ Use plain text for reasoning content
+ Avoid markup tags like `<thinking>` and `</thinking>` unless specifically required by your task
+ Ensure reasoning content is clear and relevant to the problem-solving process

### When to enable reasoning mode


Set `reasoning_enabled: true` in your training configuration when:
+ Your training data has reasoning tokens
+ You want the model to generate thinking tokens before producing final outputs
+ You need improved performance on complex reasoning tasks

Training Nova on a non-reasoning dataset with `reasoning_enabled = true` is permitted. However, doing so may cause the model to lose its reasoning capabilities, as Nova primarily learns to generate the responses presented in the data without applying reasoning. If you want to train Nova on a non-reasoning dataset but still expect reasoning during inference, you can disable reasoning during training (`reasoning_enabled = false`) but enable it for inference. While this approach allows reasoning to be used at inference time, it does not guarantee improved performance compared to inference without reasoning. In general, enable reasoning for both training and inference when using reasoning datasets, and disable it for both when using non-reasoning datasets.

Set `reasoning_enabled: false` when:
+ Your training data does not have reasoning tokens
+ You're training on straightforward tasks that don't benefit from explicit reasoning steps
+ You want to optimize for speed and reduce token usage

### Generating reasoning data


If your dataset lacks reasoning traces, you can create them using a reasoning-capable model like Nova Premier. Provide your input-output pairs to the model and capture its reasoning process to build a reasoning-augmented dataset.
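Once you have captured a reasoning trace from such a model (the model call itself is omitted here), assembling the augmented record is straightforward. A sketch, with the helper name being illustrative:

```python
def build_reasoning_record(question, reasoning_text, answer,
                           system_prompt="You are a helpful assistant."):
    """Assemble a reasoning-augmented SFT record in Converse format.
    reasoning_text would typically be captured from a reasoning-capable
    model such as Nova Premier."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_prompt}],
        "messages": [
            {"role": "user", "content": [{"text": question}]},
            {
                "role": "assistant",
                "content": [
                    # Reasoning trace goes first, then the final answer
                    {"reasoningContent": {"reasoningText": {"text": reasoning_text}}},
                    {"text": answer},
                ],
            },
        ],
    }
```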

### Using reasoning tokens for training


When training with reasoning mode enabled, the model learns to separate internal reasoning from the final answer. The training process:
+ Organizes data as triples: input, reasoning, and answer
+ Optimizes using standard next-token prediction loss from both reasoning and answer tokens
+ Encourages the model to reason internally before generating responses

### Effective reasoning content


High-quality reasoning content should include:
+ Intermediate thoughts and analysis
+ Logical deductions and inference steps
+ Step-by-step problem-solving approaches
+ Explicit connections between steps and conclusions

This helps the model develop the ability to "think before answering."

## Dataset preparation guidelines


### Size and quality

+ **Recommended size** – 2,000-10,000 samples
+ **Minimum samples** – 200
+ **Priority** – Quality over quantity. Ensure examples are accurate and well-annotated
+ **Application alignment** – Dataset should closely reflect your production use cases

### Diversity


Include diverse examples that:
+ Cover the full range of expected inputs
+ Represent different difficulty levels
+ Include edge cases and variations
+ Prevent overfitting to narrow patterns

### Output formatting


Clearly specify the desired output format in assistant responses:
+ JSON structures
+ Tables
+ CSV format
+ Custom formats specific to your application

### Multi-turn conversations


For multi-turn datasets, remember:
+ Loss is calculated only on assistant turns, not user turns
+ Each assistant response should be properly formatted
+ Maintain consistency across conversation turns

### Quality checklist

+ Sufficient dataset size (2K-10K samples)
+ Diverse examples covering all use cases
+ Clear, consistent output formatting
+ Accurate labels and annotations
+ Representative of production scenarios
+ Free from contradictions or ambiguities

### Uploading your data


Datasets should be uploaded to a bucket that can be accessed by SageMaker training jobs. For information about setting the right permissions, see [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-general-prerequisites.html).

## Starting a training job


### Selecting hyperparameters and updating the recipe


The setup for Nova 2.0 is largely the same as for Nova 1.0. After you upload the input data to Amazon S3, use the recipe from the fine-tuning folder of [SageMaker HyperPod Recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova). The following example shows the Nova 2.0 SFT PEFT recipe, including the key hyperparameters that you can update based on your use case. For the container image URI, use `708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-SFT-V2-latest` to run an SFT fine-tuning job.

Use v2.254.1 of the SageMaker Python SDK for strict compatibility with Nova training. Upgrading the SDK to v3.0 will result in breaking changes. Support for v3 of the SageMaker Python SDK is coming soon.

**Sample input**

```
!pip install sagemaker==2.254.1
```

```
run:  
  name: {peft_recipe_job_name}  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: {peft_model_name_or_path}  
  data_s3_path: {train_dataset_s3_path}  # Training data S3 path (SageMaker HyperPod only; not compatible with SageMaker training jobs)  
  replicas: 4                      # Number of compute instances for training, allowed values are 4, 8, 16, 32  
  output_s3_path: ""               # Output artifact S3 path (SageMaker HyperPod only; not compatible with SageMaker training jobs)  
  
training_config:  
  max_steps: 10                    # Maximum training steps. Minimum is 4.  
  save_steps: 10                   # Save a checkpoint every N training steps. Must be less than or equal to max_steps  
  save_top_k: 1                    # Keep top K best checkpoints. Supported only for SageMaker HyperPod jobs. Minimum is 1.  
  max_length: 32768                # Sequence length (options: 8192, 16384, 32768 [default], 65536)  
  global_batch_size: 32            # Global batch size (options: 32, 64, 128)  
  reasoning_enabled: true          # If data has reasoningContent, set to true; otherwise false  
  
  lr_scheduler:  
    warmup_steps: 15               # Learning rate warmup steps. Recommend 15% of max_steps  
    min_lr: 1e-6                   # Minimum learning rate, must be between 0.0 and 1.0  
  
  optim_config:                    # Optimizer settings  
    lr: 1e-5                       # Learning rate, must be between 0.0 and 1.0  
    weight_decay: 0.0              # L2 regularization strength, must be between 0.0 and 1.0  
    adam_beta1: 0.9                # Exponential decay rate for first-moment estimates, must be between 0.0 and 1.0  
    adam_beta2: 0.95               # Exponential decay rate for second-moment estimates, must be between 0.0 and 1.0  
  
  peft:                            # Parameter-efficient fine-tuning (LoRA)  
    peft_scheme: "lora"            # Enable LoRA for PEFT  
    lora_tuning:  
      alpha: 64                    # Scaling factor for LoRA weights (options: 32, 64, 96, 128, 160, 192)  
      lora_plus_lr_ratio: 64.0
```

The recipe also contains largely the same hyperparameters as Nova 1.0. The notable hyperparameters are:
+ `max_steps` – The number of steps you want to run the job for. Generally, for one epoch (one run through your entire dataset), the number of steps = number of data samples / global batch size. The larger the number of steps and the smaller your global batch size, the longer the job will take to run.
+ `reasoning_enabled` – Controls reasoning mode for your dataset. Options:
  + `true`: Enables reasoning mode (equivalent to high reasoning)
  + `false`: Disables reasoning mode

  Note: For SFT, there is no granular control over reasoning effort levels. Setting `reasoning_enabled: true` enables full reasoning capability.
+ `peft.peft_scheme` – Setting this to "lora" enables PEFT-based fine tuning. Setting it to null (no quotes) enables Full-Rank fine tuning.
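The relationship between dataset size, batch size, and `max_steps` (with warmup at roughly 15% of `max_steps`, as the recipe comments recommend) can be worked out as follows; the helper below is illustrative, not part of any SDK:

```python
def sft_step_plan(num_samples, global_batch_size, epochs=1):
    """Estimate max_steps and warmup_steps for the recipe above.
    One epoch is roughly num_samples / global_batch_size steps."""
    steps_per_epoch = -(-num_samples // global_batch_size)  # ceiling division
    max_steps = max(4, steps_per_epoch * epochs)            # recipe minimum is 4
    warmup_steps = max(1, round(0.15 * max_steps))          # ~15% of max_steps
    return {"max_steps": max_steps, "warmup_steps": warmup_steps}

# e.g. 3,200 samples at global batch size 32 for one epoch
print(sft_step_plan(3200, 32))  # {'max_steps': 100, 'warmup_steps': 15}
```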

### Start the training job


```
from sagemaker.pytorch import PyTorch  
  
# define OutputDataConfig path  
if default_prefix:  
    output_path = f"s3://{bucket_name}/{default_prefix}/{sm_training_job_name}"  
else:  
    output_path = f"s3://{bucket_name}/{sm_training_job_name}"  

output_kms_key = "<KMS key arn to encrypt trained model in Amazon-owned S3 bucket>" # optional, leave blank for Amazon managed encryption
  
recipe_overrides = {  
    "run": {  
        "replicas": instance_count,  # Required  
        "output_s3_path": output_path  
    },  
}  
  
estimator = PyTorch(  
    output_path=output_path,  
    base_job_name=sm_training_job_name,  
    role=role,  
    disable_profiler=True,  
    debugger_hook_config=False,  
    instance_count=instance_count,  
    instance_type=instance_type,  
    training_recipe=training_recipe,  
    recipe_overrides=recipe_overrides,  
    max_run=432000,  
    sagemaker_session=sagemaker_session,  
    image_uri=image_uri,
    output_kms_key=output_kms_key,
    tags=[  
        {'Key': 'model_name_or_path', 'Value': model_name_or_path},  
    ]  
)  
  
print(f"\nsm_training_job_name:\n{sm_training_job_name}\n")  
print(f"output_path:\n{output_path}")
```

```
from sagemaker.inputs import TrainingInput  
  
train_input = TrainingInput(  
    s3_data=train_dataset_s3_path,  
    distribution="FullyReplicated",  
    s3_data_type="Converse",  
)  
  
estimator.fit(inputs={"train": train_input}, wait=False)
```

**Note**  
Passing a validation dataset is not supported for supervised fine tuning of Nova 2.0.

To kick off the job:
+ Update the recipe with your dataset paths and hyperparameters
+ Execute the specified cells in the notebook to submit the training job

The notebook handles job submission and provides status tracking.

# Preparing data for multimodal fine-tuning


The following are guidelines and requirements for preparing data for fine-tuning Understanding models:

1. The minimum data size for fine-tuning depends on the complexity of the task, but we recommend at least 100 samples for each task you want the model to learn.

1. We recommend using your optimized prompt in a zero-shot setting during both training and inference to achieve the best results.

1. Training and validation datasets must be JSONL files, where each line is a JSON object corresponding to a record. File names can consist of only alphanumeric characters, underscores, hyphens, slashes, and dots.

1. Image and video constraints

   1. Dataset can't contain different media modalities. That is, the dataset can either be text with images or text with videos.

   1. One sample (single record in messages) can have multiple images

   1. One sample (single record in messages) can have only 1 video

1. `schemaVersion` can be any string value

1. The (*optional*) `system` turn can be a customer-provided custom system prompt.

1. Supported roles are `user` and `assistant`.

1. The first turn in `messages` should always start with `"role": "user"`. The last turn is the bot's response, denoted by `"role": "assistant"`.

1. The `image.source.s3Location.uri` and `video.source.s3Location.uri` must be accessible to Amazon Bedrock.

1.  Your Amazon Bedrock service role must be able to access the image files in Amazon S3. For more information about granting access, see [Create a service role for model customization](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-iam-role.html) 

1. The images or videos must be in the same Amazon S3 bucket as your dataset. For example, if your dataset is in `s3://amzn-s3-demo-bucket/train/train.jsonl`, then your images or videos must be in `s3://amzn-s3-demo-bucket`

1. The terms `User:`, `Bot:`, `Assistant:`, `System:`, `<image>`, `<video>`, and `[EOS]` are reserved keywords. If a user prompt or system prompt starts with any of these keywords, or contains any of them anywhere, your training job will fail due to data issues. If you need these keywords for your use case, substitute a different keyword with a similar meaning so that your training can proceed.

**Note**  
To validate your dataset before submitting a fine-tuning job, you can use the [dataset validation script](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/bedrock-finetuning/understanding/dataset_validation) available on GitHub.
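For reference, the kinds of checks that validation performs can be sketched in a few lines of Python. This is an illustrative sketch, not the official validator; it covers only the turn-order, role, and reserved-keyword rules above:

```python
import json

# Reserved keywords that must not appear in any prompt text.
RESERVED = ["User:", "Bot:", "Assistant:", "System:", "<image>", "<video>", "[EOS]"]

def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL record (empty list = OK)."""
    problems = []
    record = json.loads(line)
    messages = record.get("messages", [])
    if not messages:
        problems.append("record has no messages")
        return problems
    if messages[0].get("role") != "user":
        problems.append("first turn must have role 'user'")
    if messages[-1].get("role") != "assistant":
        problems.append("last turn must have role 'assistant'")
    for turn in messages:
        if turn.get("role") not in ("user", "assistant"):
            problems.append(f"unsupported role: {turn.get('role')}")
        for part in turn.get("content", []):
            text = part.get("text", "")
            if any(kw in text for kw in RESERVED):
                problems.append(f"reserved keyword found in: {text[:40]!r}")
    return problems
```

Running every line of your JSONL file through a check like this before submitting a job catches data issues early, before they fail a training run.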

**Topics**
+ [Example dataset formats](#customize-fine-tune-examples)
+ [Dataset constraints](#custom-fine-tune-constraints)

## Example dataset formats


The following example dataset formats provide a guide for you to follow.

### Text-only custom fine tuning format


The following example is for custom fine tuning over text only.

```
// train.jsonl
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a digital assistant with a friendly personality"
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the capital of Mars?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Mars does not have a capital. Perhaps it will one day."
        }
      ]
    }
  ]
}
```

### Single image custom fine tuning format


The following example is for custom fine tuning over text and a single image.

```
// train.jsonl
{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{
        "text": "You are a smart assistant that answers questions respectfully"
    }],
    "messages": [{
            "role": "user",
            "content": [{
                    "text": "What does the text in this image say?"
                },
                {
                    "image": {
                        "format": "png",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.png",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": [{
                "text": "The text in the attached image says 'LOL'."
            }]
        }
    ]
}
```

### Video custom fine tuning format


The following example is for custom fine tuning over text and video.

```
{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{
        "text": "You are a helpful assistant designed to answer questions crisply and to the point"
    }],
    "messages": [{
            "role": "user",
            "content": [{
                    "text": "How many white items are visible in this video?"
                },
                {
                    "video": {
                        "format": "mp4",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-video.mp4",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": [{
                "text": "There are at least eight visible items that are white"
            }]
        }
    ]
}
```

## Dataset constraints


Amazon Nova applies the following constraints on model customizations for Understanding models.


| Model | Minimum samples | Maximum samples | Context length (tokens) | 
| --- |--- |--- |--- |
| Nova 2 Lite | 8 | 20k | 32k | 


**Image and video constraints**  

| Constraint | Limit | 
| --- |--- |
| Maximum images | 10/sample | 
| Maximum image file size | 10 MB | 
| Maximum videos | 1/sample | 
| Maximum video length/duration | 90 seconds | 
| Maximum video file size | 50 MB | 

**Supported media formats**
+ Image - `png`, `jpeg`, `gif`, `webp`
+ Video - `mov`, `mkv`, `mp4`, `webm`

# Reinforcement Fine-Tuning (RFT) with Amazon Nova models
Reinforcement Fine-Tuning (RFT)

## Overview


**What is RFT?**

Reinforcement fine-tuning (RFT) improves model performance by training on feedback signals—measurable scores or rewards indicating how well the model performed—rather than exact correct answers. Unlike supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards. This approach excels when defining the exact correct output is challenging, but you can reliably measure response quality.

**When to use RFT**

Use RFT when you can define clear, measurable success criteria but struggle to provide exact correct outputs for training. It's ideal for:
+ Tasks where quality is subjective or multifaceted (creative writing, code optimization, complex reasoning)
+ Scenarios with multiple valid solutions where some are clearly better than others
+ Applications requiring iterative improvement, personalization, or adherence to complex business rules
+ Cases where collecting high-quality labeled examples is expensive or impractical

**Best use cases**

RFT excels in domains where output quality can be objectively measured but optimal responses are difficult to define upfront:
+ Mathematical problem-solving and code generation
+ Scientific reasoning and structured data analysis
+ Tasks requiring step-by-step reasoning or multi-turn problem solving
+ Applications balancing multiple objectives (accuracy, efficiency, style)
+ Scenarios where success can be verified programmatically through execution results or performance metrics

**Supported models**

Nova Lite 2.0

## Data format overview


RFT training data must follow the OpenAI Reinforcement Fine-Tuning [format](https://platform.openai.com/docs/api-reference/fine-tuning/reinforcement-input). Each training example is a JSON object containing:
+ A `messages` array with conversational turns using `system` and `user` roles
+ A `reference_answer` field containing the expected output or evaluation criteria for reward calculation

**Current limitations**
+ Text only

### Data format examples


Each example should be on a single line in your JSONL file, with one JSON object per line.

------
#### [ Chemistry problem ]

```
{
  "id": "chem-01",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful chemistry assistant"
    },
    {
      "role": "user",
      "content": "Calculate the molecular weight of caffeine (C8H10N4O2)"
    }
  ],
  "reference_answer": {
    "molecular_weight": 194.19,
    "unit": "g/mol",
    "calculation": "8(12.01) + 10(1.008) + 4(14.01) + 2(16.00) = 194.19"
  }
}
```

------
#### [ Math problem ]

```
{
  "id": "sample-001",  // Optional
  "messages": [
    {
      "role": "system",
      "content": "You are a math tutor"
    },
    {
      "role": "user",
      "content": "Solve: 2x + 5 = 13"
    }
  ],
  "reference_answer": {
    "solution": "x = 4",
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]
  }
}
```

------
#### [ Code problem ]

```
{
  "id": "code-002",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful programming assistant"
    },
    {
      "role": "user",
      "content": "Write a Python function that reverses a string without using built-in reverse methods"
    }
  ],
  "reference_answer": {
    "code": "def reverse_string(s):  \n    result = ''  \n    for i in range(len(s) - 1, -1, -1):  \n        result += s[i]  \n    return result",
    "test_cases": [
      {
        "input": "hello",
        "expected_output": "olleh"
      },
      {
        "input": "",
        "expected_output": ""
      },
      {
        "input": "a",
        "expected_output": "a"
      },
      {
        "input": "Python123",
        "expected_output": "321nohtyP"
      }
    ],
    "all_tests_pass": true
  }
}
```

------

The `reference_answer` field contains the expected output or evaluation criteria that your reward function uses to score the model's response. It is not limited to structured outputs—it can contain any format that helps your reward function evaluate quality.
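For example, for the code problem above, a reward function could execute the model's response against the `test_cases` in `reference_answer` and return the pass fraction as the reward. This is an illustrative sketch only (the `reverse_string` function name comes from the example dataset, and a real reward function should sandbox untrusted model output rather than call `exec` directly):

```python
def score_code_response(model_code: str, reference_answer: dict) -> float:
    """Run candidate code against reference test cases; reward = pass fraction.

    Illustrative only: executing untrusted model output with exec() is shown
    for brevity; a production reward function should sandbox it.
    """
    namespace = {}
    try:
        exec(model_code, namespace)          # define the candidate function
        fn = namespace["reverse_string"]     # function name from the example
    except Exception:
        return 0.0  # unparseable code or missing function earns no reward
    passed = 0
    cases = reference_answer["test_cases"]
    for case in cases:
        try:
            if fn(case["input"]) == case["expected_output"]:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(cases)
```

A fractional reward like this gives the optimizer a smoother signal than a binary pass/fail, which helps when early responses only pass some tests.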

## Dataset size recommendations


**Starting point**
+ Minimum 100 training examples
+ Minimum 100 evaluation examples

**Evaluation-first approach**

Before investing in large-scale RFT training, evaluate your model's baseline performance:
+ **High performance (>95% reward)** – RFT may be unnecessary—your model already performs well
+ **Very poor performance (0% reward)** – Switch to SFT first to establish basic capabilities
+ **Moderate performance** – RFT is likely appropriate

Starting with a small dataset allows you to:
+ Validate your reward function is bug-free
+ Confirm RFT is the right approach for your use case
+ Identify and fix issues early
+ Test the workflow before scaling up

Once validated, you can expand to larger datasets to further improve performance.
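The evaluation-first guidance above reduces to a simple rule of thumb. The helper below is an illustrative sketch, with thresholds taken directly from that guidance:

```python
def recommend_approach(baseline_rewards: list) -> str:
    """Suggest a customization approach from baseline per-example rewards (0-1)."""
    mean = sum(baseline_rewards) / len(baseline_rewards)
    if mean > 0.95:
        return "skip"       # model already performs well; RFT may be unnecessary
    if mean == 0.0:
        return "sft-first"  # establish basic capabilities with SFT before RFT
    return "rft"            # moderate performance: RFT is likely appropriate
```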

## Characteristics of effective training data


**Clarity and consistency**

Good RFT examples require clear, unambiguous input data that enables accurate reward calculation across different model outputs. Avoid noise in your data, including:
+ Inconsistent formatting
+ Contradictory labels or instructions
+ Ambiguous prompts
+ Conflicting reference answers

Any ambiguity will mislead the training process and cause the model to learn unintended behaviors.

**Diversity**

Your dataset should capture the full diversity of production use cases to ensure robust real-world performance:
+ Include different input formats and edge cases
+ Map actual production usage patterns from logs and user analytics
+ Sample across user types, geographic regions, and seasonal variations
+ Cover difficulty levels from simple to complex problems

**Reward function considerations**

Design your reward function for efficient training:
+ Execute within seconds (not minutes)
+ Parallelize effectively with Lambda
+ Return consistent, reliable scores
+ Handle different types of model outputs gracefully

Fast, scalable reward functions enable rapid iteration and cost-effective experimentation.

## Additional properties


The RFT data format supports custom fields beyond the core schema requirements (`messages` and `reference_answer`). This flexibility lets you add any additional data your reward function needs for proper evaluation.

**Note**  
You don't need to configure this in your recipe—the data format inherently supports additional fields. Simply include them in your training data JSON, and they will be passed to your reward function in the `metadata` field.

**Common additional properties**

Example metadata fields:
+ `task_id` – Unique identifier for tracking
+ `difficulty_level` – Problem complexity indicator
+ `domain` – Subject area or category
+ `expected_reasoning_steps` – Number of steps in solution

**Example with additional properties**

```
{
  "messages": [
    {
      "role": "system",
      "content": "You are a math tutor"
    },
    {
      "role": "user",
      "content": "Solve: 2x + 5 = 13"
    }
  ],
  "reference_answer": {
    "solution": "x = 4",
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]
  },
  "task_id": "algebra_001",
  "difficulty_level": "easy",
  "domain": "algebra",
  "expected_reasoning_steps": 3
}
```

These additional fields are passed to your reward function during evaluation, enabling sophisticated scoring logic tailored to your specific use case.
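A reward Lambda might consume these fields as sketched below. The event payload shape shown here is an assumption for illustration only; consult the reward-function documentation for the exact payload your Lambda receives:

```python
import json

def lambda_handler(event, context):
    """Reward-function sketch: score responses using reference_answer plus metadata.

    The event shape here (a "samples" list with model_response, reference_answer,
    and metadata) is assumed for illustration, not a documented contract.
    """
    results = []
    for sample in event["samples"]:
        reference = sample["reference_answer"]
        metadata = sample.get("metadata", {})
        response = sample["model_response"]
        reward = 1.0 if reference["solution"] in response else 0.0
        # Metadata can shape the score, e.g. partial credit on hard problems.
        if metadata.get("difficulty_level") == "hard" and reward == 0.0:
            reward = 0.1 if "x =" in response else 0.0
        results.append({"reward": reward})
    return {"statusCode": 200, "body": json.dumps(results)}
```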

## Training configuration


**Sample recipe**

```
# Note:
# This recipe can run on p5.48xlarge and p5en.48xlarge instance types.
run:
  name: "my-rft-run"                           # Unique run name (appears in logs/artifacts).
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://<bucket>/<data file>      # Training dataset in JSONL;
  replicas: 4
  reward_lambda_arn: arn:aws:lambda:<region>:<account-id>:function:<function-name>

  ## MLflow configs
  mlflow_tracking_uri: ""                     # Required to enable MLflow logging (MLflow App ARN)
  mlflow_experiment_name: "my-rft-experiment" # Optional; if provided, must be non-empty
  mlflow_run_name: "my-rft-run"               # Optional; if provided, must be non-empty

## SMTJ GRPO Training specific configs
training_config:
  max_length: 8192                              # Context window (tokens) for inputs+prompt;
  global_batch_size: 16                         # Total samples per optimizer step across all replicas (16/32/64/128/256).
  reasoning_effort: high                        # Enables reasoning mode high / low / or null for non-reasoning

  rollout:                                      # How responses are generated for GRPO/advantage calc.
    advantage_strategy:
      number_generation: 2                      # N samples per prompt to estimate advantages (variance vs cost).
    generator:
      max_new_tokens: 6000                      # Cap on tokens generated per sample
      set_random_seed: true                     # Seed generation for reproducibility across runs.
      temperature: 1                            # Softmax temperature;
      top_k: 1                                  # Sample only from top-K logits
    rewards:
      preset_reward_function: null              # Usage of preset reward functions [exact_match]
      api_endpoint:
        lambda_arn: arn:aws:lambda:<region>:<account-id>:function:<function-name>
        lambda_concurrency_limit: 12             # Max concurrent Lambda invocations (throughput vs. throttling).

  trainer:
    max_steps: 2                                 # Steps to train for; one step consumes global_batch_size samples
    save_steps: 5
    test_steps: 1

    # RL parameters
    ent_coeff: 0.0                              # A bonus added to the policy loss that rewards higher-output entropy.
    kl_loss_coef: 0.001                         # Weight on the KL penalty between the actor (trainable policy) and a frozen reference model

    optim_config:                    # Optimizer settings
        lr: 5e-5                       # Learning rate
        weight_decay: 0.0              # L2 regularization strength (0.0–1.0)
        adam_beta1: 0.9
        adam_beta2: 0.95

    peft:                            # Parameter-efficient fine-tuning (LoRA)
        peft_scheme: "lora"            # Enable LoRA for PEFT
        lora_tuning:
            alpha: 32
            lora_plus_lr_ratio: 64.0     # LoRA+ learning rate scaling factor (0.0–100.0)
```

## RFT training using LLM as a judge


### Overview


Large language models (LLMs) are increasingly being used as judges in reinforcement fine-tuning (RFT) workflows, providing automated reward signals that guide model optimization. In this approach, an LLM evaluates model outputs against specified criteria—whether assessing correctness, quality, style adherence, or semantic equivalence—and assigns rewards that drive the reinforcement learning process.

This is particularly valuable for tasks where traditional reward functions are difficult to define programmatically, such as determining whether different representations (like "1/3", "0.333", and "one-third") are semantically equivalent, or evaluating nuanced qualities like coherence and relevance. By using LLM-based judges as reward functions, you can scale RFT to complex domains without requiring extensive human annotation, enabling rapid iteration and continuous improvement of your models across diverse use cases beyond traditional alignment problems.

### Reasoning mode selection


**Available modes**
+ none – No reasoning (omit the `reasoning_effort` field)
+ low – Minimal reasoning overhead
+ high – Maximum reasoning capability (default when `reasoning_effort` is specified)

**Note**  
There is no medium option for RFT. If the `reasoning_effort` field is absent from your configuration, reasoning is disabled. When reasoning is enabled, you should set `max_new_tokens` to 32768 to accommodate extended reasoning outputs.
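For example, in a recipe the two settings are paired like this (a fragment of the `training_config` structure shown in the sample recipe above):

```yaml
training_config:
  reasoning_effort: high        # or low; omit this field entirely to disable reasoning
  rollout:
    generator:
      max_new_tokens: 32768     # recommended cap when reasoning is enabled
```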

**When to use each mode**

Use high reasoning for:
+ Complex analytical tasks
+ Mathematical problem-solving
+ Multi-step logical deduction
+ Tasks where step-by-step thinking adds value

Use none (omit `reasoning_effort`) or low reasoning for:
+ Simple factual queries
+ Direct classifications
+ Speed and cost optimization
+ Straightforward question-answering

**Cost and performance trade-offs**

Higher reasoning modes increase:
+ Training time and cost
+ Inference latency and cost
+ Model capability for complex reasoning tasks

### Validating your LLM judge


Before deploying an LLM-as-a-judge in production, validate that the judge model's evaluations align with human judgment. This involves:
+ Measuring agreement rates between the LLM judge and human evaluators on representative samples of your task
+ Ensuring that the LLM's agreement with humans meets or exceeds inter-human agreement rates
+ Identifying potential biases in the judge model
+ Building confidence that the reward signal guides your model in the intended direction

This validation step helps ensure the automated evaluation process will produce models that meet your production quality criteria.
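A basic agreement check can be computed as follows. This minimal sketch measures raw agreement only; a fuller validation would also examine per-category agreement and systematic biases:

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of samples where the LLM judge agrees with the human label."""
    assert len(judge_labels) == len(human_labels), "label lists must align"
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Compare judge-vs-human agreement with inter-human agreement:
# if agreement_rate(judge, human_a) >= agreement_rate(human_b, human_a),
# the judge is at least as consistent as a second human annotator.
```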

### Lambda configuration for LLM judge


Using an LLM as a judge is an extension of using Lambda functions for Reinforcement Learning with Verifiable Rewards (RLVR). Inside the Lambda function, you make a call to one of the models hosted in Amazon Bedrock.

**Important configuration requirements:**


| Configuration | Requirement | Details | 
| --- | --- | --- | 
| Amazon Bedrock throughput | Sufficient quota | Ensure your throughput quota for the Amazon Bedrock model used is sufficient for your training workload | 
| Lambda timeout | Extended timeout | Configure your Lambda function timeout up to the maximum of 15 minutes. The default setting is 3 seconds, which is insufficient for Amazon Bedrock model responses | 
| Lambda concurrency | Increased concurrency | The Lambda gets invoked in parallel during training. Increase concurrency to maximize available throughput | 
| Recipe configuration | Match Lambda settings | The concurrency limit must be configured in your recipe | 
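Inside such a Lambda, the judge call and score parsing might look like the sketch below. The judge model ID, prompt format, and `SCORE:` reply convention are illustrative assumptions; the `converse` call is part of the `bedrock-runtime` client in the AWS SDK for Python:

```python
import re

def parse_judge_score(text: str) -> float:
    """Extract a 0-10 score from the judge's reply and normalize to 0-1."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", text)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return min(float(match.group(1)) / 10.0, 1.0)

def judge_response(prompt: str, response: str) -> float:
    """Ask a Bedrock-hosted model to grade a response; returns a 0-1 reward."""
    import boto3  # AWS SDK for Python, available in the Lambda runtime

    client = boto3.client("bedrock-runtime")
    reply = client.converse(
        modelId="us.amazon.nova-pro-v1:0",  # assumed judge model; choose your own
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Rate this answer from 0 to 10 and reply as 'SCORE: <n>'.\n"
                f"Question: {prompt}\nAnswer: {response}"
            )}],
        }],
    )
    text = reply["output"]["message"]["content"][0]["text"]
    return parse_judge_score(text)
```

Keeping the parsing logic in a separate function makes the scoring behavior testable without calling Amazon Bedrock.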

## Creating and running jobs


**Starting a training job**

Use the [SageMaker training job notebook template](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-fine-tuning-training-job.html#nova-model-training-jobs-notebook).

**Instance requirements**

The container supports both Full-Rank and LoRA training:
+ **LoRA training** – 2/4/6/8 × p5.48xlarge or p5en.48xlarge instances
+ **Full-Rank training** – 2/4/6/8 × p5.48xlarge instances (required)

## Monitoring training


Training logs include comprehensive metrics at each step. Key metric categories:

**Reward metrics**
+ `critic/rewards/mean`, `critic/rewards/max`, `critic/rewards/min` – Reward distribution
+ `val-score/rewards/mean@1` – Validation rewards

**Model behavior**
+ `actor/entropy` – Policy variation (higher = more exploratory)

**Training health**
+ `actor/pg_loss` – Policy gradient loss
+ `actor/pg_clipfrac` – Frequency of clipped updates
+ `actor/grad_norm` – Gradient magnitude

**Response characteristics**
+ `prompt_length/mean`, `prompt_length/max`, `prompt_length/min` – Input token statistics
+ `response_length/mean`, `response_length/max`, `response_length/min` – Output token statistics
+ `response/aborted_ratio` – Incomplete generation rate (0 = all completed)

**Performance**
+ `perf/throughput` – Training throughput
+ `perf/time_per_step` – Time per training step
+ `timing_per_token_ms/*` – Per-token processing times

**Resource usage**
+ `perf/max_memory_allocated_gb`, `perf/max_memory_reserved_gb` – GPU memory
+ `perf/cpu_memory_used_gb` – CPU memory

## Using fine-tuned models


After training completes, the final model checkpoint is saved to your specified output location. The checkpoint path is available in:
+ Training logs
+ `manifest.json` file in the output Amazon S3 location (defined by `output_s3_uri` in your notebook)

## Limitations and best practices


**Limitations**
+ **Lambda timeout** – Reward functions must complete within 15 minutes (prevents runaway processes and manages costs)
+ **Single-turn only** – Multi-turn conversations are not supported
+ **Data requirements** – Needs sufficient diversity; struggles with sparse rewards (<5% positive examples)
+ **Computational cost** – More expensive than supervised fine-tuning
+ **No multi-modal data** – Only text data type is supported

**Best practices**

**Start small**
+ Begin with 100-200 examples
+ Validate reward function correctness
+ Scale gradually based on results

**Pre-training evaluation**
+ Test baseline model performance before RFT
+ If rewards are consistently 0%, use SFT first to establish basic capabilities
+ If rewards are >95%, RFT may be unnecessary

**Monitor training**
+ Track average reward scores and distribution
+ Watch for overfitting (training rewards increase while validation rewards decrease)
+ Look for concerning patterns:
  + Rewards plateauing below 0.15
  + Increasing reward variance over time
  + Declining validation performance
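These patterns can be checked mechanically against per-step reward histories from your training logs. A minimal sketch, with illustrative thresholds:

```python
def training_concerns(train_rewards, val_rewards, plateau_threshold=0.15):
    """Flag the concerning patterns above, given per-step mean reward histories."""
    concerns = []
    if train_rewards and max(train_rewards) < plateau_threshold:
        concerns.append("rewards plateauing below threshold")
    # Overfitting signature: training rewards rise while validation rewards fall.
    if len(train_rewards) >= 2 and len(val_rewards) >= 2:
        if train_rewards[-1] > train_rewards[0] and val_rewards[-1] < val_rewards[0]:
            concerns.append("possible overfitting (train up, validation down)")
    return concerns
```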

**Optimize reward functions**
+ Execute within seconds (not minutes)
+ Minimize external API calls
+ Use efficient algorithms
+ Implement proper error handling
+ Take advantage of Lambda's parallel scaling

**Iteration strategy**

If rewards aren't improving:
+ Adjust reward function design
+ Increase dataset diversity
+ Add more representative examples
+ Verify reward signals are clear and consistent

## Advanced capabilities: Nova Forge


For users requiring advanced capabilities beyond standard RFT limitations, Nova Forge is available as a paid subscription service offering:
+ Multi-turn conversation support
+ Reward functions with >15 minute execution time
+ Additional algorithms and tuning options
+ Custom training recipe modifications
+ State-of-the-art AI techniques

Nova Forge runs on SageMaker HyperPod and is designed to help enterprise customers build their own frontier models.

## Useful commands and tips


A collection of [observability scripts](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/SageMakerUilts/SageMakerJobsMonitoring) is available to help monitor the status and progress of training jobs.

The available scripts support:
+ Enabling email notifications for training job status updates
+ Obtaining training time estimates based on job configurations
+ Estimating how long training is expected to take for in-progress jobs

**Installation**

**Note**  
Be sure to refresh your AWS credentials prior to using any of the following scripts.

```
pip install boto3
git clone https://github.com/aws-samples/amazon-nova-samples.git
cd amazon-nova-samples/customization/SageMakerUilts/SageMakerJobsMonitoring/
```

**Basic usage**

```
# Enabling email notifications for training job status updates
python enable_sagemaker_job_notifs.py --email test@amazon.com test2@gmail.com --region us-east-1 --platform SMTJ

Creating resources........
Please check your email for a subscription confirmation email, and click 'Confirm subscription' to start receiving job status email notifications!
You'll receive the confirmation email within a few minutes.
```

```
# Obtaining training time estimates based on job configurations
python get_training_time_estimate.py
```

```
# Obtaining approximations for how long training is expected to take for in-progress jobs
python get-training-job-progress.py --region us-east-1 --job-name my-training-job --num-dataset-samples 1000
```

For additional details and examples, see the [README](https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/SageMakerUilts/SageMakerJobsMonitoring/README.md).

# Monitoring Progress Across Iterations
Monitoring Progress Across Iterations

You can track metrics via MLflow.

## Nova Customization - MLFlow setup for SageMaker HyperPod


To enable your SageMaker HyperPod environment to publish metrics to MLFlow, some additional setup is required.

1. Open Amazon SageMaker AI

1. Select SageMaker Studio

   1. If there is a profile already created, select "Open Studio".

   1. If no profile is created, select "Create A SageMaker Domain" to set one up

1. Select MLFlow. If there is not any MLFlow App created, select "Create MLFlow App"

1. Click on the copy/paste button or the "View Details" menu item on the ML Flow App in order to get the ARN. You will need this when you submit your training job.  
![\[The View Details button on ML Flow App, with ARN.\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/how-it-works.png)

1. On the HyperPod cluster execution role, add the following policy. This will allow the HyperPod cluster to call the MLFlow API to publish metrics.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker-mlflow:*",
            "Resource": [
                "arn:aws:sagemaker:us-east-1:372836560492:mlflow-app/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListMlflowTrackingServers",
                "sagemaker:CallMlflowAppApi"
            ],
            "Resource": "*"
        }
    ]
}
```

### Submitting a job via the CLI


Specify three new override parameters, either on the command line or in the recipe YAML.

1. `mlflow_tracking_uri`: The ARN of the MLFlow App

1. `mlflow_experiment_name`: The experiment name under which the metrics will be stored in MLFlow

1. `mlflow_run_name`: The name for this run of the experiment

Command line:

```
--override-parameters '{"recipes.run.mlflow_tracking_uri": "arn:aws:sagemaker:us-east-1:925548216816:mlflow-app/app-B6XOUNCHQM4W", "recipes.run.mlflow_experiment_name": "myuser-sft-lora-exp1", "recipes.run.mlflow_run_name": "myuser-sft-lora-exp1-202512181940"}'
```

yaml:

```
## Run config
run:
  mlflow_tracking_uri: "arn:aws:sagemaker:us-east-1:925548216816:mlflow-app/app-B6XOUNCHQM4W" 
  mlflow_experiment_name: "myuser-sft-lora-exp1"
  mlflow_run_name: "myuser-sft-lora-exp1-202512181940"
```

### Submitting a job via the SageMaker Studio UI


MLFlow integration is already built into the SageMaker Studio UI experience. When submitting a training job, simply indicate which MLFlow App instance to use.

1. In SageMaker Studio, navigate to Models > Nova 2.0 Lite > Customize > Customize with UI.

1. Expand the Advanced Configuration section

1. Select the MLFlow App where you would like to send the training metrics. You can also set your experiment name and experiment run here.

![\[the ML Flow App.\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/MLFlow-App.png)



## Create an MLflow app


**Using Studio UI:** If you create a training job through the Studio UI, a default MLflow app is created automatically and selected by default under Advanced Options.

**Using CLI:** If you use the CLI, you must create an MLflow app and pass it as an input to the training job API request.

```
mlflow_app_name="<enter your MLflow app name>"
role_arn="<enter your role ARN>" 
bucket_name="<enter your bucket name>" 
region="<enter your region>"

mlflow_app_arn=$(aws sagemaker create-mlflow-app \
  --name $mlflow_app_name \
  --artifact-store-uri "s3://$bucket_name" \
  --role-arn $role_arn \
  --region $region)
```

## Access the MLflow app


**Using CLI:** Create a pre-signed URL to access the MLflow app UI:

```
aws sagemaker create-presigned-mlflow-app-url \
  --arn $mlflow_app_arn \
  --region $region \
  --output text
```

**Using Studio UI:** The Studio UI displays key metrics stored in MLflow and provides a link to the MLflow app UI.

## Key metrics to track


Monitor these metrics across iterations to assess improvement and track the job progress:

**For SFT**
+ Training loss curves
+ Number of samples consumed and time to process samples
+ Performance accuracy on held-out test sets
+ Format compliance (e.g., valid JSON output rate)
+ Perplexity on domain-specific evaluation data

**For RFT**
+ Average reward scores over training
+ Reward distribution (percentage of high-reward responses)
+ Validation reward trends (watch for over-fitting)
+ Task-specific success rates (e.g., code execution pass rate, math problem accuracy)

**General**
+ Benchmark performance deltas between iterations
+ Human evaluation scores on representative samples
+ Production metrics (if deploying iteratively)
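As one example, the RFT reward-distribution metric can be computed directly from per-sample rewards; the 0.8 threshold below is an illustrative choice for what counts as a high-reward response:

```python
def reward_distribution(rewards, high_threshold=0.8):
    """Summarize per-sample rewards: mean and share of high-reward responses."""
    mean = sum(rewards) / len(rewards)
    high_share = sum(r >= high_threshold for r in rewards) / len(rewards)
    return {"mean": round(mean, 4), "high_reward_fraction": round(high_share, 4)}
```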

## Determining when to stop


Stop iterating when:
+ **Performance plateaus**: Additional training no longer meaningfully improves target metrics
+ **Technique switching helps**: If one technique plateaus, try switching (e.g., SFT → RFT → SFT) to break through performance ceilings
+ **Target metrics achieved**: Your success criteria are met
+ **Regression detected**: New iterations degrade performance (see rollback procedures below)

For detailed evaluation procedures, refer to the **Evaluation** section.

# Evaluating your SageMaker AI-trained model
Evaluation

The purpose of the evaluation process is to assess trained-model performance against benchmarks or custom datasets. The process typically involves creating an evaluation recipe that points to the trained model, specifying evaluation datasets and metrics, submitting a separate evaluation job, and evaluating against standard benchmarks or custom data. The evaluation process outputs performance metrics to your Amazon S3 bucket.

**Note**  
The evaluation process described in this topic is an offline process. The model is tested against fixed benchmarks with predefined answers, rather than being assessed in real-time or through live user interactions. For real-time evaluation, you can test the model after it has been deployed to Amazon Bedrock by calling [Amazon Bedrock](https://docs.aws.amazon.com//bedrock/latest/userguide/import-with-create-custom-model.html) Runtime APIs.

**Topics**
+ [Prerequisites](#nova-model-evaluation-prerequisites)
+ [Available benchmark tasks](#nova-model-evaluation-benchmark)
+ [Evaluation specific configurations](#nova-model-evaluation-config)
+ [Running evaluation training jobs](#nova-model-evaluation-notebook)
+ [Assessing and analyzing evaluation results](#nova-model-evaluation-assess)
+ [Evaluation best practices and troubleshooting](#nova-model-evaluation-best-practices)
+ [Available subtasks](#nova-model-evaluation-subtasks)
+ [Reasoning model evaluation](nova-reasoning-model-evaluation.md)
+ [RFT evaluation](nova-rft-evaluation.md)
+ [Implementing reward functions](nova-implementing-reward-functions.md)

## Prerequisites
Prerequisites

Before you start an evaluation job, note the following prerequisites.
+ A SageMaker AI-trained Amazon Nova model whose performance you want to evaluate.
+ A base Amazon Nova recipe for evaluation. For more information, see [Getting Amazon Nova recipes](nova-model-recipes.md#nova-model-get-recipes).

## Available benchmark tasks
Available benchmark tasks

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code packages, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

The following industry-standard benchmarks are supported. You can specify them in the `eval_task` parameter.

**Available benchmarks for model evaluation**


| Benchmark | Modality | Description | Metrics | Strategy | Subtask available | 
| --- | --- | --- | --- | --- | --- | 
| mmlu |  Text  |  Multi-task Language Understanding – Tests knowledge across 57 subjects.  |  accuracy  | zs\_cot | Yes | 
| mmlu\_pro | Text |  MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering.  | accuracy | zs\_cot | No | 
| bbh | Text |  Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills.  | accuracy | fs\_cot | Yes | 
| gpqa | Text |  General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities.  | accuracy | zs\_cot | No | 
| math | Text |  Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems.  | exact\_match | zs\_cot | Yes | 
| strong\_reject | Text |  Quality-Control Task – Tests the model’s ability to detect and reject inappropriate, harmful, or incorrect content.  | deflection | zs | Yes | 
| ifeval | Text |  Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification.  | accuracy | zs | No | 
| gen\_qa | Multi-Modal (image) |  Custom Dataset Evaluation – Lets you supply your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. `gen_qa` supports image inference for Amazon Nova Lite or Amazon Nova Pro based models. It also supports bring-your-own-metrics Lambda functions. (For RFT evaluation, use the RFT evaluation recipe.)  | all | gen\_qa | No | 
| mmmu | Multi-Modal |  Massive Multidiscipline Multimodal Understanding (MMMU) – College-level benchmark comprising multiple-choice and open-ended questions from 30 disciplines.  | accuracy | zs\_cot | Yes | 
| llm\_judge | Text |  LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A.  | all | judge | No | 
|  mm\_llm\_judge  | Multi-Modal (image) |  This benchmark behaves the same as the text-based `llm_judge` above. The only difference is that it supports image inference.  | all | judge | No | 
|  rubric\_llm\_judge  |  Text  |  Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.  |  all  |  judge  |  No  | 
|  aime\_2024  |  Text  |  AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving.  |  exact\_match  |  zs\_cot  |  No  | 
|  calendar\_scheduling  | Text |  Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people.  |  exact\_match  |  fs  | No | 
|  humaneval  | Text |  HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models.  |  pass@1  | zs | No | 

## Evaluation specific configurations


Below is a breakdown of the key components in the recipe and guidance on how to modify them for your use cases.

### Understanding and modifying your recipes


**General run configuration**

```
run:
  name: eval_job_name 
  model_type: amazon.nova-2-lite-v1:0:256k 
  model_name_or_path: nova-lite-2/prod # or s3://escrow_bucket/model_location
  replicas: 1 
  data_s3_path: ""
  mlflow_tracking_uri: "" 
  mlflow_experiment_name : "" 
  mlflow_run_name : ""
```
+ `name`: A descriptive name for your evaluation job.
+ `model_type`: Specifies the Nova model variant to use. Do not manually modify this field. Options include:
  + amazon.nova-micro-v1:0:128k
  + amazon.nova-lite-v1:0:300k
  + amazon.nova-pro-v1:0:300k
  + amazon.nova-2-lite-v1:0:256k
+ `model_name_or_path`: The path to the base model, or the S3 path to a post-trained checkpoint. Options include:
  + nova-micro/prod
  + nova-lite/prod
  + nova-pro/prod
  + nova-lite-2/prod
  + S3 path to a post-trained checkpoint (`s3://customer-escrow-111122223333-smtj-<unique_id>/<training_run_name>`)
**Note**  
**Evaluate post-trained model**  
To evaluate a post-trained model after a Nova SFT training job, follow these steps after running a successful training job. At the end of the training logs, you will see the log message "Training is complete". You will also find a `manifest.json` file in your output bucket containing the location of your checkpoint. This file will be located within an `output.tar.gz` file at your output S3 location. To proceed with evaluation, use this checkpoint by setting it as the value for `run.model_name_or_path` in your recipe configuration.
+ `replicas`: The number of compute instances to use for distributed inference (running inference across multiple nodes). Set `replicas` > 1 to enable multi-node inference, which accelerates evaluation. If both `instance_count` and `replicas` are specified, `instance_count` takes precedence. Note that multiple replicas apply only to SageMaker AI training jobs.
+ `data_s3_path`: The input dataset Amazon S3 path. This field is required but should always be left empty.
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SageMaker HyperPod).
+ `mlflow_experiment_name`: (Optional) The name of the experiment used to group related ML runs together.
+ `mlflow_run_name`: (Optional) A custom name for a specific training run within an experiment.

**Evaluation configuration**

```
evaluation:
  task: mmlu 
  strategy: zs_cot 
  subtask: abstract_algebra
  metric: accuracy
```
+ `task`: Specifies the evaluation benchmark or task to use. Supported tasks include:
  + `mmlu`
  + `mmlu_pro`
  + `bbh`
  + `gpqa`
  + `math`
  + `strong_reject`
  + `gen_qa`
  + `ifeval`
  + `mmmu`
  + `llm_judge`
  + `mm_llm_judge`
  + `rubric_llm_judge`
  + `aime_2024`
  + `calendar_scheduling`
  + `humaneval`
+ `strategy`: Defines the evaluation approach.
  + `zs_cot`: Zero-shot Chain of Thought - prompts the model to reason step by step without providing explicit examples.
  + `fs_cot`: Few-shot Chain of Thought - provides a few examples of step-by-step reasoning before asking the model to solve a new problem.
  + `zs`: Zero-shot - asks the model to solve a problem without any prior examples.
  + `gen_qa`: Strategy specific to the bring-your-own-dataset benchmark.
  + `judge`: Strategy specific to `llm_judge` and `mm_llm_judge`.
+ `subtask`: (Optional) A specific component of the evaluation task. For a complete list of available subtasks, see [Available subtasks](#nova-model-evaluation-subtasks).
  + Check the **Subtask available** column in the benchmarks table to see which benchmarks support subtasks.
  + Remove this field if the benchmark has no subtasks.
+ `metric`: The evaluation metric to use.
  + `accuracy`: Percentage of correct answers.
  + `exact_match`: For math benchmark, returns the rate at which the input predicted strings exactly match their references.
  + `deflection`: For strong reject benchmark, returns relative deflection to base model and difference significance metrics.
  + `all`:

    For `gen_qa`, the bring-your-own-dataset benchmark, the following metrics are returned:
    + `rouge1`: Measures overlap of unigrams (single words) between generated and reference text.
    + `rouge2`: Measures overlap of bigrams (two consecutive words) between generated and reference text.
    + `rougeL`: Measures longest common subsequence between texts, allowing for gaps in the matching.
    + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
    + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
    + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
    + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
    + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.

    For `llm_judge` and `mm_llm_judge`, the bring-your-own-dataset judge benchmarks, the following metrics are returned:
    + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
    + `a_scores_stderr`: Standard error of `a_scores` across pairwise judgements.
    + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
    + `b_scores_stderr`: Standard error of `b_scores` across pairwise judgements.
    + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
    + `ties_stderr`: Standard error of `ties` across pairwise judgements.
    + `inference_error`: Count of judgements that could not be properly evaluated.
    + `inference_error_stderr`: Standard error of `inference_error` across pairwise judgements.
    + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
    + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
    + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using Bradley-Terry probability.
    + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
    + `upper_rate`: Upper bound (97.5th percentile) of the estimated win rate from bootstrap sampling.
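The lenient metrics differ from their strict counterparts only in how text is normalized before comparison. The following sketch illustrates the general idea behind `exact_match`, `quasi_exact_match`, and `f1_score`; the helper names and the exact normalization rules (lowercasing, punctuation and whitespace stripping) are illustrative assumptions, not the evaluation harness's actual implementation.

```python
import re
import string

def normalize(text: str) -> str:
    # Illustrative normalization: lowercase, drop punctuation, collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, reference: str) -> int:
    # Binary score: character-by-character comparison.
    return int(prediction == reference)

def quasi_exact_match(prediction: str, reference: str) -> int:
    # More lenient: compare after normalization.
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    # Word-overlap F1 between predicted and reference answers.
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = sum(min(pred_tokens.count(t), ref_tokens.count(t)) for t in set(pred_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("Paris", "paris")` is 0, while `quasi_exact_match("Paris.", "paris")` is 1 because the normalized forms agree.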

**Inference configuration (optional)**

```
inference:
  max_new_tokens: 2048 
  top_k: -1 
  top_p: 1.0 
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: Maximum number of tokens to generate. Must be an integer. (Unavailable for LLM Judge)
+ `top_k`: Number of the highest probability tokens to consider. Must be an integer.
+ `top_p`: Cumulative probability threshold for token sampling. Must be a float between 0.0 and 1.0.
+ `temperature`: Randomness in token selection (higher means more random). Keep it at 0 to make results deterministic. Float type; minimum value is 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are null (the default if not set; disables reasoning), low, or high.

### Log Probability Output Format


When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
"Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
"Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
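As a sketch of how this structure can be consumed, the following helpers (the function names are mine, not part of any Nova SDK) pick the most likely token at a position and check whether an expected answer appears among the candidates:

```python
import math

def top_candidate(logprobs: dict) -> tuple:
    # Return (decoded token, log probability) for the highest-probability
    # candidate at one token position.
    key = max(logprobs, key=lambda k: logprobs[k]["logprob_value"])
    return logprobs[key]["decoded_value"], logprobs[key]["logprob_value"]

def contains_token(logprobs: dict, expected: str) -> bool:
    # Check whether an expected answer appears among the top candidates
    # (useful when scoring CPT checkpoints via logprobs, described later).
    return any(v["decoded_value"].strip() == expected for v in logprobs.values())

# One token position, in the format shown above.
position = {
    "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
    "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"},
}
token, lp = top_candidate(position)
probability = math.exp(lp)  # convert a log probability back to a probability
```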

### Evaluation recipe examples


Amazon Nova provides four different types of evaluation recipes. All recipes are available in [SageMaker HyperPod recipes GitHub repository](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).

#### General text benchmark recipes


These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. 

Recipe format: `xxx_general_text_benchmark_eval.yaml`.

#### Bring your own dataset benchmark recipes


These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. 

Recipe format: `xxx_bring_your_own_dataset_eval.yaml`.

**Bring your own dataset requirements**

File format: 
+ A single `gen_qa.jsonl` file containing evaluation examples. The file name must be exactly `gen_qa.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the general Q&A dataset.

Schema format requirements - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `query`: String containing the question or instruction that needs an answer.

  `response`: String containing the expected model output.
+ Optional fields.

  `system`: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query.

  `images`: Array containing a list of objects with data attributes (Base64 encoded image strings).

  `metadata`: String containing metadata associated with the entry for tagging purposes.

**Example entry**

```
{
   "system": "You are an English major with top marks in class who likes to give minimal word responses: ",
   "query": "What is the symbol that ends the sentence as a question",
   "response": "?"
}
{
   "system": "You are a pattern analysis specialist who provides succinct answers: ",
   "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response": "32"
}
{
   "system": "You have great attention to detail and follow instructions accurately: ",
   "query": "Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response": "of dry"
}
{
   "system": "Image inference: ",
   "query": "What is the number in the image? Please just use one English word to answer.",
   "response": "two",
   "images": [
     {
       "data": "data:image/png;Base64,iVBORw0KGgoA ..."
     }
   ]
}
```

To use your custom dataset, modify your evaluation recipe by adding the following required fields without changing the existing configuration:

```
evaluation:
  task: gen_qa 
  strategy: gen_qa 
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
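Because the file must strictly follow the schema, it can be worth checking it locally before uploading to Amazon S3. The following sketch validates a `gen_qa.jsonl` file against the required and optional fields described above; the function name and the specific checks are illustrative assumptions, not an official validator.

```python
import json

REQUIRED = {"query", "response"}
OPTIONAL = {"system", "images", "metadata"}

def validate_gen_qa(path: str) -> list:
    # Return a list of schema problems found in a gen_qa.jsonl file.
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {n}: not valid JSON")
                continue
            missing = REQUIRED - record.keys()
            if missing:
                problems.append(f"line {n}: missing required fields {sorted(missing)}")
            unknown = record.keys() - REQUIRED - OPTIONAL
            if unknown:
                problems.append(f"line {n}: unexpected fields {sorted(unknown)}")
    return problems
```

An empty returned list means every line parsed as JSON and used only the documented fields.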

##### Bring your own metrics


You can bring your own metrics to fully customize your model evaluation workflow with custom preprocessing, postprocessing, and metrics capabilities. Preprocessing allows you to process input data before sending it to the inference server, and postprocessing allows you to customize metrics calculation and return custom metrics based on your needs.

Follow these steps to bring your own metrics with custom evaluation SDK.

1. If you haven't already done so, [create an AWS Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html) in your AWS account.

1. Download the pre-built `nova-custom-eval-layer.zip` file from the [GitHub repository](https://github.com/aws/nova-custom-eval-sdk/releases). You can use this open-source Nova custom evaluation SDK to validate input and output payloads for your custom function and provide a unified interface for integrating with Nova's bring your own metrics evaluation during training.

1. Upload the custom Lambda layer using the following command:

   ```
   aws lambda publish-layer-version \
       --layer-name nova-custom-eval-layer \
       --zip-file fileb://nova-custom-eval-layer.zip \
       --compatible-runtimes python3.12 python3.11 python3.10 python3.9
   ```

1. Add this layer as a custom layer to your Lambda function, along with the required AWS layer: `AWSLambdaPowertoolsPythonV3-python312-arm64` (required for `pydantic` dependency).

1. Update your Lambda code using the provided example, modifying the code as needed. This example code creates a Lambda function for Nova's custom evaluation with preprocessing and postprocessing steps for model evaluation.

   ```
   from nova_custom_evaluation_sdk.processors.decorators import preprocess, postprocess
   from nova_custom_evaluation_sdk.lambda_handler import build_lambda_handler
   
   @preprocess
   def preprocessor(event: dict, context) -> dict:
       data = event.get('data', {})
       return {
           "statusCode": 200,
           "body": {
               "system": data.get("system"),
               "prompt": data.get("prompt", ""),
               "gold": data.get("gold", "")
           }
       }
   
   @postprocess
   def postprocessor(event: dict, context) -> dict:
       # data is already validated and extracted from event
       data = event.get('data', {})
       inference_output = data.get('inference_output', '')
       gold = data.get('gold', '')
       
       metrics = []
       inverted_accuracy = 0 if inference_output.lower() == gold.lower() else 1.0
       metrics.append({
           "metric": "inverted_accuracy_custom",
           "value": inverted_accuracy
       })
       
       # Add more metrics here
       
       return {
           "statusCode": 200,
           "body": metrics
       }
   
   # Build Lambda handler
   lambda_handler = build_lambda_handler(
       preprocessor=preprocessor,
       postprocessor=postprocessor
   )
   ```

1. Grant the evaluation job access to your Lambda function. Ensure that the execution role specified for the evaluation job includes a policy that allows it to invoke your Lambda function. Here is an example policy.


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "LambdaAccess",
               "Effect": "Allow",
               "Action": [
                   "lambda:InvokeFunction"
               ],
               "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ExampleFunction",
               "Condition": {
                   "StringLike": {
                       "aws:PrincipalArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-ARN"
                   }
               }
           },
           {
               "Sid": "DenyNonAWSEventSourcesForLambda",
               "Effect": "Deny",
               "Action": [
                   "lambda:InvokeFunction"
               ],
               "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ExampleFunction",
               "Condition": {
                   "Null": {
                       "lambda:EventSourceToken": false
                   }
               }
           }
       ]
   }
   ```


1. Review the Lambda payload schema. You can validate your schema using the Nova custom evaluation SDK. For the full Lambda request and response schema, [see the AWS documentation website](http://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-model-evaluation.html).

1. Modify the recipe file. Here is an example. 

   ```
   processor:
     lambda_arn: arn:aws:lambda:us-east-1:111122223333:function:name
     lambda_type: "custom_metrics"
     preprocessing:
       enabled: true
     postprocessing:
       enabled: true
     aggregation: average
   ```
   + `lambda_arn`: The Amazon Resource Name (ARN) of your Lambda function that handles preprocessing and postprocessing.
   + `lambda_type`: Either "custom_metrics" or "rft".
   + `preprocessing`: Whether to enable custom pre-processing operations.
   + `postprocessing`: Whether to enable custom post-processing operations.
   + `aggregation`: Built-in aggregation function (valid options: min, max, average, sum).

**Limitations**
+ Bring your own metrics only applies to text input datasets.
+ Multi-modal input datasets are not supported. 
+ The preprocessing step does not process the metadata field.

#### Nova LLM as a Judge benchmark recipes


Nova LLM Judge is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, and challenger responses, then uses a Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_llm_judge_eval.yaml`.
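As a rough intuition for what the win rate captures (this is a simplified point estimate, not the actual Bradley-Terry computation over forward and backward passes with bootstrap sampling), the pairwise counts can be combined with ties counted as half a win for each side:

```python
def simple_winrate(b_wins: int, a_wins: int, ties: int) -> float:
    # Simplified point estimate of the probability that response_B is
    # preferred over response_A, counting each tie as half a win for B.
    # The actual `winrate` metric is derived from a Bradley-Terry model.
    total = b_wins + a_wins + ties
    return (b_wins + 0.5 * ties) / total
```

For example, with 6 wins for B, 2 wins for A, and 2 ties, the estimate is 0.7.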

**Nova LLM dataset requirements**

File format: 
+ A single `llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `llm_judge.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the `llm_judge` dataset.
+ All records in the input dataset must be under a 12k context length.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields, without changing any other content:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova Judge models are the same across micro / lite / pro specifications.
+ Custom judge models are not currently supported.

##### Nova LLM as a Judge for multi-modal (image) benchmark recipes


Nova LLM Judge for multi-modal (image), short for Nova MM LLM Judge, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses a Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ A single `mm_llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `mm_llm_judge.jsonl`.
+ You must upload your dataset to an S3 location that SageMaker AI training jobs can access.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ All records in the input dataset must be under a 12k context length, excluding the `images` attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `images`: Array containing a list of objects with data attributes (values are Base64-encoded image strings).

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "What is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "How many animals are in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields, without changing any other content:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models only support image references.
+ Nova MM Judge models are the same across Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URI is not supported.
+ All records in the input dataset must be under a 12k context length, excluding the `images` attribute.
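Because Amazon S3 image URIs are not supported, each image must be embedded inline. The following sketch builds one dataset line with the `data:<media_type>;Base64,...` encoding shown in the example entries; the helper name is an illustrative assumption, not part of any Nova tooling.

```python
import base64
import json

def image_record(prompt: str, image_bytes: bytes, media_type: str,
                 response_a: str, response_b: str) -> str:
    # Build one mm_llm_judge JSONL line. The image is embedded as a
    # "data:<media_type>;Base64,..." string, as in the example entries.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    record = {
        "prompt": prompt,
        "images": [{"data": f"data:{media_type};Base64,{encoded}"}],
        "response_A": response_a,
        "response_B": response_b,
    }
    return json.dumps(record)
```

You could then write one such line per record to `mm_llm_judge.jsonl`, reading the bytes from a local image file (a hypothetical path) with, for example, `open("photo.jpg", "rb").read()`.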

#### Evaluating CPT (Continuous Pre-Training) Checkpoints


Evaluating CPT (continuously pre-trained) models can be more challenging than evaluating models that have undergone SFT (supervised fine-tuning), because CPT models typically lack the ability to follow instructions. Instead, CPT models operate as completion models: they only attempt to continue the pattern in the input token sequence. Given this limitation, typical evaluation datasets may not work correctly because of the question-and-answer format of their inputs; instead of answering the question, the model will simply try to continue it. However, by formatting datasets to prompt the model in a completion style, you can get an understanding of how the model is performing.

Follow these steps to perform an evaluation on a continuously pre-trained model using the Nova Forge evaluation workflow.

##### Dataset Preparation and Formatting


Evaluating a CPT model takes advantage of the existing [Bring Your Own Dataset](#nova-model-evaluation-config-byo) workflow provided in the Nova Forge model evaluation experience. However, queries within the dataset must be formatted in a purely "completion" format, because CPT models will not respond to standard question-style prompts the way an SFT model would.

Another frequent limitation of models that have undergone only CPT is their inability to generate stop or end-of-sequence tokens; the model will continue to generate tokens until it is forcefully stopped (for example, by the `max_new_tokens` parameter). Given this limitation, the best practice is to evaluate these models using single-token responses (such as multiple choice) so the model doesn't continue to generate unneeded junk output after the answer.

For example, a typical evaluation dataset (such as MMLU, GPQA, or MATH) might prompt the model with a question such as:

```
Early settlements and high population density along coastlines and rivers are 
best attributed to which of the following?
A: "Poor climate conditions"
B: "Limited forest cover"
C: "Cars"
D: "Access to trade routes" 

(Expected answer is D.)
```

However, a CPT model would not understand how to respond to this question properly, because it lacks fine-tuning on instruction following. Therefore, CPT models must be prompted in a completion style, such as:

```
Early settlements and high population density along coastlines and rivers 
are best attributed to which of the following?
A: Poor climate conditions
B: Limited forest cover
C: Cars
D: Access to trade routes
The correct answer to this question is option 

(Expected answer is D.)
```

After inference, checking the output logprobs generated by the model will show whether the model correctly processed the input and generated the correct response. The model is not guaranteed to always produce the exact expected response (in this case, the letter D); however, the expected token should be present within the logprobs if the model is functioning correctly.

Here is another completion-style prompt example, one that is not multiple choice:

```
The capital of France is

(Expected answer of Paris)
```

We would expect the model to either produce a response of "Paris" or see the token corresponding to "Paris" somewhere in the logprobs output.

##### Dataset Formatting


CPT evaluation takes advantage of the existing [Bring Your Own Dataset](#nova-model-evaluation-config-byo) workflow. Data must be formatted in the "query response" format in a JSONL file separated by new lines.

Here is an example of a dataset with four entries:

```
{"query": "The capital of France is", "response": "Paris"}
{"query": "2 + 2 =", "response": "4"}
{"query": "The mitochondria is the powerhouse of the", "response": "cell"}
{"query": "What is the largest planet?\nA: Mars\nB: Jupiter\nC: Saturn\nD: Earth\nAnswer:", "response": "B"}
```

Each line must contain:
+ `query`: The prompt text for completion
+ `response`: The expected completion (ground truth)

The model receives raw text input without chat formatting. CPT models are typically not yet trained on special tokens and will not respond properly to chat templates, so when you prompt the model, only the string provided in `query` is sent to the model (with an additional `[BOS]` token prepended automatically).
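The multiple-choice reformatting described earlier can be scripted when preparing the dataset. The following sketch (a hypothetical helper, not part of any Nova tooling) renders one question in the completion style shown above and emits a `query`/`response` JSONL line:

```python
import json

def to_completion_entry(question: str, choices: dict, answer: str) -> str:
    # Render a multiple-choice item as a completion-style prompt so a CPT
    # checkpoint can be scored on a single-token answer.
    lines = [question]
    lines += [f"{label}: {text}" for label, text in choices.items()]
    lines.append("The correct answer to this question is option")
    # Trailing space invites the model to complete with the option letter.
    return json.dumps({"query": "\n".join(lines) + " ", "response": answer})

entry = to_completion_entry(
    "What is the largest planet?",
    {"A": "Mars", "B": "Jupiter", "C": "Saturn", "D": "Earth"},
    "B",
)
```

Writing one such line per question produces a `gen_qa.jsonl` file in the query/response format above.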

##### Recipe Configuration


Here is an example of a recipe which is configured for evaluating a CPT model:

```
run:
  name: cpt_eval_job
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: s3://bucket/path/to/cpt-checkpoint/

evaluation:
  task: gen_qa  # Required for CPT - bring your own dataset
  strategy: gen_qa
  metric: all  # Returns rouge1, rouge2, rougeL, exact_match, f1_score, bleu

inference:
  checkpoint_is_instruction_tuned: "false"  # Required for CPT checkpoints
  top_logprobs: 5 # Change to desired amount of logprobs to calculate
  max_new_tokens: 1 # Keep low for completion tasks
  temperature: 0.0
```

Key changes for CPT evaluation:
+ `checkpoint_is_instruction_tuned: "false"`

  This is a new parameter added specifically to support evaluation runs on CPT checkpoints. Setting `checkpoint_is_instruction_tuned` to false will **disable** the default Amazon Nova chat template that normally wraps the input prompt.
+ `top_logprobs: 5`

  Log probabilities (logprobs) reveal the model's confidence distribution across possible next tokens, helping you assess whether the model has learned the expected completions during pre-training. Typically, if the model is performing as intended, you should see the expected response (for example, "A" or "B") as either the generated output token or a token in the logprobs.
+ `max_new_tokens: 1`

  CPT models typically have not yet been trained to generate the special stop or end-of-sequence tokens that signal when inference should end, so the model will continue generating tokens until the maximum token length is reached, resulting in unnecessary inference. The most efficient approach is to set `max_new_tokens` to 1 and use prompts that can be evaluated with a single-token response (such as multiple-choice questions). This also ensures that extra junk tokens aren't generated.

##### Key Parameters

+ **checkpoint_is_instruction_tuned**: Must be set to `"false"` (or `false` as a boolean) to disable chat templates
+ **top_logprobs**: 5 is recommended to see how the model is learning during CPT
+ **task**: Must be `gen_qa` - CPT models cannot use instruction-following benchmarks like MMLU or MATH
+ **strategy**: Must be `gen_qa`
+ **max_new_tokens**: Recommended to keep low (1-5) since CPT models perform completion, not generation

## Running evaluation training jobs
Evaluation training jobs

Start an evaluation training job using the following sample notebook. For more information, see [Use a SageMaker AI estimator to run a training job](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own-private-registry-estimator.html).

### Reference tables


Before running the notebook, refer to the following reference tables to select image URI and instance configurations.

**Selecting image URI**


| Recipe | Image URI | 
| --- | --- | 
|  Evaluation image URI  | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest | 

**Selecting instance type and count**


| Model | Job type | Instance type | Recommended instance count | Allowed instance count | 
| --- | --- | --- | --- | --- | 
| Amazon Nova Micro | Evaluation (SFT/DPO) |  g5.12xlarge  | 1 | 1 - 16 | 
| Amazon Nova Lite | Evaluation (SFT/DPO) |  g5.12xlarge  | 1 | 1 - 16 | 
| Amazon Nova Pro | Evaluation (SFT/DPO) |  p5.48xlarge  | 1 | 1 - 16 | 

### Sample notebook


The following sample notebook demonstrates how to run an evaluation training job.

```
# install python SDK

# Do not use sagemaker v3, as sagemaker v3 introduced breaking changes

!pip install sagemaker==2.254.1
 
import os
import sagemaker, boto3
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Download recipe from https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/evaluation/nova to local
# Assume the file name is `recipe.yaml`

# Populate parameters
# input_s3_uri = "s3://<path>/input/" # (Optional) Only used for multi-modal dataset or bring your own dataset s3 location
output_s3_uri = "s3://<path>/output/" # Output data S3 location; a zip containing the metrics JSON and TensorBoard metrics files will be stored here
instance_type = "instance_type"  # e.g., ml.g5.16xlarge
instance_count = 1 # Number of instances for inference (set instance_count > 1 for multi-node inference to accelerate evaluation)
job_name = "your job name"
recipe_path = "recipe path" # e.g., ./recipe.yaml
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest" # Do not change
output_kms_key = "<KMS key arn to encrypt trained model in Amazon-owned S3 bucket>" # Optional; leave blank for Amazon managed encryption

# (Optional) To bring your own dataset and LLM judge for evaluation
# evalInput = TrainingInput(
#     s3_data=input_s3_uri,
#     distribution='FullyReplicated',
#     s3_data_type='S3Prefix'
# )

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    instance_count=instance_count,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    output_kms_key=output_kms_key
)
estimator.fit()

# If an input dataset exists, pass it in:
# estimator.fit(inputs={"train": evalInput})
```

## Assessing and analyzing evaluation results
Assessing evaluation results

After your evaluation job completes successfully, you can assess and analyze the results using the following steps.

**To assess and analyze the results, follow these steps.**

1. Understand the output location structure. Results are stored in your specified Amazon S3 output location as a compressed file:

   ```
   s3://your-bucket/output/benchmark-name/
   └── job_name/
       └── output/
           └── output.tar.gz
   ```

1. Download the `output.tar.gz` file from your bucket. Extract the contents to reveal the following structure.

   ```
   run_name/
   ├── eval_results/
   │   ├── results_[timestamp].json
   │   ├── inference_output.jsonl (only present for gen_qa)
   │   └── details/
   │       └── model/
   │           └── <execution-date-time>/
   │               └── details_<task_name>_#_<datetime>.parquet
   └── tensorboard_results/
       └── eval/
           └── events.out.tfevents.[timestamp]
   ```
   + `results_[timestamp].json` - Output metrics JSON file
   + `details_<task_name>_#_<datetime>.parquet` - Inference output file (except for `strong_reject`)
   + `events.out.tfevents.[timestamp]` - TensorBoard output file
   + `inference_output.jsonl` - Cleaned inference output file (only for `gen_qa` tasks)

1. View results in TensorBoard. To visualize your evaluation metrics: 

   1. Upload the extracted folder to an S3 bucket

   1. Navigate to SageMaker AI TensorBoard

   1. Select your "S3 folders"

   1. Add the S3 folder path

   1. Wait for synchronization to complete

1. Analyze inference outputs. All evaluation tasks, except `llm_judge` and `strong_reject`, will have the following fields for analysis in the inference output.
   + `full_prompt` - The full user prompt sent to the model used for the evaluation task.
   + `gold` - The field that contains the correct answer(s) as specified by the dataset.
   + `metrics` - The field that contains the metrics evaluated on the individual inference. Values that require aggregation will not have a value on individual inference outputs.
   + `predictions` - The field that contains a list of the model’s output for the given prompt.
   + `pred_logits` - The field that contains the considered output tokens and log probabilities of each output token returned in the message content.

   By looking at these fields, you can determine the cause for metric differences and understand the behavior of the customized models.

   For `llm_judge`, the inference output file contains the following fields under the metrics field per pair of evaluations.
   + `forward_output` - Judge's raw preferences when evaluating in order (response_A, response_B).
   + `backward_output` - Judge's raw preferences when evaluating in reverse order (response_B, response_A).
   + `Pairwise metrics` - Metrics calculated per pair of forward and backward evaluations, including `a_scores`, `b_scores`, `ties`, `inference-score` and `score`.
**Note**  
Aggregate metrics like `winrate` are only available in the summary results files, not per individual judgement.

   For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
   + `prompt` - The final prompt submitted to the model
   + `inference` - The raw inference output from the model
   + `gold` - The target response from the input dataset
   + `metadata` - The metadata string from the input dataset, if provided
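As an example of working with these fields, the `gen_qa` output can be inspected locally. The path and the exact-match check below are illustrative; the job computes its own metrics, and this sketch is only for quick spot-checking.

```python
import json

def exact_match_rate(jsonl_path):
    """Fraction of gen_qa examples whose raw inference equals the gold answer."""
    total = matches = 0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            # Compare the model's raw output against the target response
            if record["inference"].strip() == record["gold"].strip():
                matches += 1
    return matches / total if total else 0.0

# Example (placeholder path to an extracted inference_output.jsonl):
# rate = exact_match_rate("run_name/eval_results/inference_output.jsonl")
```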

## Evaluation best practices and troubleshooting
Best practices and troubleshooting

### Best practices
Best practices

The following lists some best practices for the evaluation process.
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

### Troubleshooting
Troubleshooting

You can use the CloudWatch log group `/aws/sagemaker/TrainingJobs` to view training job error logs.

#### Engine core failure


**Issue**: 

If you are seeing: 

```
RuntimeError: Engine core initialization failed.
```

**Cause**: 

Although this is a general error with multiple possible causes, it typically occurs when there is a mismatch between the model checkpoint you're trying to load and the specified model type. For example, you want to evaluate a fine-tuned Nova 2.0 Lite model checkpoint, but the model type you provide is a 1.0 model type such as `amazon.nova-micro-v1:0:128k`.

The correct mapping should be:

```
model_type: amazon.nova-2-lite-v1:0:256k
model_name_or_path: nova-lite-2/prod # or s3://escrow_bucket/model_location
```

**Prevention**: 

Double-check that the `model_name_or_path` is mapped to the right `model_type` before submitting the evaluation job.

## Available subtasks
Available subtasks

The following lists available subtasks for model evaluation across multiple domains including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and MMMU (Massive Multi-discipline Multimodal Understanding). These subtasks allow you to assess your model's performance on specific capabilities and knowledge areas.

**MMLU**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**Math**

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```

**MMMU**

```
MMMU_SUBTASKS = [
    "Accounting",
    "Agriculture",
    "Architecture_and_Engineering",
    "Art",
    "Art_Theory",
    "Basic_Medical_Science",
    "Biology",
    "Chemistry",
    "Clinical_Medicine",
    "Computer_Science",
    "Design",
    "Diagnostics_and_Laboratory_Medicine",
    "Economics",
    "Electronics",
    "Energy_and_Power",
    "Finance",
    "Geography",
    "History",
    "Literature",
    "Manage",
    "Marketing",
    "Materials",
    "Math",
    "Mechanical_Engineering",
    "Music",
    "Pharmacy",
    "Physics",
    "Psychology",
    "Public_Health",
    "Sociology",
]
```

Evaluate your customized Nova models using various evaluation methods and metrics.

**Topics**
+ [Prerequisites](#nova-model-evaluation-prerequisites)
+ [Available benchmark tasks](#nova-model-evaluation-benchmark)
+ [Evaluation specific configurations](#nova-model-evaluation-config)
+ [Running evaluation training jobs](#nova-model-evaluation-notebook)
+ [Assessing and analyzing evaluation results](#nova-model-evaluation-assess)
+ [Evaluation best practices and troubleshooting](#nova-model-evaluation-best-practices)
+ [Available subtasks](#nova-model-evaluation-subtasks)
+ [Reasoning model evaluation](nova-reasoning-model-evaluation.md)
+ [RFT evaluation](nova-rft-evaluation.md)
+ [Implementing reward functions](nova-implementing-reward-functions.md)

# Reasoning model evaluation
Reasoning model evaluation

## Overview


Reasoning model support enables evaluation with reasoning-capable Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**
+ amazon.nova-2-lite-v1:0:256k

## Recipe configuration


Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:  
  name: reasoning-eval-job-name                          # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model  
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for SageMaker HyperPod job  
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod job (not compatible with SageMaker Training jobs)  
  
evaluation:  
  task: mmlu                                             # [MODIFIABLE] Evaluation task  
  strategy: zs_cot                                       # [MODIFIABLE] Evaluation strategy  
  metric: accuracy                                       # [MODIFIABLE] Metric calculation method  
  
inference:  
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/high or null to disable  
  max_new_tokens: 32768                                  # [MODIFIABLE] Maximum tokens to generate, recommended value when reasoning_effort set to high  
  top_k: -1                                              # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

## Using the `reasoning_effort` parameter


The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

### Prerequisites

+ **Model compatibility** – Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling** – Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`
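A local pre-flight check mirroring this validation can catch the mismatch before a job is submitted. The function name below is illustrative, and the supported-model set reflects only what is documented here.

```python
# Models documented as reasoning-capable (currently a single entry)
SUPPORTED_REASONING_MODELS = {"amazon.nova-2-lite-v1:0:256k"}

def check_reasoning_config(model_type, reasoning_effort):
    """Raise early if reasoning is enabled for a model that does not support it."""
    if reasoning_effort is not None and model_type not in SUPPORTED_REASONING_MODELS:
        raise ValueError(
            f"Reasoning mode is enabled but model '{model_type}' does not support "
            "reasoning. Use a reasoning-capable model or disable reasoning mode."
        )
```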

### Available options



| Option | Behavior | Token limit | Use case | 
| --- | --- | --- | --- | 
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 


| Training method | Available options | How to configure | 
| --- | --- | --- | 
| SFT (Supervised Fine-Tuning) | High or Off only | Use `reasoning_enabled: true` (high) or `reasoning_enabled: false` (off) | 
| RFT (Reinforcement Fine-Tuning) | Low, High, or Off | Use `reasoning_effort: low` or `reasoning_effort: high`. Omit the field to disable. | 
| Evaluation | Low, High, or Off | Use `reasoning_effort: low` or `reasoning_effort: high`. Use `null` to disable. | 

### When to enable reasoning


**Use reasoning mode (`low` or `high`) for**
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (`null` or omit parameter) for**
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

### Troubleshooting


**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# RFT evaluation
RFT evaluation

## What is RFT evaluation?


RFT Evaluation allows you to assess your model's performance using custom reward functions before, during, or after reinforcement learning training. Unlike standard evaluations that use pre-defined metrics, RFT Evaluation lets you define your own success criteria through a Lambda function that scores model outputs based on your specific requirements.

## Why evaluate with RFT?


Evaluation is crucial to determine whether the RL fine-tuning process has:
+ Improved model alignment with your specific use case and human values
+ Maintained or improved model capabilities on key tasks
+ Avoided unintended side effects such as reduced factuality, increased verbosity, or degraded performance on other tasks
+ Met your custom success criteria as defined by your reward function

## When to use RFT evaluation


Use RFT Evaluation in these scenarios:
+ Before RFT Training: Establish baseline metrics on your evaluation dataset
+ During RFT Training: Monitor training progress with intermediate checkpoints
+ After RFT Training: Validate that the final model meets your requirements
+ Comparing Models: Evaluate multiple model versions using consistent reward criteria

**Note**  
Use RFT Evaluation when you need custom, domain-specific metrics. For general-purpose evaluation (accuracy, perplexity, BLEU), use standard evaluation methods.

## Data format requirements


### Input data structure


RFT evaluation input data must follow the OpenAI Reinforcement Fine-Tuning format. Each example is a JSON object containing:
+ `messages` – Array of conversational turns with `system` and `user` roles
+ `reference_answer` – Expected output or ground truth data used by your reward function for scoring

### Data format example


```
{  
  "messages": [  
    {  
      "role": "user",  
      "content": [  
        {  
          "type": "text",  
          "text": "Solve for x. Return only JSON like {\"x\": <number>}. Equation: 2x + 5 = 13"  
        }  
      ]  
    }  
  ],  
  "reference_answer": {  
    "x": 4  
  }  
}
```

### Current limitations

+ Text only: No multimodal inputs (images, audio, video) are supported
+ Single-turn conversations: Only supports single user message (no multi-turn dialogues)
+ JSON format: Input data must be in JSONL format (one JSON object per line)
+ Model outputs: Evaluation is performed on generated completions from the specified model
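The structure and limitations above can be checked before uploading a dataset. The validator below is a sketch and may be stricter than the service itself; it only enforces what this section documents.

```python
import json

def validate_rft_example(line):
    """Check one JSONL line against the RFT evaluation input format.

    Returns a list of problems; an empty list means the example passes.
    """
    problems = []
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]
    if "reference_answer" not in example:
        problems.append("missing 'reference_answer'")
    messages = example.get("messages")
    if not isinstance(messages, list):
        problems.append("'messages' must be an array")
    else:
        # Single-turn only: exactly one user message (a system message may accompany it)
        user_messages = [m for m in messages if m.get("role") == "user"]
        if len(user_messages) != 1:
            problems.append("expected exactly one user message")
    return problems
```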

## Preparing your evaluation recipe


### Sample notebook


For a complete example, see [Evaluation notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook).

### Sample recipe configuration


```
run:  
  name: nova-lite-rft-eval-job    
  model_type: amazon.nova-lite-v1:0:300k    
  model_name_or_path: s3://escrow_bucket/model_location # [MODIFIABLE] S3 path to your model or model identifier  
  replicas: 1 # [MODIFIABLE] For SageMaker Training jobs only; fixed for HyperPod jobs  
  data_s3_path: "" # [REQUIRED FOR HYPERPOD] Leave empty for SageMaker Training jobs and use TrainingInput in sagemaker python SDK  
  output_s3_path: "" # [REQUIRED] Output artifact S3 path for evaluation results  
  
evaluation:  
  task: rft_eval # [FIXED] Do not modify  
  strategy: rft_eval # [FIXED] Do not modify  
  metric: all # [FIXED] Do not modify  
  
# Inference Configuration  
inference:  
  max_new_tokens: 8192 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1 # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0 # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0 # [MODIFIABLE] Sampling temperature (0 = deterministic)  
  top_logprobs: 0 # [MODIFIABLE] Set between 1-20 to enable logprobs output  
  
# =============================================================================  
# Bring Your Own Reinforcement Learning Environment  
# =============================================================================  
rl_env:  
  reward_lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<reward-function-name>
```

## Preset reward functions


Two preset reward functions (`prime_code` and `prime_math`) are available as a Lambda layer for easy integration with your RFT Lambda functions.

### Overview


These preset functions provide out-of-the-box evaluation capabilities for:
+ `prime_code` – Code generation and correctness evaluation
+ `prime_math` – Mathematical reasoning and problem-solving evaluation

### Quick setup


1. Download the Lambda layer from the [nova-custom-eval-sdk releases](https://github.com/aws/nova-custom-eval-sdk/releases).

1. Publish Lambda layer using AWS Command Line Interface (AWS CLI):

   ```
   aws lambda publish-layer-version \
       --layer-name preset-function-layer \
       --description "Preset reward function layer with dependencies" \
       --zip-file fileb://universal_reward_layer.zip \
       --compatible-runtimes python3.9 python3.10 python3.11 python3.12 \
       --compatible-architectures x86_64 arm64
   ```

1. Add the layer to your Lambda function in AWS Management Console (Select the preset-function-layer from custom layer and also add AWSSDKPandas-Python312 for numpy dependencies).

1. Import and use in your Lambda code:

   ```
   from prime_code import compute_score  # For code evaluation
   from prime_math import compute_score  # For math evaluation
   ```

### `prime_code` function


Evaluates Python code generation tasks by executing code against test cases and measuring correctness.

**Example input dataset format**

```
{"messages":[{"role":"user","content":"Write a function that returns the sum of two numbers."}],"reference_answer":{"inputs":["3\n5","10\n-2","0\n0"],"outputs":["8","8","0"]}}
{"messages":[{"role":"user","content":"Write a function to check if a number is even."}],"reference_answer":{"inputs":["4","7","0","-2"],"outputs":["True","False","True","True"]}}
```

**Key features**
+ Automatic code extraction from markdown code blocks
+ Function detection and call-based testing
+ Test case execution with timeout protection
+ Syntax validation and compilation checks
+ Detailed error reporting with tracebacks

### `prime_math` function


Evaluates mathematical reasoning and problem-solving capabilities with symbolic math support.

**Input format**

```
{"messages":[{"role":"user","content":"What is the derivative of x^2 + 3x?"}],"reference_answer":"2*x + 3"}
```

**Key features**
+ Symbolic math evaluation using SymPy
+ Multiple answer formats (LaTeX, plain text, symbolic)
+ Mathematical equivalence checking
+ Expression normalization and simplification

### Data format requirements


**For code evaluation**
+ Inputs: Array of function arguments (proper types: integers, strings, etc.)
+ Outputs: Array of expected return values (proper types: booleans, numbers, etc.)
+ Code: Must be in Python with clear function definitions

**For math evaluation**
+ Reference answer: Mathematical expression or numeric value
+ Response: Can be LaTeX, plain text, or symbolic notation
+ Equivalence: Checked symbolically, not just string matching

### Best practices

+ Use proper data types in test cases (integers vs strings, booleans vs "True")
+ Provide clear function signatures in code problems
+ Include edge cases in test inputs (zero, negative numbers, empty inputs)
+ Format math expressions consistently in reference answers
+ Test your reward function with sample data before deployment

### Error handling


Both functions include robust error handling for:
+ Compilation errors in generated code
+ Runtime exceptions during execution
+ Malformed input data
+ Timeout scenarios for infinite loops
+ Invalid mathematical expressions

## Creating your reward function


### Lambda ARN requirements


Your Lambda ARN must follow this format:

```
"arn:aws:lambda:*:*:function:*SageMaker*"
```

If the Lambda does not have this naming scheme, the job will fail with this error:

```
[ERROR] Unexpected error: lambda_arn must contain one of: ['SageMaker', 'sagemaker', 'Sagemaker'] when running on SMHP platform (Key: lambda_arn)
```

### Lambda request format


Your Lambda function receives data in this format:

```
[  
  {  
    "id": "sample-001",  
    "messages": [  
      {  
        "role": "user",  
        "content": [  
          {  
            "type": "text",  
            "text": "Do you have a dedicated security team?"  
          }  
        ]  
      },  
      {  
        "role": "nova_assistant",  
        "content": [  
          {  
            "type": "text",  
            "text": "As an AI developed by Company, I don't have a dedicated security team..."  
          }  
        ]  
      }  
    ],  
    "reference_answer": {  
      "compliant": "No",  
      "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
  }  
]
```

**Note**  
The message structure includes the nested `content` array, matching the input data format. The last message with role `nova_assistant` contains the model's generated response.

### Lambda response format


Your Lambda function must return data in this format:

```
[  
  {  
    "id": "sample-001",  
    "aggregate_reward_score": 0.75,  
    "metrics_list": [  
      {  
        "name": "accuracy",  
        "value": 0.85,  
        "type": "Metric"  
      },  
      {  
        "name": "fluency",  
        "value": 0.90,  
        "type": "Reward"  
      }  
    ]  
  }  
]
```

**Response fields**
+ `id` – Must match the input sample ID
+ `aggregate_reward_score` – Overall score (typically 0.0 to 1.0)
+ `metrics_list` – Array of individual metrics with:
  + `name` – Metric identifier (e.g., "accuracy", "fluency")
  + `value` – Metric score (typically 0.0 to 1.0)
  + `type` – Either "Metric" (for reporting) or "Reward" (used in training)
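A minimal handler tying the request and response formats together might look like the following, assuming the event arrives as the JSON array shown above. The scoring logic is a placeholder to replace with your own comparison against `reference_answer`.

```python
def lambda_handler(event, context):
    """Score each sample in the batch and echo its id back in the expected shape."""
    results = []
    for sample in event:
        # The last message with role "nova_assistant" holds the model's response
        response_text = ""
        for message in sample["messages"]:
            if message["role"] == "nova_assistant":
                response_text = message["content"][0]["text"]
        # Placeholder scoring: reward any non-empty response.
        # Replace with domain-specific logic using sample["reference_answer"].
        score = 1.0 if response_text.strip() else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "non_empty", "value": score, "type": "Reward"},
            ],
        })
    return results
```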

## IAM permissions


### Required permissions


Your SageMaker execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker execution role:

```
{  
  "Version": "2012-10-17",
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "lambda:InvokeFunction"  
      ],  
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"  
    }  
  ]  
}
```

### Lambda execution role


Your Lambda function's execution role needs basic Lambda execution permissions:

```
{  
  "Version": "2012-10-17",
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "logs:CreateLogGroup",  
        "logs:CreateLogStream",  
        "logs:PutLogEvents"  
      ],  
      "Resource": "arn:aws:logs:*:*:*"  
    }  
  ]  
}
```

If your Lambda function accesses other AWS services (e.g., S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Executing the evaluation job


1. **Prepare your data** – Format your evaluation data according to the data format requirements and upload your JSONL file to S3: `s3://your-bucket/eval-data/eval_data.jsonl`

1. **Configure your recipe** – Update the sample recipe with your configuration:
   + Set `model_name_or_path` to your model location
   + Set `reward_lambda_arn` to your reward function ARN
   + Set `output_s3_path` to your desired output location
   + Adjust `inference` parameters as needed

   Save the recipe as `rft_eval_recipe.yaml`

1. **Run the evaluation** – Execute the evaluation job using the provided notebook: [Evaluation notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook)

1. **Monitor progress** – Monitor your evaluation job through:
   + SageMaker Console: Check job status and logs
   + CloudWatch Logs: View detailed execution logs
   + Lambda Logs: Debug reward function issues

## Understanding evaluation results


### Output format


The evaluation job outputs results to your specified S3 location in JSONL format. Each line contains the evaluation results for one sample:

```
{  
  "id": "sample-001",  
  "aggregate_reward_score": 0.75,  
  "metrics_list": [  
    {  
      "name": "accuracy",  
      "value": 0.85,  
      "type": "Metric"  
    },  
    {  
      "name": "fluency",  
      "value": 0.90,  
      "type": "Reward"  
    }  
  ]  
}
```

**Note**  
The RFT evaluation job output is identical to the Lambda response format. The evaluation service passes through your Lambda function's response without modification, ensuring consistency between your reward calculations and the final results.

### Interpreting results


**Aggregate reward score**
+ Range: Typically 0.0 (worst) to 1.0 (best), but depends on your implementation
+ Purpose: Single number summarizing overall performance
+ Usage: Compare models, track improvement over training

**Individual metrics**
+ Metric Type: Informational metrics for analysis
+ Reward Type: Metrics used during RFT training
+ Interpretation: Higher values generally indicate better performance (unless you design inverse metrics)
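The per-sample output lines can be rolled up into summary numbers with a short script; the file path below is a placeholder.

```python
import json
from collections import defaultdict

def summarize_results(jsonl_path):
    """Compute the mean aggregate score and per-metric means from an output file."""
    scores = []
    metric_values = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            scores.append(record["aggregate_reward_score"])
            for metric in record["metrics_list"]:
                metric_values[metric["name"]].append(metric["value"])
    return {
        "mean_aggregate": sum(scores) / len(scores),
        "metric_means": {name: sum(vals) / len(vals) for name, vals in metric_values.items()},
    }

# Example (placeholder path):
# print(summarize_results("rft_eval_results.jsonl"))
```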

### Performance benchmarks


What constitutes "good" performance depends on your use case:


| Score range | Interpretation | Action | 
| --- | --- | --- | 
| 0.8 - 1.0 | Excellent | Model ready for deployment | 
| 0.6 - 0.8 | Good | Minor improvements may be beneficial | 
| 0.4 - 0.6 | Fair | Significant improvement needed | 
| 0.0 - 0.4 | Poor | Review training data and reward function | 

**Important**  
These are general guidelines. Define your own thresholds based on business requirements, baseline model performance, domain-specific constraints, and cost-benefit analysis of further training.

## Troubleshooting


### Common issues



| Issue | Cause | Solution | 
| --- | --- | --- | 
| Lambda timeout | Complex reward calculation | Increase Lambda timeout or optimize function | 
| Permission denied | Missing IAM permissions | Verify SageMaker role can invoke Lambda | 
| Inconsistent scores | Non-deterministic reward function | Use fixed seeds or deterministic logic | 
| Missing results | Lambda errors not caught | Add comprehensive error handling in Lambda | 

### Debug checklist

+ Verify input data follows the correct format with nested content arrays
+ Confirm Lambda ARN is correct and function is deployed
+ Check IAM permissions for SageMaker → Lambda invocation
+ Review CloudWatch logs for Lambda errors
+ Validate Lambda response matches expected format
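To support the last checklist item, a small validator like the following can catch format problems before a full run. This is a minimal sketch based on the output contract described in this guide:

```python
# Minimal sketch: validate one Lambda result entry against the expected format.
def validate_result(result: dict) -> list[str]:
    """Return a list of problems found; an empty list means the entry is valid."""
    problems = []
    if not isinstance(result.get("id"), str):
        problems.append("id must be a string")
    if not isinstance(result.get("aggregate_reward_score"), (int, float)):
        problems.append("aggregate_reward_score must be a number")
    for metric in result.get("metrics_list", []):
        if not isinstance(metric.get("name"), str):
            problems.append("metric name must be a string")
        if not isinstance(metric.get("value"), (int, float)):
            problems.append("metric value must be a number")
        if metric.get("type") not in ("Reward", "Metric"):
            problems.append("metric type must be 'Reward' or 'Metric'")
    return problems

good = {"id": "sample-001", "aggregate_reward_score": 0.75,
        "metrics_list": [{"name": "accuracy", "value": 0.85, "type": "Metric"}]}
bad = {"id": 123, "aggregate_reward_score": "high"}
print(validate_result(good))  # []
print(validate_result(bad))   # two problems reported
```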

## Best practices

+ Start Simple: Begin with basic reward functions and iterate
+ Test Lambda Separately: Use Lambda test events before full evaluation
+ Validate on Small Dataset: Run evaluation on subset before full dataset
+ Version Control: Track reward function versions alongside model versions
+ Monitor Costs: Lambda invocations and compute time affect costs
+ Log Extensively: Use print statements in Lambda for debugging
+ Set Timeouts Appropriately: Balance between patience and cost
+ Document Metrics: Clearly define what each metric measures

## Next steps


After completing RFT evaluation:
+ If results are satisfactory: Deploy model to production
+ If improvement needed:
  + Adjust reward function
  + Collect more training data
  + Modify training hyperparameters
  + Run additional RFT training iterations
+ Continuous monitoring: Re-evaluate periodically with new data

# Implementing reward functions
Implementing reward functions

## Overview


The reward function (also called a scorer or grader) is the core component that evaluates model responses and provides feedback signals for training. It must be implemented as a Lambda function that accepts model responses and returns reward scores.

## Interface format


Your reward function must accept and return data in the following format:

**Sample training input**

```
{  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        }  
    ],              
   "reference_answer": {  
       "compliant": "No",  
       "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
}
```

**Sample payload for the reward lambda**

The container automatically transforms your data before sending it to your Lambda function by:

1. Generating a model response for each prompt

1. Appending the assistant turn (generated response) to the messages array

1. Adding a unique `id` field for tracking

Your Lambda function will receive data in this transformed format:

```
{    
   "id": "123",  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        },  
        {  
            "role": "assistant",  
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."  
        }  
    ],              
    # Following section will be same as your training dataset sample  
    "reference_answer": {  
        "compliant": "No",  
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    }  
}
```
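The transformation steps above can be sketched in Python. This is a simplified illustration of the container's behavior, not the actual service code; the generated response is passed in here as a plain string:

```python
# Sketch of the container's transformation: append the assistant turn, add an id.
import copy
import uuid

def transform_sample(training_sample: dict, generated_response: str) -> dict:
    """Build the Lambda payload from a training sample and a generated response."""
    payload = copy.deepcopy(training_sample)  # leave the original sample untouched
    payload["messages"].append({"role": "assistant", "content": generated_response})
    payload["id"] = str(uuid.uuid4())  # unique id for tracking
    return payload

sample = {
    "messages": [{"role": "user", "content": "Do you have a dedicated security team?"}],
    "reference_answer": {"compliant": "No", "explanation": "..."},
}
payload = transform_sample(sample, "As an AI, I do not have a dedicated security team...")
print(payload["messages"][-1]["role"])  # assistant
```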

**Reward Lambda contract**

```
def lambda_handler(event, context):  
   return lambda_grader(event)  
  
def lambda_grader(samples: list[dict]) -> list[dict]:  
    """  
    Args:  
        samples: List of dictionaries in OpenAI format  
          
        Example input:  
        {     
            "id": "123",  
            "messages": [  
                {  
                    "role": "user",  
                    "content": "Do you have a dedicated security team?"  
                },  
                {  
                    "role": "assistant",  
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."  
                }  
            ],              
            # This section will be same as your training dataset  
            "reference_answer": {  
                "compliant": "No",  
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
            }  
        }  
      
    Returns:  
        List of dictionaries with reward scores:  
        {  
            "id": str,                              # Same id as input sample  
            "aggregate_reward_score": float,        # Overall score for the sample  
            "metrics_list": [                       # OPTIONAL: Component scores  
                {  
                    "name": str,                    # Name of the component score  
                    "value": float,                 # Value of the component score  
                    "type": str                     # "Reward" or "Metric"  
                }  
            ]  
        }  
    """
```
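As an illustration of this contract, the following toy grader scores each response by checking whether it contains the reference answer's `compliant` verdict. This is a minimal sketch; the matching criterion is a placeholder for your own task-specific logic:

```python
# Toy reward function implementing the contract above.
def lambda_grader(samples: list[dict]) -> list[dict]:
    """Score each sample: 1.0 if the response mentions the expected verdict, else 0.0."""
    results = []
    for sample in samples:
        # Extract the assistant turn appended by the container.
        response = next(
            (m["content"] for m in sample["messages"] if m["role"] == "assistant"), ""
        )
        expected = sample.get("reference_answer", {}).get("compliant", "")
        # Placeholder criterion: does the response contain the expected verdict?
        score = 1.0 if expected and expected.lower() in response.lower() else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "compliance_match", "value": score, "type": "Reward"}
            ],
        })
    return results

def lambda_handler(event, context):
    return lambda_grader(event if isinstance(event, list) else [event])

sample = {
    "id": "123",
    "messages": [
        {"role": "user", "content": "Do you have a dedicated security team?"},
        {"role": "assistant", "content": "No, I do not have a dedicated security team."},
    ],
    "reference_answer": {"compliant": "No"},
}
print(lambda_handler(sample, None))
```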

## Input and output fields


### Input fields



| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Unique identifier for the sample | Echoed back in output. String format | 
| messages | Ordered chat history in OpenAI format | Array of message objects | 
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" | 
| messages[].content | Text content of the message | Plain string | 
| reference_answer / metadata | Free-form information to aid grading | Object; optional fields passed through from training data | 

### Output fields



| Field | Description | Additional notes | 
| --- | --- | --- | 
| id | Same identifier as input sample | Must match input | 
| aggregate_reward_score | Overall score for the sample | Float (e.g., 0.0–1.0 or task-defined range) | 
| metrics_list | Component scores that make up the aggregate | Array of metric objects | 

## Technical constraints

+ **Timeout limit** – 15 minutes maximum execution time per Lambda invocation
+ **Concurrency** – Must handle `rollout_worker_replicas * 64` concurrent requests
+ **Reliability** – Must implement proper error handling and return valid scores consistently
+ **Performance** – Optimize for fast execution (seconds, not minutes) to enable efficient training
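For example, the concurrency requirement above can be sized with simple arithmetic (the replica count here is an assumed example value from your recipe):

```python
# Estimate the concurrent invocations your reward Lambda must handle.
rollout_worker_replicas = 8  # example value; read this from your recipe config
required_concurrency = rollout_worker_replicas * 64
print(required_concurrency)  # 512
```

Compare this figure against your account's Lambda concurrency limits before starting a training run.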

**Best practices**
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement retry logic for transient failures
+ Cache reusable computations
+ Test thoroughly before training to ensure bug-free execution

## Using custom reward functions


Implement custom reward functions when you have task-specific evaluation criteria:
+ **Define evaluation criteria** – Determine what makes a good response for your task
+ **Implement Lambda function** – Create a Lambda function following the interface format
+ **Test locally** – Validate your function returns correct scores for sample inputs
+ **Deploy to AWS** – Deploy your Lambda and note the ARN
+ **Configure recipe** – Add the Lambda ARN to your recipe's `reward_lambda_arn` field
+ **Test with small dataset** – Run RFT with minimal data to verify integration

## IAM permissions


### Required permissions


Your SageMaker execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker execution role:

```
{  
  "Version": "2012-10-17",  
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "lambda:InvokeFunction"  
      ],  
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"  
    }  
  ]  
}
```

### Lambda execution role


Your Lambda function's execution role needs basic Lambda execution permissions:

```
{  
  "Version": "2012-10-17",  
  "Statement": [  
    {  
      "Effect": "Allow",  
      "Action": [  
        "logs:CreateLogGroup",  
        "logs:CreateLogStream",  
        "logs:PutLogEvents"  
      ],  
      "Resource": "arn:aws:logs:*:*:*"  
    }  
  ]  
}
```

Additional permissions: If your Lambda function accesses other AWS services (for example, S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Example: LLM As a Judge reward function


This example demonstrates using Amazon Bedrock models as judges to evaluate model responses by comparing them against reference answers. The Lambda template provides a framework for calling Amazon Bedrock to perform judge evaluations, and it maintains the same input/output contract as other reward functions.

### Implementation


This Lambda function implements a two-stage evaluation process: the `lambda_handler` extracts model responses and reference answers from incoming samples, then the `lambda_graded` function calls Amazon Bedrock to score the semantic similarity between them. The implementation includes robust error handling with automatic retries for transient failures and supports flexible reference answer formats (both string and structured dictionary formats).

**Implementation details:**
+ **Retry Logic**: Implements exponential backoff (1s, 2s, 4s) for throttling exceptions to handle Bedrock API rate limits
+ **Error Handling**: Returns score of 0.0 for failed evaluations rather than raising exceptions
+ **Deterministic Scoring**: Uses temperature=0.0 to ensure consistent scores across evaluations
+ **Flexible Reference Format**: Automatically handles both string and dictionary reference answers
+ **Score Clamping**: Ensures all scores fall within valid [0.0, 1.0] range
+ **Model Agnostic**: Change `JUDGE_MODEL_ID` to use any Amazon Bedrock model (Nova, Llama, Mistral, etc.)

```
"""  
LLM Judge Lambda POC - Working implementation using Amazon Bedrock  
"""  
  
import json  
import time  
import boto3  
  
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')  
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  
SYSTEM_PROMPT = "You must output ONLY a number between 0.0 and 1.0. No explanations, no text, just the number."  
  
JUDGE_PROMPT_TEMPLATE = """Compare the following two responses and rate how similar they are on a scale of 0.0 to 1.0, where:  
- 1.0 means the responses are semantically equivalent (same meaning, even if worded differently)  
- 0.5 means the responses are partially similar  
- 0.0 means the responses are completely different or contradictory  
  
Response A: {response_a}  
  
Response B: {response_b}  
  
Output ONLY a number between 0.0 and 1.0. No explanations."""  
  
  
def lambda_graded(response_a: str, response_b: str, max_retries: int = 3) -> float:  
    """Call Bedrock to compare responses and return similarity score."""  
    prompt = JUDGE_PROMPT_TEMPLATE.format(response_a=response_a, response_b=response_b)  
      
    for attempt in range(max_retries):  
        try:  
            response = bedrock_runtime.converse(  
                modelId=JUDGE_MODEL_ID,  
                messages=[{"role": "user", "content": [{"text": prompt}]}],  
                system=[{"text": SYSTEM_PROMPT}],  
                inferenceConfig={"temperature": 0.0, "maxTokens": 10}  
            )  
            print(f"Bedrock call successful: {response}")  
            output = response['output']['message']['content'][0]['text'].strip()  
            score = float(output)  
            print(f"Score parsed: {score}")  
            return max(0.0, min(1.0, score))  
                  
        except Exception as e:  
            if "ThrottlingException" in str(e) and attempt < max_retries - 1:  
                time.sleep(2 ** attempt)  
            else:  
                print(f"Bedrock call failed: {e}")  
                return 0.0  # fall back to a zero score rather than raising  
    return 0.0  # all retries exhausted  
  
  
def lambda_handler(event, context):  
    """AWS Lambda handler - processes samples from RFTEvalInvoker."""  
    try:  
        samples = event if isinstance(event, list) else [event]  
        results = []  
          
        for sample in samples:  
            sample_id = sample.get("id", "unknown")  
            messages = sample.get("messages", [])  
              
            # Extract assistant response (response A)  
            response_a = ""  
            for msg in messages:  
                if msg.get("role") in ["assistant", "nova_assistant"]:  
                    response_a = msg.get("content", "")  
                    break  
              
            # Extract reference answer from root level (no longer in metadata)  
            reference_answer = sample.get("reference_answer", "")  
              
            # Handle both string and dict reference_answer formats  
            if isinstance(reference_answer, dict):  
                # If reference_answer is a dict, extract the explanation or compliant field  
                response_b = reference_answer.get("explanation", reference_answer.get("compliant", ""))  
            else:  
                response_b = reference_answer  
              
            if not response_a or not response_b:  
                results.append({  
                    "id": sample_id,  
                    "aggregate_reward_score": 0.0,  
                    "metrics_list": [{"name": "similarity_score", "value": 0.0, "type": "Metric"}]  
                })  
                continue  
              
            # Get similarity score  
            score = lambda_graded(response_a, response_b)  
              
            results.append({  
                "id": sample_id,  
                "aggregate_reward_score": score,  
                "metrics_list": [  
                    {  
                        "name": "similarity_score",  
                        "value": score,  
                        "type": "Metric"  
                    }  
                ]  
            })  
          
        return {"statusCode": 200, "body": json.dumps(results)}  
          
    except Exception as e:  
        print(f"Error: {e}")  
        return {"statusCode": 500, "body": json.dumps({"error": str(e)})}
```

### Input format


The Lambda receives the same input format as other reward functions:

```
{  
    "id": "sample-001",  
    "messages": [  
        {  
            "role": "user",  
            "content": "Do you have a dedicated security team?"  
        },  
        {  
            "role": "assistant",  
            "content": "As an AI developed by Amazon, I don't have a dedicated security team..."  
        }  
    ],  
    "reference_answer": {  
        "compliant": "No",  
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."  
    },  
    "my_custom_field": "custom_value"  
}
```

### Output format


```
{  
    "id": "sample-001",  
    "aggregate_reward_score": 0.85,  
    "metrics_list": [  
        {  
            "name": "similarity_score",  
            "value": 0.85,  
            "type": "Metric"  
        }  
    ]  
}
```

### Deployment considerations


When deploying this function, consider the following points. You may also need to adjust the prompt template and inference parameters based on your chosen model's capabilities and API format.
+ **IAM Permissions**: Lambda execution role must have `bedrock:InvokeModel` permission for your chosen model
+ **Timeout**: Set Lambda timeout to at least 60 seconds to accommodate Bedrock API latency and retries
+ **Region**: Deploy in a region where your chosen Bedrock model is available
+ **Cost**: Monitor Bedrock API usage as each evaluation makes one API call per sample
+ **Throughput**: For large-scale evaluations, request increased Bedrock quotas to avoid throttling

**Increasing Bedrock Throughput**

If you experience throttling during evaluation, increase your Bedrock model quotas:
1. Navigate to the AWS Service Quotas console

1. Search for "Bedrock" and select your region

1. Find the quota for your chosen model (for example, "Invocations per minute for Claude 3.5 Sonnet")

1. Choose "Request quota increase" and specify your desired throughput

1. Provide justification for the increase (for example, "RFT evaluation workload")

The Lambda's built-in retry logic handles occasional throttling, but sustained high-volume evaluations require appropriate quota increases.

**Required IAM Policy:**

```
{  
    "Version": "2012-10-17",  
    "Statement": [  
        {  
            "Effect": "Allow",  
            "Action": [  
                "bedrock:InvokeModel"  
            ],  
            "Resource": "arn:aws:bedrock:*::foundation-model/*"  
        }  
    ]  
}
```

# Iterative training
Iterative training

Iterative training is a systematic approach to fine-tuning models through multiple training cycles, where each round builds on the previous checkpoint by addressing specific weaknesses discovered through evaluation. This method enables targeted improvements: you incorporate curated examples that address failure modes, adapt to changing requirements, and validate enhancements incrementally rather than committing to a single long training run. The process typically follows patterns such as Supervised Fine-Tuning (SFT) followed by Reinforcement Fine-Tuning (RFT). Checkpoints are stored in AWS-managed escrow S3 buckets and can be referenced for subsequent training iterations, while the model type and training technique remain consistent throughout the pipeline.

For more details, refer to [Iterative Training](nova-iterative-training.md).

# Amazon Nova customization on SageMaker HyperPod
On SageMaker HyperPod

You can customize Amazon Nova models, including the enhanced Amazon Nova 2.0 models, using [Amazon Nova recipes](nova-model-recipes.md) and train them on SageMaker HyperPod. A recipe is a YAML configuration file that tells SageMaker AI how to run your model customization job. SageMaker HyperPod supports two types of services: Forge and non-Forge.

SageMaker HyperPod offers high-performance computing with optimized GPU instances and Amazon FSx for Lustre storage, robust monitoring through integration with tools like TensorBoard, flexible checkpoint management for iterative improvement, seamless deployment to Amazon Bedrock for inference, and efficient, scalable multi-node distributed training. Together, these capabilities provide a secure, performant, and flexible environment for tailoring Amazon Nova models to your specific business requirements.

Amazon Nova customization on SageMaker HyperPod stores model artifacts including model checkpoints in a service-managed Amazon S3 bucket. Artifacts in the service-managed bucket are encrypted with SageMaker AI-managed AWS KMS keys. Service-managed Amazon S3 buckets don't currently support data encryption using customer-managed KMS keys. You can use this checkpoint location for evaluation jobs or Amazon Bedrock inference.

Standard pricing applies for compute instances, Amazon S3 storage, and FSx for Lustre. For pricing details, see [SageMaker HyperPod pricing](https://aws.amazon.com/sagemaker-ai/pricing/), [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/), and [FSx for Lustre pricing](https://aws.amazon.com/fsx/lustre/pricing/).

## Compute requirements for Amazon Nova 2 models


The following table summarizes the compute requirements for training Amazon Nova 2 models with SageMaker HyperPod and SageMaker AI training jobs.


**Nova 2 Training Requirements**  

| Training Technique | Minimum Instances | Instance Type | GPU Count | Notes | Supported Models | 
| --- |--- |--- |--- |--- |--- |
| SFT (LoRA) | 4 | P5.48xlarge | 16 | Parameter-efficient fine-tuning | Nova 2 Lite | 
| SFT (Full Rank) | 4 | P5.48xlarge | 32 | Full model fine-tuning | Nova 2 Lite | 
| RFT on SageMaker Training Jobs (LoRA) | 2 | P5.48xlarge | 16 | Custom Reward Functions in your AWS Environment | Nova 2 Lite | 
| RFT on SageMaker Training Jobs (Full Rank) | 4 | P5.48xlarge | 32 | 32K context length | Nova 2 Lite | 
| RFT on SageMaker HyperPod | 8 | P5.48xlarge | 64 | Default 8192 context length | Nova 2 Lite | 
| CPT | 4 | P5.48xlarge | 16 | Processes approximately 400M tokens per instance per day | Nova 2 Lite | 
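Using the CPT throughput figure in the table (approximately 400M tokens per instance per day), you can sketch a rough runtime estimate for a corpus. The corpus size below is an assumed example value; the instance count matches the table's CPT minimum:

```python
# Rough CPT duration estimate based on ~400M tokens per instance per day.
tokens_per_instance_per_day = 400e6
corpus_tokens = 8e9   # example: an 8B-token corpus
instances = 4         # minimum CPT instance count from the table
days = corpus_tokens / (tokens_per_instance_per_day * instances)
print(days)  # 5.0
```

Treat this as a back-of-the-envelope estimate only; actual throughput varies with sequence length, checkpointing frequency, and cluster health.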

To optimize your Amazon Nova model customization workflows on SageMaker HyperPod, follow these best practices for efficient training, resource management, and successful model deployment.

## Best Practices for Amazon Nova customization


### Overview


This section provides an overview of customization techniques and helps you choose the best approach for your needs and available data.

#### Two stages of LLM training


Large language model training consists of two major stages: pre-training and post-training. During pre-training, the model processes tokens of raw text and optimizes for next-token prediction. This process creates a pattern completer that absorbs syntax, semantics, facts, and reasoning patterns from web and curated text. However, the pre-trained model doesn't understand instructions, user goals, or context-appropriate behavior. It continues text in whatever style fits its training distribution. A pre-trained model autocompletes rather than follows directions, produces inconsistent formatting, and can mirror undesirable biases or unsafe content from the training data. Pre-training builds general competence, not task usefulness.

Post-training transforms the pattern completer into a useful assistant. You run multiple rounds of Supervised Fine-Tuning (SFT) to teach the model to follow instructions, adhere to schemas and policies, call tools, and produce reliable outputs by imitating high-quality demonstrations. This alignment teaches the model to respond to prompts as tasks rather than text to continue. You then apply Reinforcement Fine-Tuning (RFT) to optimize behavior using measurable feedback (such as verifiers or an LLM-as-a-judge), balancing trade-offs like accuracy versus brevity, safety versus coverage, or multi-step reasoning under constraints. In practice, you alternate SFT and RFT in cycles to shape the pre-trained model into a reliable, policy-aligned system that performs complex tasks consistently.

### Choose the right customization approach


This section covers the post-training customization strategies RFT and SFT.

#### Reinforcement fine-tuning (RFT)


Reinforcement fine-tuning improves model performance through feedback signals—measurable scores or rewards that indicate response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards. This approach works well for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications that require nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that you can programmatically evaluate. For example, answering complex legal questions is an ideal use case for RFT because you want to teach the model how to reason better to answer questions more accurately.

##### How it works


In reinforcement fine-tuning, you start from an instruction-tuned baseline and treat each prompt like a small tournament. For a given input, you sample a handful of candidate answers from the model, score each one with the reward function, then rank them within that group. The update step nudges the model to make higher-scoring candidates more likely next time and lower-scoring ones less likely, while a stay-close-to-baseline constraint keeps behavior from drifting or becoming verbose or exploitative. You repeat this loop over many prompts, refreshing hard cases, tightening verifiers or judge rubrics when you see exploits, and continuously tracking task metrics.
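The per-prompt tournament described above can be illustrated with a group-relative advantage computation, in the style of GRPO-like methods. This is a simplified sketch of the ranking step, not the service's actual update rule; the reward values are toy inputs:

```python
# Sketch: rank candidates within one prompt's group by normalized reward.
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's candidate group.
    A positive advantage nudges the model toward that candidate; negative, away."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Four sampled candidates for one prompt, scored by the reward function:
advantages = group_advantages([0.2, 0.8, 0.5, 0.5])
print(advantages)
```

Candidates scoring above the group mean receive positive advantages; the stay-close-to-baseline constraint mentioned above is applied separately during the policy update.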

##### When to use RFT


Tasks that benefit most from RFT share several traits. They have measurable success signals even when a single correct output is hard to specify. They admit partial credit or graded quality so you can rank better versus worse answers within a prompt or using a reward function. They involve multiple objectives that must be balanced (such as accuracy with brevity, clarity, safety, or cost). They require adherence to explicit constraints that you can programmatically check. They operate in tool-mediated or environment-based settings where outcomes are observable (success or failure, latency, resource use). They occur in low-label regimes where collecting gold targets is expensive but automated or rubric-based feedback is plentiful. RFT works best when you can turn quality into a reliable scalar or ranking and want the model to preferentially amplify higher-scoring behaviors without needing exhaustive labeled targets.

**Consider other methods when:**
+ You have plentiful, reliable labeled input-output pairs – Use SFT
+ The main gap is knowledge or jargon – Use retrieval-augmented generation (RAG)
+ Your reward signal is noisy or unreliable and you can't fix it with better rubrics or checkers – Stabilize that first before RFT

##### When not to use RFT


Avoid RFT in these situations:
+ You can cheaply produce reliable labeled input-output pairs (SFT is simpler, cheaper, and more stable)
+ The gap is knowledge or jargon rather than behavior (use RAG)
+ Your reward signal is noisy, sparse, easy to game, or expensive or slow to compute (fix the evaluator first)
+ Baseline performance is near-zero (bootstrap with SFT before optimizing preferences)
+ The task has deterministic schemas, strict formatting, or a single correct answer (SFT or rule-based validation works better)
+ Tight latency or cost budgets can't absorb the extra sampling or exploration RFT requires
+ Safety or policy constraints aren't crisply specified and enforceable in the reward

If you can point to "the right answer," use SFT. If you need new knowledge, use RAG. Use RFT only after you have a solid baseline and a robust, fast, hard-to-exploit reward function.

#### Supervised fine-tuning (SFT)


Supervised fine-tuning trains the LLM on a dataset of human-labeled input-output pairs for your task. You provide examples of prompts (questions, instructions, and so on) with the correct or desired responses, and continue training the model on these examples. The model adjusts its weights to minimize a supervised loss (typically cross-entropy between its predictions and the target output tokens). This is the same training used in most supervised machine learning tasks, applied to specialize an LLM.

SFT changes behavior, not knowledge. It doesn't teach the model new facts or jargon it didn't see in pre-training. It teaches the model how to answer, not what to know. If you need new domain knowledge (such as internal terminology), use retrieval-augmented generation (RAG) to provide that context at inference time. SFT then adds the desired instruction-following behavior on top.

##### How it works


SFT optimizes the LLM by minimizing the average cross-entropy loss on response tokens, treating prompt tokens as context and masking them from the loss. The model internalizes your target style, structure, and decision rules, learning to generate the correct completion for each prompt. For example, to classify documents into custom categories, you fine-tune the model with prompts (the document text) and labeled completions (the category labels). You train on those pairs until the model outputs the right label for each prompt with high probability.
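The masked loss described above can be sketched in plain Python. The token probabilities here are assumed toy values; in practice they come from the model's softmax output at each position:

```python
# Sketch: average cross-entropy over response tokens only, prompt tokens masked out.
import math

def masked_cross_entropy(token_probs: list[float], loss_mask: list[int]) -> float:
    """Average negative log-likelihood of target tokens where loss_mask is 1.
    loss_mask is 0 for prompt tokens (excluded) and 1 for response tokens."""
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m == 1]
    return sum(losses) / len(losses)

# Toy sequence: two prompt tokens (masked out) followed by two response tokens.
probs = [0.9, 0.8, 0.5, 0.25]  # model's probability of the target token at each step
mask = [0, 0, 1, 1]
print(masked_cross_entropy(probs, mask))
```

Note that the prompt-token probabilities have no effect on the loss; only the response tokens drive the weight update.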

You can perform SFT with as few as several hundred examples and scale up to a few hundred thousand. SFT samples must be high quality and directly aligned with the desired model behavior.

##### When to use SFT


Use SFT when you have a well-defined task with clear desired outputs. If you can explicitly state "Given X input, the correct output is Y" and gather examples of such mappings, supervised fine-tuning is a good choice. SFT excels in these scenarios:
+ **Structured or complex classification tasks** – Classify internal documents or contracts into many custom categories. With SFT, the model learns these specific categories better than prompting alone.
+ **Question-answering or transformation tasks with known answers** – Fine-tune a model to answer questions from a company's knowledge base, or convert data between formats where each input has a correct response.
+ **Formatting and style consistency** – Train the model to always respond in a certain format or tone by fine-tuning on examples of the correct format or tone. For instance, training on prompt-response pairs that demonstrate a particular brand voice teaches the model to generate outputs with that style. Instruction-following behavior is often initially taught through SFT on curated examples of good assistant behavior.

SFT is the most direct way to teach an LLM a new skill or behavior when you can specify what the right behavior looks like. It uses the model's existing language understanding and focuses it on your task. Use SFT when you want the model to do a specific thing and you have or can create a dataset of examples.

Use SFT when you can assemble high-quality prompt and response pairs that closely mirror the behavior you want. It fits tasks with clear targets or deterministic formats such as schemas, function or tool calls, and structured answers where imitation is an appropriate training signal. The goal is behavior shaping: teaching the model to treat prompts as tasks, follow instructions, adopt tone and refusal policies, and produce consistent formatting. Plan for at least hundreds of demonstrations, with data quality, consistency, and deduplication mattering more than raw volume. For a straightforward, cost-efficient update, use parameter-efficient methods like Low-Rank Adaptation to train small adapters while leaving most of the backbone untouched.

##### When not to use SFT


Don't use SFT when the gap is knowledge rather than behavior. It doesn't teach the model new facts, jargon, or recent events. In those cases, use retrieval-augmented generation to bring external knowledge at inference. Avoid SFT when you can measure quality but can't label a single right answer. Use reinforcement fine-tuning with verifiable rewards or an LLM-as-a-judge to optimize those rewards directly. If your needs or content change frequently, rely on retrieval and tool use rather than retraining the model.

**Topics**
+ [Compute requirements for Amazon Nova 2 models](#nova-hp-compute-2)
+ [Best Practices for Amazon Nova customization](#best-practices)
+ [Nova Forge SDK](nova-hp-forge-sdk.md)
+ [Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)](nova-hp-cluster.md)
+ [Amazon SageMaker HyperPod Essential Commands Guide](nova-hp-essential-commands-guide.md)
+ [Nova Forge access and setup](nova-forge-hp-access.md)
+ [Training for Amazon Nova models](nova-hp-training.md)
+ [Evaluating your trained model](nova-hp-evaluate.md)
+ [Monitoring HyperPod jobs with MLflow](nova-hp-mlflow.md)

# Nova Forge SDK


The Amazon Nova Forge SDK is a comprehensive Python SDK that provides a unified, programmatic interface for the complete Amazon Nova model customization lifecycle. The SDK simplifies model customization by offering a single, consistent API for training, evaluation, monitoring, deployment, and inference across Amazon SageMaker and Amazon Bedrock platforms.

For more information, see [Nova Forge SDK](nova-forge-sdk.md).

# Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)
HP cluster setup

To customize a model on SageMaker HyperPod, you must first set up the necessary infrastructure. For details on setting up a SageMaker HyperPod EKS cluster with a restricted instance group (RIG), visit the [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US), which provides a detailed walkthrough of the setup process.

# Amazon SageMaker HyperPod Essential Commands Guide
Essential Commands Guide

Amazon SageMaker HyperPod provides extensive command-line functionality for managing training workflows. This guide covers essential commands for common operations, from connecting to your cluster to monitoring job progress.

**Prerequisites**  
Before using these commands, ensure you have completed the following setup:
+ SageMaker HyperPod cluster with RIG created (typically in us-east-1)
+ Output Amazon S3 bucket created for training artifacts
+ IAM roles configured with appropriate permissions
+ Training data uploaded in correct JSONL format
+ FSx for Lustre sync completed (verify in cluster logs on first job)

**Topics**
+ [Installing Recipe CLI](#nova-hp-essential-commands-guide-install)
+ [Connecting to your cluster](#nova-hp-essential-commands-guide-connect)
+ [Starting a training job](#nova-hp-essential-commands-guide-start-job)
+ [Checking job status](#nova-hp-essential-commands-guide-status)
+ [Monitoring job logs](#nova-hp-essential-commands-guide-logs)
+ [Listing active jobs](#nova-hp-essential-commands-guide-list-jobs)
+ [Canceling a job](#nova-hp-essential-commands-guide-cancel-job)
+ [Running an evaluation job](#nova-hp-essential-commands-guide-evaluation)
+ [Common issues](#nova-hp-essential-commands-guide-troubleshooting)

## Installing Recipe CLI


Navigate to the root of your recipe repository before running the installation command.

**Use the SageMaker HyperPod recipes repository for non-Forge customization techniques. For Forge-based customization, use the Forge-specific recipe repository.**  
Run the following commands to install the SageMaker HyperPod CLI:

**Note**  
Make sure you aren't in an active conda/Anaconda/Miniconda environment or another virtual environment. If you are, exit the environment first:  
`conda deactivate` for conda/Anaconda/Miniconda environments
`deactivate` for Python virtual environments

If you are using a non-Forge customization technique, download the sagemaker-hyperpod-recipes as shown below:

```
git clone -b release_v2 https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install -e .
cd ..
root_dir=$(pwd)
export PYTHONPATH=${root_dir}/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

If you are a **Forge subscriber**, download the recipes using the following process.

```
mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
pip install -e .

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

**Tip**  
To use a [new virtual environment](https://docs.python.org/3/library/venv.html) before running `pip install -e .`, run:  
`python -m venv nova_forge`
`source nova_forge/bin/activate`
Your command line will now display `(nova_forge)` at the beginning of your prompt
This ensures there are no competing dependencies when using the CLI

**Purpose**: Why run `pip install -e .`?

This command installs the SageMaker HyperPod CLI in editable mode, allowing you to use updated recipes without reinstalling each time. It also enables you to add new recipes that the CLI can automatically pick up.

## Connecting to your cluster


Connect the SageMaker HyperPod CLI to your cluster before running any jobs:

```
export AWS_REGION=us-east-1 && hyperpod connect-cluster --cluster-name <your-cluster-name> --region us-east-1
```

**Important**  
This command creates a context file (`/tmp/hyperpod_context.json`) that subsequent commands require. If you see an error about this file not found, re-run the connect command.

**Pro tip**: You can further configure your cluster to always use the `kubeflow` namespace by adding the `--namespace kubeflow` argument to your command as follows:

```
export AWS_REGION=us-east-1 && \
hyperpod connect-cluster \
--cluster-name <your-cluster-name> \
--region us-east-1 \
--namespace kubeflow
```

This saves you the effort of adding the `-n kubeflow` in every command when interacting with your jobs.

## Starting a training job


**Note**  
If running PPO/RFT jobs, ensure you add label selector settings to `src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/cluster/k8s.yaml` so that all pods are scheduled on the same node.  

```
label_selector:
  required:
    sagemaker.amazonaws.com/instance-group-name:
      - <rig_group>
```

Launch a training job using a recipe with optional parameter overrides:

```
hyperpod start-job -n kubeflow \
--recipe fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft \
--override-parameters '{
"instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest"
  }'
```

**Expected output**:

```
Final command: python3 <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=fine-tuning/nova/nova_micro_p5_gpu_sft cluster_type=k8s cluster=k8s base_results_dir=/local/home/<username>/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" container="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:HP-SFT-DATAMIX-latest"

Prepared output directory at /local/home/<username>/results/<job-name>/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/<username>/results/<job-name>/<job-name>_launch.sh
Running Helm script: /local/home/<username>/results/<job-name>/<job-name>_launch.sh

NAME: <job-name>
LAST DEPLOYED: Mon Sep 15 20:56:50 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/<your-cluster-name>"
}
```

## Checking job status


Monitor your running jobs using kubectl:

```
kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep <your-job-name>)
```

**Understanding pod statuses**  
The following table explains common pod statuses:


| Status | Description | 
| --- |--- |
| `Pending` | Pod accepted but not yet scheduled onto a node, or waiting for container images to be pulled | 
| `Running` | Pod bound to a node with at least one container running or starting | 
| `Succeeded` | All containers completed successfully and won't restart | 
| `Failed` | All containers terminated with at least one ending in failure | 
| `Unknown` | Pod state cannot be determined (usually due to node communication issues) | 
| `CrashLoopBackOff` | Container repeatedly failing; Kubernetes backing off from restart attempts | 
| `ImagePullBackOff` / `ErrImagePull` | Unable to pull container image from registry | 
| `OOMKilled` | Container terminated for exceeding memory limits | 
| `Completed` | Job or Pod finished successfully (batch job completion) | 

**Tip**  
Use the `-w` flag to watch pod status updates in real-time. Press `Ctrl+C` to stop watching.

## Monitoring job logs


You can view your logs in one of three ways:

**Using CloudWatch**  
Your logs are available under CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, navigate to the CloudWatch console in your account and search for your cluster name. For example, if your cluster is named `my-hyperpod-rig`, the log group has the prefix:
+ **Log group**: `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}`
+ Once you're in the log group, you can find your specific log stream using the node instance ID, such as `hyperpod-i-00b3d8a1bf25714e4`.
  + Here, `i-00b3d8a1bf25714e4` is the instance ID of the HyperPod node where your training job is running. Recall that the output of the earlier command `kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-cpt-run)` included a column called **NODE**.
  + In this case, the "master" node run was on `hyperpod-i-00b3d8a1bf25714e4`, so use that string to select the log stream to view. Select the one that says `SagemakerHyperPodTrainingJob/rig-group/[NODE]`.

**Using CloudWatch Insights**  
If you have your job name handy and don't wish to go through all the steps above, you can simply query all logs under `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}` to find the individual log.

CPT:

```
fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting CPT Job/
| sort @timestamp desc
| limit 100
```

For job completion, replace `Starting CPT Job` with `CPT Job completed`.

Then you can click through the results and pick the one that says "Epoch 0" since that will be your master node.

**Using the AWS CLI**  
You can also tail your logs with the AWS CLI for live log tracking in your terminal. Before doing so, check your AWS CLI version using `aws --version`, because the log commands differ between v1 and v2.

**For AWS CLI v1**:

```
aws logs get-log-events \
--log-group-name /aws/sagemaker/YourLogGroupName \
--log-stream-name YourLogStream \
--start-from-head | jq -r '.events[].message'
```

**For AWS CLI v2**:

```
aws logs tail /aws/sagemaker/YourLogGroupName \
 --log-stream-name YourLogStream \
--since 10m \
--follow
```

## Listing active jobs


View all jobs running in your cluster:

```
hyperpod list-jobs -n kubeflow
```

**Example output**:

```
{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}
```

## Canceling a job


Stop a running job at any time:

```
hyperpod cancel-job --job-name <job-name> -n kubeflow
```

**Finding your job name**  
**Option 1: From your recipe**

The job name is specified in your recipe's `run` block:

```
run:
  name: "my-test-run"                        # This is your job name
  model_type: "amazon.nova-micro-v1:0:128k"
  ...
```

**Option 2: From list-jobs command**

Use `hyperpod list-jobs -n kubeflow` and copy the `Name` field from the output.

## Running an evaluation job


Evaluate a trained model or base model using an evaluation recipe.

**Prerequisites**  
Before running evaluation jobs, ensure you have:
+ Checkpoint Amazon S3 URI from your training job's `manifest.json` file (for trained models)
+ Evaluation dataset uploaded to Amazon S3 in the correct format
+ Output Amazon S3 path for evaluation results

**Command**  
Run the following command to start an evaluation job:

```
hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'
```

**Parameter descriptions**:
+ `recipes.run.name`: Unique name for your evaluation job
+ `recipes.run.model_name_or_path`: Amazon S3 URI from `manifest.json` or base model path (e.g., `nova-micro/prod`)
+ `recipes.run.output_s3_path`: Amazon S3 location for evaluation results
+ `recipes.run.data_s3_path`: Amazon S3 location of your evaluation dataset

**Tips**:
+ **Model-specific recipes**: Each model size (micro, lite, pro) has its own evaluation recipe
+ **Base model evaluation**: Use base model paths (e.g., `nova-micro/prod`) instead of checkpoint URIs to evaluate base models

**Evaluation data format**  
**Input format (JSONL)**:

```
{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
```

**Output format**:

```
{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}
```

**Field descriptions**:
+ `prompt`: Formatted input sent to the model
+ `inference`: Model's generated response
+ `gold`: Expected correct answer from input dataset
+ `metadata`: Optional metadata passed through from the input dataset
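Before uploading an evaluation dataset, it can help to run a quick pre-flight check on each JSONL line. The sketch below validates one line against the fields shown above; it assumes all values are strings and that only the four documented fields appear, which may be stricter than the service requires:

```python
import json

REQUIRED_FIELDS = {"query", "response"}
OPTIONAL_FIELDS = {"system", "metadata"}

def validate_eval_line(line):
    """Return a list of problems found in one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    unknown = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    for key, value in record.items():
        if not isinstance(value, str):
            problems.append(f"field '{key}' should be a string")
    return problems

good = '{"system": "s", "query": "What is 16 + 16?", "response": "32"}'
print(validate_eval_line(good))              # []
print(validate_eval_line('{"query": "q"}'))  # reports the missing "response" field
```

Running this over every line of your file before the `aws s3 cp` upload catches formatting mistakes earlier than a failed evaluation job would.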

## Common issues

+ `ModuleNotFoundError: No module named 'nemo_launcher'` indicates you might have to add `nemo_launcher` to your Python path, based on where `hyperpod_cli` is installed. Sample command:

  ```
  export PYTHONPATH=<path_to_hyperpod_cli>/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
  ```
+ `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/hyperpod_current_context.json'` indicates you missed running the `hyperpod connect-cluster` command.
+ If you don't see your job scheduled, double-check that the SageMaker HyperPod CLI output includes the deployment section with the job name and other metadata (`NAME`, `NAMESPACE`, `STATUS`). If not, reinstall Helm by running:

  ```
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  rm -f ./get_helm.sh
  ```

# Nova Forge access and setup
Nova Forge access and setup

To set up Amazon Nova Forge for use with training jobs, you need to:
+ Subscribe to Amazon Nova Forge
+ Set up a cluster

**Topics**
+ [Subscribe to Amazon Nova Forge](nova-forge-subscribing.md)
+ [Set up infrastructure](nova-forge-hyperpod-setup.md)
+ [Responsible AI](nova-forge-responsible-ai.md)

# Subscribe to Amazon Nova Forge
Subscribe to Nova Forge

To access Amazon Nova Forge features, complete the following steps:

1. Verify administrator access to the AWS account.

1. Navigate to the SageMaker AI console and [request access to Amazon Nova Forge](nova-forge.md).

1. Wait for the Amazon Nova team to email a confirmation after the subscription request is approved.

1. Tag your execution role with the `forge-subscription` tag. This tag is required for accessing Amazon Nova Forge features and checkpoints. Add the following tag to your execution role:
   + Key: `forge-subscription`
   + Value: `true`

**Note**  
Standard Amazon Nova features remain available without a Forge subscription. Amazon Nova Forge is designed for building custom frontier models with control and flexibility across all model training phases.

# Set up infrastructure
Set up HyperPod infrastructure

Once your Amazon Nova Forge subscription is approved, set up the necessary infrastructure to use Forge-enabled features. For detailed instructions on creating an EKS cluster with a restricted instance group (RIG), follow the [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US) instructions.

# Responsible AI
Responsible AI

**Content moderation settings**: Amazon Nova Forge customers have access to Customizable Content Moderation Settings (CCMS) for Amazon Nova Lite 1.0 and Pro 1.0 models. CCMS allows you to adjust content moderation controls to align with your specific business requirements while maintaining essential responsible AI safeguards. To determine if your business use case qualifies for CCMS, contact your Amazon Web Services account manager.

Amazon Nova Forge provides a Responsible AI Toolkit that includes training data, evaluation benchmarks, and runtime controls to help you align your models with Amazon Nova's responsible AI guidelines.

**Training data**: The "RAI" category in data mixing contains cases and scenarios emphasizing responsible AI principles, safety considerations, and responsible technology deployment. Use these to align your models responsibly during continued pre-training.

**Evaluations**: Benchmark tasks are available to test your model's ability to detect and reject inappropriate, harmful, or incorrect content. Use these evaluations to measure the difference between base model performance and your custom model performance.

# Training for Amazon Nova models
Training

Training Amazon Nova models on SageMaker HyperPod supports multiple techniques including Continued Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Fine-Tuning (RFT). Each technique serves different customization needs and can be applied to different Amazon Nova model versions.

**Topics**
+ [Continued pre-training (CPT)](nova-cpt.md)
+ [Supervised fine-tuning (SFT)](nova-fine-tune.md)
+ [Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod](nova-hp-rft.md)

# Continued pre-training (CPT)
Continued pre-training (CPT)

Continued pre-training (CPT) is a training technique that extends the pre-training phase of a foundation model by exposing it to additional unlabeled text from specific domains or corpora. Unlike supervised fine-tuning, which requires labeled input-output pairs, CPT trains on raw documents to help the model acquire deeper knowledge of new domains, learn domain-specific terminology and writing patterns, and adapt to particular content types or subject areas.

This approach is particularly valuable when you have large volumes (tens of billions of tokens) of domain-specific text data, such as legal documents, medical literature, technical documentation, or proprietary business content, and you want the model to develop native fluency in that domain. Generally, after the CPT stage, the model needs to undergo additional instruction tuning stages to enable the model to use the newly acquired knowledge and complete useful tasks.

**Supported models**  
CPT is available for the following Amazon Nova models:
+ Nova 1.0 (Micro, Lite, Pro)
+ Nova 2.0 (Lite)

**When to use Nova 1.0 versus Nova 2.0**  
The Amazon Nova family of models offers multiple price-performance operating points to optimize between accuracy, speed, and cost.

Choose Nova 2.0 when you need the following:
+ Advanced reasoning capabilities for complex analytical tasks
+ Superior performance on coding, math, and scientific problem-solving
+ Longer context length support
+ Better multilingual performance

**Note**  
The larger model is not always better. Consider the cost-performance tradeoff and your specific business requirements when selecting between Nova 1.0 and Nova 2.0 models.

# CPT on Nova 2.0


Amazon Nova Lite 2.0 is a reasoning model trained on larger and more diverse datasets than Nova Lite 1.0. Despite being a larger model, Nova Lite 2.0 delivers faster inference than Nova Lite 1.0 while offering enhanced reasoning capabilities, longer context lengths, and improved multilingual performance.

CPT on Nova 2.0 allows you to extend these advanced capabilities with your domain-specific data, enabling the model to develop deep expertise in specialized areas while maintaining its superior reasoning and analytical abilities.

## Sample CPT recipe


The following is a sample recipe for CPT. You can find this recipe and others in the [recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/nova) repository.

```
# Note:
# This recipe can run on p5.48xlarge
# Run config
run:
  name: "my-cpt-run"                           # A descriptive name for your training job
  model_type: "amazon.nova-2-lite-v1:0:256k"   # Model variant specification, do not change
  model_name_or_path: "nova-lite-2/prod"        # Base model path, do not change
  replicas: 8                                   # Number of compute instances for training, allowed values are 4, 8, 16, 32
  data_s3_path: ""                              # Customer data paths
  validation_data_s3_path: ""                   # Customer validation data paths
  output_s3_path: ""                            # Output artifact path,  job-specific configuration - not compatible with standard SageMaker Training Jobs
  mlflow_tracking_uri: ""                       # Required for MLFlow
  mlflow_experiment_name: "my-cpt-experiment"   # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-cpt-run"                 # Optional for MLFlow. Note: leave this field non-empty

## Training specific configs
training_config:
  task_type: cpt
  max_length: 8192                              # Maximum context window size (tokens)
  global_batch_size: 256                        # Global batch size, allowed values are 32, 64, 128, 256.

  trainer:
    max_steps: 10                               # The number of training steps to run total
    val_check_interval: 10                      # The number of steps between running validation. Integer count or float percentage
    limit_val_batches: 2                        # Batches of the validation set to use each trigger

  model:
    hidden_dropout: 0.0                         # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0                      # Dropout for attention weights, must be between 0.0 and 1.0

  optim:
    optimizer: adam
    lr: 1e-5                                    # Learning rate
    name: distributed_fused_adam                # Optimizer algorithm, do not change
    adam_w_mode: true                           # Enable AdamW mode
    eps: 1e-06                                  # Epsilon for numerical stability
    weight_decay: 0.0                           # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                             # Beta1 for Adam optimizer
    adam_beta2: 0.95                            # Beta2 for Adam optimizer
    sched:
      warmup_steps: 10                          # Learning rate warmup steps
      constant_steps: 0                         # Steps at constant learning rate
      min_lr: 1e-6                              # Minimum learning rate, must be lower than lr
```

## Data preparation for CPT on 2.0


**Data format requirements**  
Training and validation datasets must be JSONL files following the format shown below, where each line contains a JSON object representing a conversation with the required fields and structure. Here is an example:

```
{"text": "AWS stands for Amazon Web Services"}
{"text": "Amazon SageMaker is a fully managed machine learning service"}
{"text": "Amazon Bedrock is a fully managed service for foundation models"}
```

Text entries should contain naturally flowing, high-quality content that represents the target domain.

Test that the data can be converted into [Arrow format](https://huggingface.co/docs/datasets/en/about_arrow). Use the following Python script to verify it. Ensure you use `datasets` version 2.18.0 or later:

```
from datasets import load_dataset, load_from_disk
from pathlib import Path

input_path = Path("<Your jsonl file>")
output_path = Path("<Your output directory>")

dataset = load_dataset("json", data_files=str(input_path), split="train")
dataset.save_to_disk(str(output_path), max_shard_size="1GB")

try:
    test_dataset = load_from_disk(str(output_path))
    print(f"Dataset loaded successfully ✅! Contains {len(test_dataset)} samples")
except Exception as e:
    print(e)
```

It should print the same number of lines that were in the JSONL file.

When using data mixing, run the first job with `max_steps=2`. This helps create optimizations in the cluster for data access and validates that all the data mixes are available.

**How to prepare data for CPT**  
Training data is the most crucial factor determining the success of continued pre-training. While CPT data is often described as "unlabeled," the reality is far more nuanced. How data is structured, formatted, and presented determines whether the model will acquire the knowledge and skills required for the business use case.

### Preparing structured business datasets for CPT


This is a common challenge for companies and organizations building foundation models specialized in their domain. Most businesses possess rich repositories of structured data: product catalogs, user profiles, transaction logs, form submissions, API calls, and operational metadata. At first glance, this looks very different from the unstructured web text typically used in standard pre-training.

To effectively learn from structured business data, think carefully about downstream tasks and design the data presentation to force the model to learn the right predictive relationships.

To unlock the full potential of continuous pre-training, consider:
+ What tasks the model should perform at inference time
+ What information is present in the raw data
+ How to structure that data so the model learns to extract and manipulate the information correctly

Simply dumping structured data into training won't teach the model to reason about it. Actively shape the data presentation to guide what the model learns.

The following sections review literature demonstrating the importance of data augmentation and provide example augmentation strategies for structured business data, offering useful ideas on how to treat and organize business datasets for CPT.

**Structured data for CPT in the literature**  
CPT can pack domain facts into the model but often fails to make those facts retrievable and manipulable when inputs or tasks shift. Controlled experiments show that without diverse augmentation during pre-training, models memorize facts in brittle ways that remain hard to extract even after later instruction tuning, and they recommend injecting instruction-like signals early in training. For semi-structured data, randomized serialization and other augmentations reduce schema overfitting, which is why CPT should be interleaved with instruction-style tasks rather than run first with instruction fine-tuning (IFT) afterward. Finance-focused work further finds that jointly mixing CPT and instruction data at batch time improves generalization and reduces forgetting versus the sequential recipe. The Qwen technical report converges on the same pattern by integrating high-quality instruction data into pre-training itself, which boosts in-context learning and preserves instruction following while acquiring new domain knowledge.

Data augmentation for semi-structured corpora is a key lever. Synthetic graph-aware CPT expands small domain sets into entity-linked corpora that explicitly teach relationships, and compounds with retrieval at inference time. Joint CPT-plus-instruction mixing outperforms sequential pipelines in finance, and balancing domain data with general data lowers degradation on general skills. Very large-scale domain CPT can also retain broad ability and even allow trade-offs through model merging, yet it still points to instruction tuning as an essential next step, reinforcing the value of introducing instruction signals during CPT.

**Injecting diversity through randomization and shuffling**  
A general strategy that helps the model learn effectively from structured and semi-structured datasets is to shuffle the order of fields in the datasets, and even randomly drop some keys.

Shuffling the fields forces the model to read what each value means instead of where it appears, and to learn the relationships between all the fields. For example, for a video game listed on the Amazon store, when "Title," "Platform," "Price," "Condition," and "Edition" arrive in different permutations, the model can't rely on "the third slot is platform"; it must bind labels to values and learn the bilateral relationships among attributes: title ⇄ platform, platform ⇄ price, condition ⇄ price. It can then, for example, infer a likely platform from a game name and an observed price, or estimate a plausible price range given a title and platform.

Randomly dropping keys during serialization acts like feature dropout: it prevents co-adaptation on any one field and forces the model to recover missing information from the remaining evidence. If "Platform" is absent, the model must pick it up from the title string or compatibility text; if "Price" is hidden, it has to triangulate from platform, edition, and condition. This builds symmetry (A→B and B→A), robustness to messy real-world listings, and schema invariance when fields are missing, renamed, or reordered.

A shopping-style example makes this concrete. Serialize the same item multiple ways, for example "Title: 'Elden Ring' | Platform: PlayStation 5 | Condition: Used - Like New | Price: $134.99" and a permutation like "Price: $134.99 | Title: 'Elden Ring' | Condition: Used - Like New | Platform: PlayStation 5", and on some passes drop "Platform" while leaving "Compatible with PS5" in the description. Train complementary objectives such as predicting platform from {title, price} and predicting a price bucket from {title, platform}. Because the order and even the presence of keys vary, the only stable strategy is to learn the true relationships between attributes rather than memorize a template.
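A minimal sketch of this augmentation, with hypothetical field names and a `|` separator (the exact serialization format is your choice):

```python
import random

def serialize_record(record, drop_prob=0.2, rng=None):
    """Serialize a structured record with shuffled field order and random key dropout."""
    rng = rng or random.Random()
    fields = list(record.items())
    rng.shuffle(fields)                                       # field order varies between passes
    kept = [f for f in fields if rng.random() >= drop_prob]   # each key may be dropped, like feature dropout
    if not kept:                                              # never emit an empty example
        kept = fields[:1]
    return " | ".join(f"{key}: {value}" for key, value in kept)

listing = {
    "Title": "'Elden Ring'",
    "Platform": "PlayStation 5",
    "Condition": "Used - Like New",
    "Price": "$134.99",
}
rng = random.Random(0)
for _ in range(3):
    print(serialize_record(listing, rng=rng))  # order and key subset vary per pass
```

Emitting several serializations of each record this way yields the varied permutations and key dropouts described above.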

### The way data is presented matters


LLMs learn by predicting the next token from what they have already seen. So the order of fields and events shown during training decides what the model can learn. If the training format matches the real task, the loss lands on the exact decision tokens. If fields are tossed together without structure, the model learns shortcuts or memorizes popularity and then fails when asked to choose among options.

Show the situation first, then the options, then the decision. If the model should also learn about outcomes or explanations, put them after the decision.
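As a sketch of this ordering, with hypothetical event fields, a logged choice could be serialized so the loss lands on the decision tokens, with the outcome placed after the decision:

```python
def format_decision_example(situation, options, decision, outcome=""):
    """Serialize a logged event as situation -> options -> decision (-> outcome)."""
    parts = [
        f"Situation: {situation}",
        "Options: " + "; ".join(options),
        f"Decision: {decision}",
    ]
    if outcome:
        parts.append(f"Outcome: {outcome}")  # outcomes/explanations come after the decision
    return " ".join(parts)

text = format_decision_example(
    situation="Returning customer browsing racing games on PS5",
    options=["Gran Turismo 7", "F1 24", "Need for Speed Unbound"],
    decision="Gran Turismo 7",
    outcome="purchased",
)
print(text)
```

Because the decision tokens appear only after the situation and options, next-token prediction is forced to model the choice rather than memorize popularity.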

### Packing samples for CPT


**What is packing?**  
It simply means to fill each sequence window in the training data with multiple whole examples so the window is dense with real tokens, not padding.

**Why it matters**  
During training a maximum context length is set, for example 8,192 tokens. Batches are shaped to [batch size × context length]. If a training example is shorter than the context length, the remaining positions are padded. Padding still runs through attention and MLP kernels even if loss is masked, so compute is paid for tokens that carry no learning signal.

**How to do packing?**  
To pack multiple samples, concatenate multiple training samples with a ` [DOC] ` separator in between (note the space before and after `[DOC]`) such that the full length of the packed sample is under the desired context length.

An example packed document would look like this:

```
{"text": "training sample 1 [DOC] training sample 2 [DOC] training sample 3"}
```
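A minimal greedy packing sketch (function names hypothetical). It uses character length as a stand-in for the real token count; in practice, swap `length_fn` for your tokenizer's token counter and set `max_len` to your training context length:

```python
SEPARATOR = " [DOC] "

def pack_samples(samples, max_len, length_fn=len):
    """Greedily pack whole samples into windows of at most max_len, joined by SEPARATOR."""
    packed, current = [], ""
    for sample in samples:
        candidate = sample if not current else current + SEPARATOR + sample
        if length_fn(candidate) <= max_len:
            current = candidate                    # sample still fits in the open window
        else:
            if current:
                packed.append({"text": current})   # close this window and start a new one
            current = sample
    if current:
        packed.append({"text": current})
    return packed

samples = ["training sample 1", "training sample 2", "training sample 3"]
for line in pack_samples(samples, max_len=60):
    print(line)
```

Each returned dictionary is one packed `{"text": ...}` line of the output JSONL; samples are kept whole, so a window is closed as soon as the next sample would overflow it.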

### CPT Tuning Parameters


The parameters that are available for fine-tuning with CPT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The only available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod`, or the S3 path of a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and Region.
+ **mlflow_tracking_uri**: The ARN of the MLflow app to use for MLflow logging.
+ **mlflow_experiment_name**: The MLflow experiment name.
+ **mlflow_run_name**: The MLflow run name.

**Training Configuration**  

+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 8192 tokens for CPT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the max_length parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward or backward pass across all devices and workers.

  This value multiplies the per-device batch size and the number of devices. It affects the stability of training and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.

**Trainer Settings**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` elements.

**Model Settings**  

+ **hidden_dropout**: The probability of dropping hidden state outputs. Increase this value (typically to between 0.0 and 0.2) to reduce overfitting on smaller datasets. Valid values are between 0 and 1, inclusive.
+ **attention_dropout**: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0 and 1, inclusive.

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0 and 1, inclusive.
+ **name**: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
+ **weight_decay**: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
+ **warmup_steps**: The number of steps over which the learning rate is gradually increased. This improves training stability. Valid values are between 1 and 20, inclusive.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0 and 1, inclusive, but must be less than the learning rate.
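Assembled into a recipe, these parameters might look like the following sketch. Field names mirror the parameter list above and the SFT sample recipe later on this page, but the exact nesting can differ by recipe version, so treat the published Nova recipes as authoritative; all values here are illustrative.

```yaml
run:
  name: my-cpt-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  replicas: 4                                   # Allowed values: 4, 8, 16, 32
  data_s3_path: s3://my-bucket-name/cpt-train.jsonl
  output_s3_path: s3://my-bucket-name/outputs/

training_config:
  max_steps: 100                                # Each step consumes global_batch_size samples
  max_length: 8192                              # Maximum supported for CPT
  global_batch_size: 32
  hidden_dropout: 0.0                           # Raise toward 0.2 on small datasets
  attention_dropout: 0.0

  lr_scheduler:
    warmup_steps: 10                            # Valid values: 1-20
    min_lr: 1e-6                                # Must be below lr

  optim_config:
    name: distributed_fused_adam                # Only supported optimizer
    lr: 1e-5                                    # Recommended range: 1e-6 to 1e-4
    weight_decay: 0.01
```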

# Supervised fine-tuning (SFT)
Supervised fine-tuning (SFT)

The SFT training process consists of two main stages:
+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. This configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (epochs, learning rate, batch size)
  + Optional components (distributed training parameters, and so on)

## Nova Model Comparison and Selection


Amazon Nova 2.0 is a model trained on a larger and more diverse dataset than Amazon Nova 1.0. Key improvements include:
+ **Enhanced reasoning abilities** with explicit reasoning mode support
+ **Broader multilingual performance** across additional languages
+ **Improved performance on complex tasks** including coding and tool use
+ **Extended context handling** with better accuracy and stability at longer context lengths

## When to Use Nova 1.0 vs. Nova 2.0


Choose Amazon Nova 2.0 when:
+ Superior performance with advanced reasoning capabilities is needed
+ Multilingual support or complex task handling is required
+ Better results on coding, tool calling, or analytical tasks are needed

# SFT on Nova 2.0


Amazon Nova Lite 2.0 brings enhanced capabilities for supervised fine-tuning, including advanced reasoning mode, improved multimodal understanding, and extended context handling. SFT on Nova 2.0 enables you to adapt these powerful capabilities to your specific use cases while maintaining the model's superior performance on complex tasks.

Key features of SFT on Nova 2.0 include:
+ **Reasoning mode support**: Train models to generate explicit reasoning traces before final answers for enhanced analytical capabilities.
+ **Advanced multimodal training**: Fine-tune on document understanding (PDF), video understanding, and image-based tasks with improved accuracy.
+ **Tool calling capabilities**: Train models to effectively use external tools and function calling for complex workflows.
+ **Extended context support**: Leverage longer context windows with better stability and accuracy for document-intensive applications.

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

**Topics**
+ [Reasoning Mode Selection (Nova 2.0 Only)](#nova-sft-2-reasoning-mode)
+ [Tool calling data format](#nova-sft-2-tool-calling)
+ [Document understanding data format](#nova-sft-2-document-understanding)
+ [Video Understanding for SFT](#nova-sft-2-video-understanding)
+ [Data Upload Instructions](#nova-sft-2-data-upload)
+ [Creating a Fine-Tuning Job](#nova-sft-2-creating-job)
+ [SFT Tuning Parameters](#nova-sft-2-tuning-parameters)
+ [Hyperparameter Guidance](#nova-sft-2-hyperparameters)

## Sample SFT recipe


Below is a sample recipe for SFT. You can find this recipe and others in the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova) repository.

```
run:
  name: my-full-rank-sft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://my-bucket-name/train.jsonl   # Training data path (HyperPod-specific; not compatible with SageMaker Training Jobs)
  replicas: 4                                     # Number of compute instances for training, allowed values are 4, 8, 16, 32
  output_s3_path: s3://my-bucket-name/outputs/    # Output artifact path (HyperPod job-specific; not compatible with standard SageMaker Training Jobs)
  mlflow_tracking_uri: ""                         # Required for MLflow logging
  mlflow_experiment_name: "my-full-rank-sft-experiment"  # Optional; if set, leave this field non-empty
  mlflow_run_name: "my-full-rank-sft-run"         # Optional; if set, leave this field non-empty

training_config:
  max_steps: 100                    # Maximum training steps. Minimum is 4.
  save_steps: ${oc.select:training_config.max_steps}  # Checkpoint save interval, in training steps
  save_top_k: 5                     # Keep the top K best checkpoints (HyperPod jobs only). Minimum is 1.
  max_length: 32768                 # Sequence length (options: 8192, 16384, 32768 [default], 65536)
  global_batch_size: 32             # Global batch size (options: 32, 64, 128)
  reasoning_enabled: true           # Set to true if the data has reasoningContent; otherwise false

  lr_scheduler:
    warmup_steps: 15                # Learning rate warmup steps. Recommend 15% of max_steps
    min_lr: 1e-6                    # Minimum learning rate, must be between 0.0 and 1.0

  optim_config:                     # Optimizer settings
    lr: 1e-5                        # Learning rate, must be between 0.0 and 1.0
    weight_decay: 0.0               # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                  # Exponential decay rate for first-moment estimates
    adam_beta2: 0.95                 # Exponential decay rate for second-moment estimates

  peft:                             # Parameter-efficient fine-tuning (LoRA)
    peft_scheme: "null"             # "null" = full-rank fine-tuning; set to "lora" for LoRA PEFT
```

## Reasoning Mode Selection (Nova 2.0 Only)


Amazon Nova 2.0 supports reasoning mode for enhanced analytical capabilities:
+ **Reasoning Mode (enabled)**:
  + Set `reasoning_enabled: true` in the training configuration
  + Model trains to generate reasoning traces before final answers
  + Improves performance on complex reasoning tasks
+ **Non-Reasoning Mode (disabled)**:
  + Set `reasoning_enabled: false` or omit the parameter (default)
  + Standard SFT without explicit reasoning
  + Suitable for tasks that don't benefit from step-by-step reasoning

**Note**  
When reasoning is enabled, it operates at high reasoning effort. There is no low reasoning option for SFT.
Multimodal reasoning content is not supported for SFT. Reasoning mode applies to text-only inputs.

### Using reasoning mode with non-reasoning datasets


Training Amazon Nova on a non-reasoning dataset with `reasoning_enabled: true` is permitted. However, doing so may cause the model to lose its reasoning capabilities, as Amazon Nova primarily learns to generate the responses presented in the data without applying reasoning.

If you train Amazon Nova on a non-reasoning dataset but still want to use reasoning during inference:

1. Disable reasoning during training (`reasoning_enabled: false`)

1. Enable reasoning later during inference

While this approach allows reasoning at inference time, it does not guarantee improved performance compared to inference without reasoning.

**Best practice:** Enable reasoning for both training and inference when using reasoning datasets, and disable it for both when using non-reasoning datasets.

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Tool calling data format


SFT supports training models to use tools (function calling). Below is a sample input format for tool calling:

**Sample input:**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are an expert in composing function calls."
    }
  ],
  "toolConfig": {
    "tools": [
      {
        "toolSpec": {
          "name": "getItemCost",
          "description": "Retrieve the cost of an item from the catalog",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "item_name": {
                  "type": "string",
                  "description": "The name of the item to retrieve cost for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to retrieve cost for"
                }
              },
              "required": [
                "item_id"
              ]
            }
          }
        }
      },
      {
        "toolSpec": {
          "name": "getItemAvailability",
          "description": "Retrieve whether an item is available in a given location",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "zipcode": {
                  "type": "string",
                  "description": "The zipcode of the location to check in"
                },
                "quantity": {
                  "type": "integer",
                  "description": "The number of items to check availability for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to check availability for"
                }
              },
              "required": [
                "item_id", "zipcode"
              ]
            }
          }
        }
      }
    ]
  },
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "I need to check whether there are twenty pieces of the following item available. Here is the item ASIN on Amazon: id-123. Please check for the zipcode 94086"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "reasoningContent": {
            "reasoningText": {
              "text": "The user wants to check how many pieces of the item with ASIN id-123 are available in the zipcode 94086"
            }
          }
        },
        {
          "toolUse": {
            "toolUseId": "getItemAvailability_0",
            "name": "getItemAvailability",
            "input": {
              "zipcode": "94086",
              "quantity": 20,
              "item_id": "id-123"
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "toolResult": {
            "toolUseId": "getItemAvailability_0",
            "content": [
              {
                "text": "[{\"name\": \"getItemAvailability\", \"results\": {\"availability\": true}}]"
              }
            ]
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Yes, there are twenty pieces of item id-123 available at 94086. Would you like to place an order or know the total cost?"
        }
      ]
    }
  ]
}
```

Important considerations for tool calling data:
+ ToolUse must appear in assistant turns only
+ ToolResult must appear in user turns only
+ ToolResult should be text or JSON only; other modalities are not currently supported for Amazon Nova models
+ The inputSchema within the toolSpec must be a valid JSON Schema object
+ Each ToolResult must reference a valid toolUseId from a preceding assistant ToolUse, with each toolUseId used exactly once per conversation
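These constraints can be checked mechanically before uploading data. The helper below is a hypothetical sketch (field names taken from the sample above) that flags violations in one training sample:

```python
def validate_tool_sample(sample):
    """Return a list of violations of the tool-calling rules for one training sample."""
    errors, seen_use_ids, seen_result_ids = [], set(), set()
    for turn in sample.get("messages", []):
        role = turn.get("role")
        for block in turn.get("content", []):
            if "toolUse" in block:
                if role != "assistant":
                    errors.append("toolUse must appear in assistant turns only")
                seen_use_ids.add(block["toolUse"]["toolUseId"])
            if "toolResult" in block:
                if role != "user":
                    errors.append("toolResult must appear in user turns only")
                rid = block["toolResult"]["toolUseId"]
                if rid not in seen_use_ids:
                    errors.append(f"toolResult references unknown toolUseId {rid!r}")
                if rid in seen_result_ids:
                    errors.append(f"toolUseId {rid!r} used more than once")
                seen_result_ids.add(rid)
    return errors
```

Running this over every line of the JSONL file before upload catches ordering and ID mistakes that would otherwise surface only at training time.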

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Document understanding data format


SFT supports training models on document understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "document": {
            "format": "pdf",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/documents/customer_service_debugging.pdf",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the document to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for document understanding:
+ Only PDF files are supported
+ Maximum document size is 10 MB
+ A sample can contain documents and text, but cannot mix documents with other modalities (such as images or video)
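The three constraints above can be enforced with a quick pre-flight check before upload. This sketch is illustrative; `get_size_bytes` is a stand-in for however you look up object sizes (for example, an S3 `HeadObject` call):

```python
def check_document_sample(sample, get_size_bytes):
    """Flag violations of the document-understanding constraints for one sample."""
    errors, modalities = [], set()
    for msg in sample.get("messages", []):
        for block in msg.get("content", []):
            for modality in ("document", "image", "video"):
                if modality in block:
                    modalities.add(modality)
            if "document" in block:
                doc = block["document"]
                if doc.get("format") != "pdf":
                    errors.append("only PDF documents are supported")
                uri = doc["source"]["s3Location"]["uri"]
                if get_size_bytes(uri) > 10 * 1024 * 1024:
                    errors.append(f"{uri} exceeds the 10 MB limit")
    if "document" in modalities and modalities - {"document"}:
        errors.append("documents cannot be mixed with images or video")
    return errors
```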

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Video Understanding for SFT


SFT supports fine-tuning models for video understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "video": {
            "format": "mp4",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/videos/customer_service_debugging.mp4",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the video to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for video understanding:
+ Videos can be a maximum of 50 MB
+ Videos can be up to 15 minutes long
+ Only one video is allowed per sample; multiple videos in the same sample are not supported
+ A sample can contain video and text, but cannot mix video with other modalities (such as images or documents)

**Note**  
For more information on which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Data Upload Instructions


Upload training and validation datasets to an S3 bucket. Specify these locations in the recipe's `run` block:

```
## Run config
run:
  ...
  data_s3_path: "s3://<bucket-name>/<training-directory>/<training-file>.jsonl"
```

**Note**: Replace `<bucket-name>`, `<training-directory>`, and `<training-file>` with your actual S3 bucket and paths.

**Note**: Validation datasets are not currently supported for SFT with Amazon Nova 2.0. If a validation dataset is provided, it will be ignored.

## Creating a Fine-Tuning Job


Define the base model using the `model_type` and `model_name_or_path` fields in the `run` block:

```
## Run config
run:
  ...
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  ...
```

## SFT Tuning Parameters


The parameters that are available for tuning with SFT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod` or the S3 path of a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and Region.
+ **mlflow_tracking_uri**: The ARN of the MLflow app to use for MLflow logging.
+ **mlflow_experiment_name**: The MLflow experiment name.
+ **mlflow_run_name**: The MLflow run name.

**Training Configuration**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` elements.
+ **save_steps**: The frequency (in steps) at which to save model checkpoints during training.
+ **save_top_k**: The maximum number of best checkpoints to retain based on validation metrics.
+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32768 tokens for SFT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the max_length parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward or backward pass across all devices and workers.

  This value multiplies the per-device batch size and the number of devices. It affects the stability of training and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.
+ **reasoning_enabled**: Boolean flag to enable reasoning capabilities during training.

**Learning Rate Scheduler**  

+ **warmup_steps**: The number of steps over which the learning rate is gradually increased. This improves training stability.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0 and 1, inclusive, but must be less than the learning rate.

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0 and 1, inclusive.
+ **weight_decay**: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
+ **adam_beta1**: The exponential decay rate for the first moment estimates in the Adam optimizer. Default is 0.9.
+ **adam_beta2**: The exponential decay rate for the second moment estimates in the Adam optimizer. Default is 0.95.

**PEFT Configuration**  

+ **peft_scheme**: The parameter-efficient fine-tuning scheme to use. Options are `'null'` for full-rank fine-tuning or `lora` for LoRA-based fine-tuning.

**LoRA Tuning (when peft_scheme is 'lora')**  

+ **alpha**: The LoRA scaling parameter. Controls the magnitude of the low-rank adaptation. Typical values range from 8 to 128.
+ **lora_plus_lr_ratio**: The learning rate ratio for LoRA+ optimization. This multiplier adjusts the learning rate specifically for LoRA parameters.

## Hyperparameter Guidance


Use the following recommended hyperparameters based on the training approach:

**Full Rank Training**
+ **Epochs**: 1
+ **Learning rate (lr)**: 1e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**LoRA (Low-Rank Adaptation)**
+ **Epochs**: 2
+ **Learning rate (lr)**: 5e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**Note**: Adjust these values based on dataset size and validation performance. Monitor training metrics to prevent overfitting.

# Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod
Reinforcement Fine-Tuning (RFT)

Reinforcement Fine-Tuning (RFT) is a machine learning technique that improves model performance through feedback signals—measurable scores or rewards indicating response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards.

This approach is particularly effective for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications requiring nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that can be programmatically evaluated.

**When to use RFT**  
Use RFT when you can define clear, measurable success criteria but struggle to provide exact correct outputs for training. It's ideal for tasks where quality is subjective or multifaceted—such as creative writing, code optimization, or complex reasoning—where multiple valid solutions exist but some are clearly better than others.

RFT works best when you have the following:
+ A reliable reward function that can evaluate model outputs programmatically
+ Need to align model behavior with specific preferences or constraints
+ Situations where traditional supervised fine-tuning falls short because collecting high-quality labeled examples is expensive or impractical

Consider RFT for applications requiring iterative improvement, personalization, or adherence to complex business rules that can be encoded as reward signals.

**What RFT is best suited for**  
RFT excels in domains where output quality can be objectively measured but optimal responses are difficult to define upfront:
+ **Mathematical problem-solving**: Verifiable correctness with multiple solution paths
+ **Code generation and optimization**: Testable execution results and performance metrics
+ **Scientific reasoning tasks**: Logical consistency and factual accuracy
+ **Structured data analysis**: Programmatically verifiable outputs
+ **Multi-step reasoning**: Tasks requiring step-by-step logical progression
+ **Tool usage and API calls**: Success measurable by execution results
+ **Complex workflows**: Adherence to specific constraints and business rules

RFT works exceptionally well when you need to balance multiple competing objectives like accuracy, efficiency, and style.

**When to use reasoning mode for RFT training**  
Amazon Nova 2.0 supports reasoning mode during RFT training. The following modes are available:
+ **none**: No reasoning (omit the reasoning_effort field)
+ **low**: Minimal reasoning overhead
+ **high**: Maximum reasoning capability (default when reasoning_effort is specified)

**Note**  
There is no medium option for RFT. If the reasoning_effort field is absent from your configuration, reasoning is disabled.

Use high reasoning for the following:
+ Complex analytical tasks
+ Mathematical problem-solving
+ Multi-step logical deduction
+ Tasks where step-by-step thinking adds value

Use none (omit reasoning_effort) or low reasoning for the following:
+ Simple factual queries
+ Direct classifications
+ Speed and cost optimization
+ Straightforward question-answering

**Important**  
Higher reasoning modes increase training time and cost as well as inference latency and cost, but they also increase model capability for complex reasoning tasks.

**Supported models**  
RFT on SageMaker HyperPod supports Amazon Nova Lite 2.0 (`amazon.nova-2-lite-v1:0:256k`).

**Major steps**  
The RFT process involves four key phases:
+ **Implementing an evaluator**: Create a reward function to programmatically score model responses based on your quality criteria.
+ **Uploading prompts**: Prepare and upload training data in the specified conversational format with reference data for evaluation.
+ **Starting a job**: Launch the reinforcement fine-tuning process with your configured parameters.
+ **Monitoring**: Track training progress through metrics dashboards to ensure the model learns effectively.

Each step builds on the previous one, with the evaluator serving as the foundation that guides the entire training process by providing consistent feedback signals.

**Topics**
+ [RFT on Nova 2.0](nova-hp-rft-nova2.md)

# RFT on Nova 2.0


RFT training data follows the OpenAI conversational format. Each training example is a JSON object containing messages, reference answers, and optional tool definitions. This section provides guidance on preparing effective training data for RFT on Nova 2.0.

**Topics**
+ [Data format and structure](#nova-hp-rft-data-format)
+ [Field descriptions](#nova-hp-rft-field-descriptions)
+ [Hyperparameter guidance](#nova-hp-rft-monitoring-hyperparams)
+ [Additional properties](#nova-hp-rft-additional-properties)
+ [Dataset size recommendations](#nova-hp-rft-dataset-size)
+ [Characteristics of effective training data](#nova-hp-rft-effective-data)
+ [Monitoring RFT training](nova-hp-rft-monitoring.md)

## Data format and structure


Each training example is a JSON object containing the following:
+ **messages**: An array of conversational turns using system, user, and optionally assistant roles
+ **reference_answer**: Expected output or evaluation criteria for reward calculation
+ **tools** (optional): Array of function definitions available to the model
+ **id** (optional): Unique identifier for tracking and deduplication

Each example must be a single JSON object on its own line of the JSONL file.
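A simple way to keep that invariant is to serialize each record with `json.dumps` (which never emits raw newlines) and join the lines with `\n`. A minimal sketch:

```python
import json

def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

examples = [
    {
        "id": "math-001",
        "messages": [
            {"role": "system", "content": "You are a math tutor"},
            {"role": "user", "content": "Solve: 2x + 5 = 13"},
        ],
        "reference_answer": {"solution": "x = 4"},
    }
]
jsonl = to_jsonl(examples)
# Every line should round-trip back to the original record.
assert [json.loads(line) for line in jsonl.splitlines()] == examples
```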

### Example 1: Chemistry problem


The following example shows a chemistry problem with a reference answer containing ground truth values:

```
{  
  "id": "chem-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful chemistry assistant"  
    },  
    {  
      "role": "user",  
      "content": "Predict hydrogen bond donors and acceptors for this SMILES: CCN(CC)CCC(=O)c1sc(N)nc1C"  
    }  
  ],  
  "reference_answer": {  
    "donor_bond_counts": 2,  
    "acceptor_bond_counts": 4,  
    "explanation": "Calculated using Lipinski's rule of five: N-H groups (2 donors), N and O atoms with lone pairs (4 acceptors)"  
  }  
}
```

**Note**  
The reference_answer contains ground truth values calculated using domain-specific rules. Your reward function compares the model's predicted values against these reference values to calculate a reward score.
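For instance, a reward function for this sample might compare the two predicted counts against the reference and return a fractional score. This is a hypothetical sketch; parsing the model's raw text into the `predicted` dictionary is omitted and would be task-specific:

```python
def chemistry_reward(predicted, reference):
    """Fraction of the two bond counts the model got right (0.0, 0.5, or 1.0)."""
    keys = ("donor_bond_counts", "acceptor_bond_counts")
    matches = sum(predicted.get(k) == reference.get(k) for k in keys)
    return matches / len(keys)

reference = {"donor_bond_counts": 2, "acceptor_bond_counts": 4}
assert chemistry_reward({"donor_bond_counts": 2, "acceptor_bond_counts": 4}, reference) == 1.0
assert chemistry_reward({"donor_bond_counts": 2, "acceptor_bond_counts": 3}, reference) == 0.5
```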

### Example 2: Math problem


The following example shows a math problem with solution steps:

```
{  
  "id": "math-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  }  
}
```

### Example 3: Tool usage


The following example shows tool usage with expected behavior:

```
{  
  "id": "tool-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful game master assistant"  
    },  
    {  
      "role": "user",  
      "content": "Generate a strength stat for a warrior character. Apply a +2 racial bonus modifier."  
    }  
  ],  
  "tools": [  
    {  
      "type": "function",  
      "function": {  
        "name": "StatRollAPI",  
        "description": "Generates character stats by rolling 4d6, dropping the lowest die result, and applying a modifier.",  
        "parameters": {  
          "type": "object",  
          "properties": {  
            "modifier": {  
              "description": "An integer representing the modifier to apply to the total of the stat roll.",  
              "type": "integer"  
            }  
          },  
          "required": ["modifier"]  
        }  
      }  
    }  
  ],  
  "reference_answer": {  
    "tool_called": "StatRollAPI",  
    "tool_parameters": {  
      "modifier": 2  
    },  
    "expected_behavior": "Call StatRollAPI with modifier=2 and return the calculated stat value"  
  }  
}
```
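A matching reward function for this example could check that the response called the expected tool with the expected arguments. This is an illustrative sketch; how you extract the `tool_call` dictionary from the model response is up to your evaluator harness:

```python
def tool_use_reward(tool_call, reference):
    """1.0 for the right tool with the right parameters; partial credit for the right tool."""
    if tool_call.get("name") != reference["tool_called"]:
        return 0.0
    if tool_call.get("arguments") == reference["tool_parameters"]:
        return 1.0
    return 0.5   # right tool, wrong arguments

reference = {"tool_called": "StatRollAPI", "tool_parameters": {"modifier": 2}}
assert tool_use_reward({"name": "StatRollAPI", "arguments": {"modifier": 2}}, reference) == 1.0
assert tool_use_reward({"name": "StatRollAPI", "arguments": {"modifier": 5}}, reference) == 0.5
```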

## Field descriptions



| Field | Description | Additional notes | Required | 
| --- |--- |--- |--- |
| id | Unique identifier for this RFT example | String (for example, "sample-001"). Useful for tracking and deduplication. | No | 
| messages | Ordered list of chat messages that define the prompt and context | Array of objects. Model sees them in order. Typically starts with a system message, then user. | Yes | 
| messages[].role | Who is speaking in the message | Common values: "system", "user" (sometimes "assistant" in other contexts) | No | 
| messages[].content | The text content of the message | Plain string. For system it's instructions, for user it's the task or input. | No | 
| tools | Tool specifications available to the model during this example | Array. Each item defines a tool's interface and metadata. Types may include "function" or "internal". | No | 
| reference_answer | The expected model output for this example | String or object depending on task. Used as target for evaluation or training. | No | 

**Note**  
Any additional custom fields (for example, task_id, difficulty_level, context_data) are not validated and will be passed to your reward function as metadata.

## Hyperparameter guidance


Use the following recommended hyperparameters based on your training approach:

**General:**
+ Epochs: 1
+ Learning rate (lr): 1e-7
+ Number of generations: 8
+ Max new tokens: 8192
+ Batch size: 256

**LoRA (Low-Rank Adaptation):**
+ LoRA Rank: 32

**Note**  
Adjust these values based on your dataset size and validation performance. Monitor training metrics to prevent overfitting.
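As an illustration only, these defaults could be collected into a recipe's training configuration. The field names below are hypothetical placeholders, not the exact Nova recipe schema; always start from a published recipe file:

```
# Illustrative only: field names are placeholders, not the official recipe schema.
training_config:
  max_epochs: 1           # Epochs: 1
  lr: 1e-7                # Learning rate
  num_generations: 8      # Generations per prompt
  max_new_tokens: 8192
  global_batch_size: 256
  peft:
    peft_scheme: lora
    lora_rank: 32         # LoRA only
```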

## Additional properties


The `"additionalProperties": true` setting allows you to include custom fields beyond the core schema requirements, providing flexibility to add any data your reward function needs for proper evaluation.

### Common additional fields


You can include the following types of additional fields:

**Metadata:**
+ task_id: Unique identifier for tracking
+ difficulty_level: Problem complexity indicator
+ domain: Subject area or category
+ expected_reasoning_steps: Number of steps in solution

**Evaluation criteria:**
+ evaluation_criteria: Specific grading rubrics
+ custom_scoring_weights: Relative importance of different aspects
+ context_data: Background information for the problem
+ external_references: Links to relevant documentation or resources

### Example with additional properties


The following example includes custom metadata fields:

```
{  
  "id": "algebra_001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  },  
  "task_id": "algebra_001",  
  "difficulty_level": "easy",  
  "domain": "algebra",  
  "expected_reasoning_steps": 3  
}
```

## Dataset size recommendations


### Starting point


Begin with the following minimum dataset sizes:
+ Minimum 100 training examples
+ Minimum 100 evaluation examples

Prioritize high-quality input data and a reliable reward function that executes consistently on model responses.

### Evaluation-first approach


Before investing in large-scale RFT training, evaluate your model's baseline performance:
+ **High performance (greater than 95% reward)**: RFT may be unnecessary—your model already performs well
+ **Very poor performance (0% reward)**: Switch to SFT first to establish basic capabilities
+ **Moderate performance**: RFT is likely appropriate

This evaluation-first approach ensures your reward function is bug-free and determines if RFT is the right method for your use case. Starting small allows you to get comfortable with the RFT workflow, identify and fix issues early, validate your approach before scaling up, and test reward function reliability. Once validated, you can expand to larger datasets to further improve performance.

## Characteristics of effective training data


### Clarity and consistency


Good RFT examples require clear, unambiguous input data that enables accurate reward calculation across different model outputs. Avoid noise in your data, including:
+ Inconsistent formatting
+ Contradictory labels or instructions
+ Ambiguous prompts
+ Conflicting reference answers

Any ambiguity will mislead the training process and cause the model to learn unintended behaviors.

### Diversity


Your dataset should capture the full diversity of production use cases to ensure robust real-world performance. Include:
+ Various problem types and difficulty levels
+ Different input formats and edge cases
+ Representative samples from all expected scenarios

This diversity helps prevent overfitting and ensures the model handles unfamiliar inputs gracefully.

### Reward function considerations


Design your reward function for efficient training:
+ Execute within seconds (not minutes)
+ Parallelize effectively with Lambda
+ Return consistent, reliable scores
+ Handle different types of model outputs gracefully

Fast, scalable reward functions enable rapid iteration and cost-effective experimentation at scale.
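To make these considerations concrete, here is a minimal sketch of a Lambda reward handler. The event and response shapes (`samples`, `model_response`, `reference_answer`, `rewards`) are assumptions for illustration, not the documented RFT interface, and the scoring rule is deliberately trivial:

```
def lambda_handler(event, context):
    """Score each model response against its reference answer.

    Hypothetical event/response schema; replace the scoring logic with
    your own task-specific reward calculation.
    """
    rewards = []
    for sample in event.get("samples", []):
        response = sample.get("model_response", "").strip()
        reference = str(sample.get("reference_answer", "")).strip()
        # Illustrative scoring: full credit for an exact match,
        # partial credit if the reference appears in the response.
        if response == reference:
            score = 1.0
        elif reference and reference in response:
            score = 0.5
        else:
            score = 0.0
        rewards.append({"id": sample.get("id"), "reward": score})
    return {"rewards": rewards}
```

Keeping the handler free of slow external calls, as above, lets Lambda fan out scoring across many rollouts in parallel.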

# Monitoring RFT training


Monitor key metrics during training to ensure effective learning and identify potential issues early.

**Topics**
+ [Key metrics to track](#nova-hp-rft-monitoring-metrics)
+ [Evaluation after RFT](#nova-hp-rft-monitoring-evaluation)
+ [Using fine-tuned models](#nova-hp-rft-monitoring-checkpoints)
+ [Limitations and best practices](#nova-hp-rft-monitoring-limitations)
+ [Troubleshooting](#nova-hp-rft-monitoring-troubleshooting)

## Key metrics to track


Monitor the following metrics using MLflow during training:

**Reward metrics:**
+ **Average reward score**: Overall quality of model responses (should increase over time)
+ **Reward distribution**: Percentage of responses receiving high, medium, and low rewards
+ **Training vs. validation rewards**: Compare to detect overfitting

**Training metrics:**
+ **Policy updates**: Number of successful weight updates
+ **Rollout completion rate**: Percentage of samples successfully evaluated

**Concerning patterns:**
+ Rewards plateauing (indicates poor learning)
+ Validation rewards dropping while training rewards increase (overfitting)
+ Reward variance increasing significantly over time (instability)
+ High percentage of reward function errors (implementation issues)

**When to stop training:**
+ Target performance metrics are achieved
+ Rewards plateau and no longer improve
+ Validation performance degrades (overfitting detected)
+ Maximum training budget is reached
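As a rough illustration of the stopping signals above, a monitoring script might flag plateaus and overfitting from logged reward curves. The window size and minimum-gain threshold here are arbitrary example values, not recommended settings:

```
def should_stop(train_rewards, val_rewards, window=3, min_gain=0.01):
    """Return a stop reason based on recent reward history, or None."""
    if len(train_rewards) < window + 1:
        return None  # not enough history to judge a trend
    recent_gain = train_rewards[-1] - train_rewards[-1 - window]
    if recent_gain < min_gain:
        return "plateau: training reward is no longer improving"
    if len(val_rewards) > window and val_rewards[-1] < val_rewards[-1 - window]:
        return "overfitting: validation reward dropping while training improves"
    return None
```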

## Evaluation after RFT


After training completes, evaluate your fine-tuned model to assess performance improvements:
+ **Run RFT evaluation job**: Use the checkpoint from your RFT training as the model
+ **Compare to baseline**: Evaluate both base model and fine-tuned model on the same test set
+ **Analyze metrics**: Review task-specific metrics (accuracy, reward scores, etc.)
+ **Conduct qualitative review**: Manually inspect sample outputs for quality

For detailed evaluation procedures, see the Evaluation section.

## Using fine-tuned models


**Accessing checkpoints:**

After training completes, locate your checkpoint:

1. Navigate to your `output_path` in S3

1. Download and extract `output.tar.gz`

1. Open `manifest.json`

1. Copy the `checkpoint_s3_bucket` value
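The steps above can be sketched in Python. The tarball and manifest layout follow the description here; the S3 download itself (for example with boto3's `download_file`) is left out:

```
import json
import tarfile

def checkpoint_path_from_output(tar_path):
    """Read manifest.json from output.tar.gz and return the checkpoint S3 path."""
    with tarfile.open(tar_path, "r:gz") as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
    return manifest["checkpoint_s3_bucket"]
```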

**Deploying for inference:**

Use the checkpoint S3 path for inference or further training:

```
run:
    model_type: amazon.nova-2-lite-v1:0:256k
    model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

For deployment and inference instructions, refer to the Inference section.

## Limitations and best practices


**Current limitations:**

**Beta restrictions:**
+ You must create a new RIG group for RFT. This limitation will be resolved by GA.
+ Instance type requirements: Only P5 instances are supported (minimum 8x P5.48xlarge). Coming soon: support for smaller instance types (ETA: mid-January 2025).

**Functional limitations:**
+ 15-minute Lambda timeout: Reward functions must complete within 15 minutes
+ Single-turn only: Multi-turn conversations not supported
+ Validation datasets: Not supported during training. Use separate evaluation jobs to assess training progress.

**Training considerations:**
+ Low reward scenarios: May struggle when less than 5% of examples receive positive rewards - consider SFT first
+ Data requirements: Needs sufficient diversity to learn effectively
+ Computational cost: More expensive than supervised fine-tuning

**Nova Forge removes some of these limitations:**
+ Supports multi-turn conversations
+ Allows reward functions exceeding 15-minute timeouts
+ Provides advanced algorithms and tuning options
+ Designed for complex enterprise use cases, specifically tuned to build frontier models

**Best practices:**

**Start small and scale:**
+ Begin with minimal datasets (100-200 examples) and few training epochs
+ Validate your approach before scaling up
+ Gradually increase dataset size and training steps based on results

**Baseline with SFT first:**
+ If reward scores are consistently low (e.g., always 0), perform SFT before RFT
+ RFT requires reasonable baseline performance to improve effectively

**Design efficient reward functions:**
+ Execute in seconds, not minutes
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement proper error handling
+ Test thoroughly before training
+ Leverage Lambda's parallel scaling capabilities

**Monitor training actively:**
+ Track average reward scores over time
+ Watch reward distribution across samples
+ Compare training vs. validation rewards
+ Look for concerning patterns (plateaus, overfitting, instability)

**Iterate based on results:**
+ If rewards don't improve after several iterations, adjust reward function design
+ Increase dataset diversity to provide clearer learning signals
+ Consider switching to SFT if rewards remain near zero
+ Experiment with different hyperparameters (learning rate, batch size)

**Optimize data quality:**
+ Ensure diverse, representative examples
+ Include edge cases and difficult samples
+ Verify reward function correctly scores all example types
+ Remove or fix samples that confuse the reward function

## Troubleshooting


**Reward function errors:**

Symptoms: High error rate in reward function calls during training


| Issue | Symptoms | Resolution | 
| --- |--- |--- |
| Lambda timeout | Frequent timeouts after 15 minutes | Optimize function performance; consider Nova Forge for complex evaluations | 
| Insufficient concurrency | Lambda throttling errors | Increase lambda_concurrency_limit or request quota increase | 
| Invalid return format | Training fails with format errors | Verify return structure matches required interface format | 
| Unhandled exceptions | Intermittent errors | Add comprehensive error handling and logging | 
| External API failures | Inconsistent scoring | Implement retry logic and fallback strategies | 
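The retry-and-fallback pattern from the last table row might look like the following inside a reward function. The retry count, delay, and neutral fallback score are arbitrary example choices:

```
import time

def score_with_retries(call_api, fallback=0.0, attempts=3, delay=0.1):
    """Call an external scoring API with retries, falling back to a neutral score."""
    for attempt in range(attempts):
        try:
            return call_api()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay)  # brief backoff before retrying
    return fallback  # a consistent fallback keeps reward signals stable
```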

**Poor training performance:**

Symptoms: Rewards not improving or plateauing at low values

Resolutions:
+ **Verify reward function correctness**: Test with known good/bad examples
+ **Check baseline performance**: Evaluate base model; if near-zero accuracy, do SFT first
+ **Increase data diversity**: Add more varied examples covering different scenarios
+ **Adjust hyperparameters**: Try different learning rates or batch sizes
+ **Review reward signal quality**: Ensure rewards differentiate between good and bad responses

**Overfitting:**

Symptoms: Training rewards increase while validation rewards decrease

Resolutions:
+ **Reduce training steps**: Stop training earlier
+ **Increase dataset size**: Add more training examples
+ **Add regularization**: Adjust `weight_decay` or `entropy_coeff`
+ **Increase data diversity**: Ensure training set represents full distribution

# Evaluating your trained model
Evaluation

An evaluation recipe is a YAML configuration file that defines how your Amazon Nova model evaluation job is executed. With this recipe, you can assess the performance of a base or trained model against common benchmarks or your own custom datasets. Metrics can be stored in Amazon S3 or TensorBoard. The evaluation provides quantitative metrics that help you assess model performance across various tasks to determine if further customization is needed.

Model evaluation is an offline process, where models are tested against fixed benchmarks with predefined answers. Models are not assessed in real time or against live user interactions. For real-time evaluation, you can evaluate the model after it is deployed to Amazon Bedrock by calling the Amazon Bedrock runtime APIs.

**Important**  
The evaluation container only supports checkpoints produced by the same training platform. Checkpoints created with SageMaker HyperPod can only be evaluated using the SageMaker HyperPod evaluation workflow, and checkpoints created with SageMaker training jobs can only be evaluated using the SageMaker training jobs evaluation workflow. Attempting to evaluate a checkpoint from a different platform will result in failure.

**Topics**
+ [Available benchmark tasks](customize-fine-tune-evaluate-available-tasks.md)
+ [Understanding the recipe parameters](customize-fine-tune-evaluate-understand-modify.md)
+ [Evaluation recipe examples](customize-fine-tune-evaluate-recipe-examples.md)
+ [Starting an evaluation job](customize-fine-tune-evaluate-start-job.md)
+ [Accessing and analyzing evaluation results](customize-fine-tune-evaluate-access-results.md)
+ [RFT evaluation](nova-hp-evaluate-rft.md)

# Available benchmark tasks


A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code packages, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

Here is a list of the supported, available industry standard benchmarks. You can specify the following benchmarks in the `eval_task` parameter:


| Benchmark | Modality | Description | Metrics | Strategy | Subtask Available | 
| --- |--- |--- |--- |--- |--- |
| mmlu | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs_cot | Yes | 
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs_cot | No | 
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | Text | Quality-Control Task – Tests the model’s ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| IFEval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 
| gen_qa | Text | Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen_qa | No | 
| llm_judge | Text | LLM-as-a-Judge Preference Comparison – Uses an Amazon Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No | 
| humaneval | Text | HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models. | pass@1 | zs | No | 
| mm_llm_judge | Multi-modal (image) | This benchmark behaves the same as the text-based `llm_judge` above. The only difference is that it supports image inference. | all | judge | No | 
| rubric_llm_judge | Text | Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/), which only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions. | all | judge | No | 
| aime_2024 | Text | AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving. | exact_match | zs_cot | No | 
| calendar_scheduling | Text | Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people. | exact_match | fs | No | 

The following `mmlu` subtasks are available:

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

The following `bbh` subtasks are available:

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

The following `math` subtasks are available:

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```

# Understanding the recipe parameters


**Run configuration**  
The following is a general run configuration and an explanation of the parameters involved.

```
run:
  name: eval_job_name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod
  replicas: 1
  data_s3_path: ""
  output_s3_path: s3://output_path
  mlflow_tracking_uri: ""
  mlflow_experiment_name : ""
  mlflow_run_name : ""
```
+ `name`: (Required) A descriptive name for your evaluation job. This helps identify your job in the AWS console.
+ `model_type`: (Required) Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
  + `amazon.nova-micro-v1:0:128k`
  + `amazon.nova-lite-v1:0:300k`
  + `amazon.nova-pro-v1:0:300k`
  + `amazon.nova-2-lite-v1:0:256k`
+ `model_name_or_path`: (Required) The path to the base model or S3 path for the post-trained checkpoint. Options include:
  + `nova-micro/prod`
  + `nova-lite/prod`
  + `nova-pro/prod`
  + `nova-lite-2/prod`
  + (S3 path for the post-trained checkpoint) `s3://<escrow bucket>/<job id>/outputs/checkpoints`
+ `replicas`: (Required) The number of compute instances to use for distributed training. You must set this value to 1 because multi-node is not supported.
+ `data_s3_path`: (Required) The S3 path to the input dataset. Leave this parameter empty unless you are using the *bring your own dataset* or *LLM as a judge* recipe.
+ `output_s3_path`: (Required) The S3 path to store output evaluation artifacts. Note that the output S3 bucket must be created by the same account that is creating the job.
+ `mlflow_tracking_uri`: (Optional) MLflow tracking server ARN for tracking MLflow runs and experiments. Ensure that your SageMaker AI execution role has permission to access the tracking server.

**Evaluation configuration**  
The following is a model evaluation configuration and an explanation of the parameters involved.

```
evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: mathematics
  metric: accuracy
```
+ `task`: (Required) Specifies the evaluation benchmark or task to use.

  Supported task list:
  + mmlu
  + mmlu_pro
  + bbh
  + gpqa
  + math
  + strong_reject
  + gen_qa
  + ifeval
  + llm_judge
  + humaneval
  + mm_llm_judge
  + rubric_llm_judge
  + aime_2024
  + calendar_scheduling
+ `strategy`: (Required) Defines the evaluation approach:
  + zs_cot: Zero-shot Chain-of-Thought - An approach to prompting large language models that encourages step-by-step reasoning without requiring explicit examples.
  + zs: Zero-shot - An approach to solving a problem without any prior training examples.
  + gen_qa: A strategy specific to bring your own dataset recipes.
  + judge: A strategy specific to the Amazon Nova LLM as a Judge and mm_llm_judge recipes.
+ `subtask`: (Optional and Removable) Specifies a specific subtask for certain evaluation tasks. Remove this from your recipe if your task does not have any subtasks.
+ `metric`: (Required) The evaluation metric to use.
  + accuracy: Percentage of correct answers
  + exact_match: (For the `math` benchmark) Returns the rate at which the predicted strings exactly match their references.
  + deflection: (For the `strong_reject` benchmark) Returns the relative deflection from the base model and the difference in significance metrics.
  + pass@1: (For the `humaneval` benchmark) Measures the percentage of problems for which the model's first generated solution passes the unit tests.
  + `all`: Returns the following metrics:
    + For the `gen_qa` (bring your own dataset) benchmark, returns the following metrics:
      + `rouge1`: Measures the overlap of unigrams (single words) between generated and reference text.
      + `rouge2`: Measures the overlap of bigrams (two consecutive words) between generated and reference text.
      + `rougeL`: Measures the longest common subsequence between texts, allowing for gaps in the matching.
      + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
      + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
      + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
      + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
      + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.
    + For the `llm_judge` and `mm_llm_judge` (bring your own dataset) benchmarks, returns the following metrics:
      + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
      + `a_scores_stderr`: Standard error of `response_A scores` across pairwise judgements.
      + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
      + `b_scores_stderr`: Standard error of `response_B scores` across pairwise judgements.
      + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
      + `ties_stderr`: Standard error of ties across pairwise judgements.
      + `inference_error`: Count of judgements that could not be properly evaluated.
      + `inference_error_stderr`: Standard error of inference errors across judgements.
      + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
      + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
      + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using the Bradley-Terry probability.
      + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
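As an unofficial illustration of what a few of the `gen_qa` metrics measure, the sketch below implements `exact_match`, `quasi_exact_match`, and a token-overlap `f1_score`. The normalization rules here are a simplified guess; the evaluation container's exact behavior may differ:

```
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())

def exact_match(pred, ref):
    """1.0 only for a character-by-character match."""
    return 1.0 if pred == ref else 0.0

def quasi_exact_match(pred, ref):
    """More lenient: compare after normalization."""
    return 1.0 if normalize(pred) == normalize(ref) else 0.0

def f1_score(pred, ref):
    """Harmonic mean of token precision and recall."""
    pred_tokens, ref_tokens = normalize(pred).split(), normalize(ref).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```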

**Inference configuration**  
The following is an inference configuration and an explanation of the parameters involved. All parameters are optional.

```
inference:
  max_new_tokens: 200
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: The maximum number of tokens to generate. This must be an integer.
+ `top_k`: The number of highest probability tokens to consider. This must be an integer.
+ `top_p`: The cumulative probability threshold for token sampling. This must be a float between 0.0 and 1.0, inclusive.
+ `temperature`: Randomness in token selection. Larger values introduce more randomness. Use 0 to make the results deterministic. This value must be a float with a minimum value of 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are `null` (the default if not set; disables reasoning), `low`, or `high`.

Note that for `humaneval`, we recommend the following inference configuration:

```
inference:
  top_k: 1
  max_new_tokens: 1600
  temperature: 0.0
```

**MLflow configuration**  
The following is an MLflow configuration and an explanation of the parameters involved. All parameters are optional.

```
run:
  mlflow_tracking_uri: ""
  mlflow_experiment_name: ""
  mlflow_run_name: ""
```
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SageMaker HyperPod)
+ `mlflow_experiment_name`: (Optional) Name of the experiment to group related ML runs together
+ `mlflow_run_name`: (Optional) Custom name for a specific training run within an experiment

# Evaluation recipe examples


Amazon Nova provides four types of evaluation recipes, which are available in the SageMaker HyperPod recipes GitHub repository.

## General text benchmark recipes


These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format `xxx_general_text_benchmark_eval.yaml`.

## Bring your own dataset benchmark recipes


These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format `xxx_bring_your_own_dataset_eval.yaml`. 

The following are the bring your own dataset requirements:
+ File format requirements
  + You must include a single `gen_qa.jsonl` file containing evaluation examples.
  + Your dataset must be uploaded to an S3 location where SageMaker training job can access it.
  + The file must follow the required schema format for a general Q&A dataset.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `query`: (Required) String containing the question or instruction that needs an answer
  + `response`: (Required) String containing the expected model output
  + `system`: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
  + `metadata`: (Optional) String containing metadata associated with the entry for tagging purposes.

Here are example bring your own dataset entries:

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of the values:

```
evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 3.5k.
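A small pre-flight check against the schema and limits above can catch problems before you submit a job. The context-length check here is a rough character-count heuristic (assuming roughly 4 characters per token), not the service's actual token counting:

```
import json

def validate_gen_qa(path, max_context_chars=3500 * 4):
    """Check each line of a gen_qa.jsonl file against the required schema."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: not valid JSON")
                continue
            for field in ("query", "response"):
                if not isinstance(record.get(field), str):
                    errors.append(f"line {i}: missing required string field '{field}'")
            context = record.get("system", "") + record.get("query", "")
            if len(context) > max_context_chars:
                errors.append(f"line {i}: context likely exceeds the 3.5k limit")
    return errors
```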

## Nova LLM as a Judge benchmark recipes


Amazon Nova LLM as a Judge is a model evaluation feature that enables customers to compare the quality of responses from one model to a baseline model response on a custom dataset. It takes in a dataset with prompts, baseline responses, and challenger responses, and uses an Amazon Nova Judge model to provide a winrate metric based on [Bradley-Terry probability](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) with pairwise comparisons.

The recipes are provided in the format `xxx_llm_judge_eval.yaml`. 

The following are the LLM as a Judge requirements:
+ File format requirements
  + Include a single `llm_judge.jsonl` file containing evaluation examples. The file name must be `llm_judge.jsonl`.
  + Your dataset must be uploaded to an S3 location that the [SageMaker HyperPod RIG](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html) can access.
  + The file must follow the required schema format for the `llm_judge.jsonl` dataset.
  + The input dataset should ensure all records are under 12k context length.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `prompt`: (Required) A string containing the prompt for the generated response.
  + `response_A`: A string containing the baseline response.
  + `response_B`: A string containing the alternative response to be compared with the baseline response.

Here are example LLM as a Judge entries:

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, modify your evaluation recipe to include the following required fields, and don't change their values:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Amazon Nova Judge models are the same across all model family specifications (that is, Lite, Micro, and Pro).
+ Custom judge models are not supported at this time.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 7k.
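Before uploading, you can sanity-check the dataset locally. The following sketch is an illustrative assumption, not part of the evaluation service; the rough 4-characters-per-token heuristic is also an assumption. It flags records that break the schema or are likely to exceed the context limit:

```python
import json

REQUIRED_FIELDS = ("prompt", "response_A", "response_B")
MAX_CHARS = 7000 * 4  # rough heuristic: ~4 characters per token against the 7k limit

def validate_llm_judge_line(line: str) -> list:
    """Return a list of problems found in one JSONL record (empty if valid)."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str) or not record[field]:
            problems.append(f"missing or empty field: {field}")
    total_chars = sum(len(record.get(f, "")) for f in REQUIRED_FIELDS)
    if total_chars > MAX_CHARS:
        problems.append(f"record likely too long: {total_chars} characters")
    return problems

def validate_file(path: str) -> dict:
    """Map 1-based line numbers to problem lists for every bad record."""
    bad = {}
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.strip():
                problems = validate_llm_judge_line(line)
                if problems:
                    bad[lineno] = problems
    return bad
```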

## Nova LLM as a Judge for multi-modal (image) benchmark recipes


Nova LLM Judge for multi-modal (image), short for Amazon Nova MM LLM Judge, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses an Amazon Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ A single `mm_llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `mm_llm_judge.jsonl`.
+ You must upload your dataset to an S3 location where SageMaker Training Jobs can access it.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ All records in the input dataset should be under 12k context length, excluding the `images` attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following required fields:
+ `prompt`: String containing the prompt for the generated response.
+ `images`: Array containing a list of objects with `data` attributes (values are Base64-encoded image strings).
+ `response_A`: String containing the baseline response.
+ `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat.",
}
{
  "prompt": "how many animals in echo of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one aminal and the second has one animal",
}
```
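A record in this format can be assembled programmatically. The following sketch (the function name and the JPEG media type are assumptions for illustration) Base64-encodes local image files into the `images` attribute and emits one single-line JSONL record:

```python
import base64
import json

def make_mm_record(prompt: str, image_paths: list, response_a: str, response_b: str) -> str:
    """Build one mm_llm_judge JSONL line with Base64-encoded inline images."""
    images = []
    for path in image_paths:
        with open(path, "rb") as fh:
            encoded = base64.b64encode(fh.read()).decode("ascii")
        images.append({"data": f"data:image/jpeg;Base64,{encoded}"})
    record = {
        "prompt": prompt,
        "images": images,
        "response_A": response_a,
        "response_B": response_b,
    }
    # Each record must occupy a single line in the .jsonl file.
    return json.dumps(record)
```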

To use your custom dataset, modify your evaluation recipe to include the following required fields, and don't change their values:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models support image inputs only.
+ Nova MM Judge models are the same across Amazon Nova Lite specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URIs are not supported; images must be inline Base64-encoded strings.
+ All records in the input dataset should be under 12k context length, excluding the `images` attribute.

## Rubric Based Judge


Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:
+ **Dynamic criteria generation**: Automatically creates relevant evaluation dimensions based on the input prompt
+ **Weighted scoring**: Assigns importance weights to each criterion to reflect their relative significance
+ **Granular assessment**: Provides detailed scores on a binary (true/false) or scale (1-5) basis for each criterion
+ **Quality metrics**: Calculates continuous quality scores (0-1 scale) that quantify the magnitude of differences between responses

Example criterion generated by the model:

```
price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3
```

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.

**Topics**
+ [Recipe configuration](#nova-hp-evaluate-rubric-judge-recipe)
+ [Input dataset format](#nova-hp-evaluate-rubric-judge-input)
+ [Evaluation output](#nova-hp-evaluate-rubric-judge-output)
+ [Reasoning model support](#nova-hp-evaluate-rubric-judge-reasoning)

### Recipe configuration


**Rubric Judge recipe**  
Enable Rubric Judge by setting `task: rubric_llm_judge` in your recipe:

```
run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Original LLM as a Judge recipe (for comparison)**  
The original judge model uses `task: llm_judge`:

```
run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

### Input dataset format


The input dataset format is identical to the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/):

**Required fields:**
+ `prompt`: String containing the input prompt and instructions
+ `response_A`: String containing the baseline model output
+ `response_B`: String containing the customized model output

**Example dataset (JSONL format):**

```
{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}
```

**Format requirements:**
+ Each entry must be a single-line JSON object
+ Separate entries with newlines
+ Follow the exact field naming as shown in examples

### Evaluation output


**Output structure**  
Rubric Judge produces enhanced evaluation metrics compared to the original judge model:

```
{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
```

**New metrics in Rubric Judge**  
The following six metrics are unique to Rubric Judge and provide granular quality assessment:


| Metric | Description | 
| --- |--- |
| `weighted_score_A` | Average normalized quality score for response A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| `weighted_score_A_stderr` | Standard error of the mean for `weighted_score_A`, indicating statistical uncertainty | 
| `weighted_score_B` | Average normalized quality score for response B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| `weighted_score_B_stderr` | Standard error of the mean for `weighted_score_B`, indicating statistical uncertainty | 
| `score_margin` | Difference between weighted scores (calculated as `weighted_score_A - weighted_score_B`). Range: -1.0 to 1.0. Positive = response A is better; negative = response B is better; near zero = similar quality | 
| `score_margin_stderr` | Standard error of the mean for `score_margin`, indicating uncertainty in the quality difference measurement | 

**Understanding weighted score metrics**  
**Purpose**: Weighted scores provide continuous quality measurements that complement binary preference verdicts, enabling deeper insights into model performance.

**Key differences from original judge**:
+ **Original judge**: Only outputs discrete preferences (A>B, B>A, A=B)
+ **Rubric Judge**: Outputs both preferences AND continuous quality scores (0-1 scale) based on custom criteria

**Interpreting `score_margin`**:
+ `score_margin = -0.128`: Response B scored 12.8 percentage points higher than response A
+ `|score_margin| < 0.1`: Narrow quality difference (close decision)
+ `|score_margin| > 0.2`: Clear quality difference (confident decision)

**Use cases**:
+ **Model improvement**: Identify specific areas where your model underperforms
+ **Quality quantification**: Measure the magnitude of performance gaps, not just win/loss ratios
+ **Confidence assessment**: Distinguish between close decisions and clear quality differences

**Important**  
Final verdicts are still based on the judge model's explicit preference labels to preserve holistic reasoning and ensure proper position bias mitigation through forward/backward evaluation. Weighted scores serve as observability tools, not as replacements for the primary verdict.

**Calculation methodology**  
Weighted scores are computed through the following process:
+ **Extract criterion data**: Parse the judge's YAML output to extract criterion scores and weights
+ **Normalize scores**:
  + Scale-type criteria (1-5): Normalize to 0-1 by calculating `(score - 1) / 4`
  + Binary criteria (true/false): Convert to 1.0/0.0
+ **Apply weights**: Multiply each normalized score by its criterion weight
+ **Aggregate**: Sum all weighted scores for each response
+ **Calculate margin**: Compute `score_margin = weighted_score_A - weighted_score_B`

**Example**: If response A has a weighted sum of 0.65 and response B has 0.78, the `score_margin` would be -0.13, indicating response B is 13 percentage points higher in quality across all weighted criteria.
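The calculation methodology above can be sketched in a few lines. The criteria dictionary shape mirrors the earlier `price_validation` example; parsing the judge's YAML output is omitted, and the specific criteria names and scores are illustrative assumptions:

```python
def normalize(score, criterion_type: str) -> float:
    """Normalize a criterion score to the 0-1 range."""
    if criterion_type == "binary":
        return 1.0 if score else 0.0
    return (score - 1) / 4  # scale-type criteria are scored 1-5

def weighted_score(criteria: dict, scores: dict) -> float:
    """Sum weight * normalized score over all criteria for one response."""
    return sum(
        spec["weight"] * normalize(scores[name], spec["type"])
        for name, spec in criteria.items()
    )

# Hypothetical criteria generated by the judge for one prompt
criteria = {
    "price_validation": {"type": "scale", "weight": 0.3},
    "handles_errors": {"type": "binary", "weight": 0.7},
}
score_a = weighted_score(criteria, {"price_validation": 5, "handles_errors": False})
score_b = weighted_score(criteria, {"price_validation": 3, "handles_errors": True})
score_margin = score_a - score_b  # positive => response A is better
```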

### Reasoning model support


Reasoning model support enables evaluation with reasoning-capable Amazon Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**:
+ amazon.nova-2-lite-v1:0:256k

**Recipe configuration**  
Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for SageMaker HyperPod job
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod job (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Using the `reasoning_effort` parameter**  
The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

**Prerequisites**:
+ **Model compatibility**: Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling**: Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`

**Available options**:


| Option | Behavior | Token Limit | Use Case | 
| --- |--- |--- |--- |
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 

**When to enable reasoning**  
**Use reasoning mode (`low`, `medium`, or `high`) for**:
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (omit parameter) for**:
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

**Troubleshooting**  
**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# Starting an evaluation job


The following provides a suggested evaluation instance type and model type configuration:

```
# Install Dependencies (Helm - https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

# Install the SageMaker HyperPod CLI
git clone --recurse-submodules https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
git checkout release_v2
pip install .

# Verify the installation
hyperpod --help

# Connect to a SageMaker HyperPod Cluster
hyperpod connect-cluster --cluster-name cluster-name


# Submit the Job using the recipe for eval
# Namespace by default should be kubeflow
hyperpod start-job [--namespace namespace] --recipe evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval --override-parameters \
'{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest",
    "recipes.run.name": "custom-run-name",
    "recipes.run.model_type": "model-type",
    "recipes.run.model_name_or_path": "model name or fine-tuned checkpoint S3 URI",
    "recipes.run.data_s3_path": "input data S3 path (gen_qa and llm_judge only; must be the full S3 path, including the filename)"
}'

# List jobs
hyperpod list-jobs [--namespace namespace] [--all-namespaces]

# Getting Job details
hyperpod get-job --job-name job-name [--namespace namespace] [--verbose]

# Listing Pods
hyperpod list-pods --job-name job-name --namespace namespace

# Cancel Job
hyperpod cancel-job --job-name job-name [--namespace namespace]
```

You should also be able to view the job status through the Amazon EKS cluster console.

# Accessing and analyzing evaluation results


After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Based on the `output_s3_path` (such as `s3://output_path/`) defined in the recipe, the output structure is the following:

```
job_name/
├── eval-result/
│    └── results_[timestamp].json
│    └── inference_output.jsonl (only present for gen_qa)
│    └── details/
│        └── model/
│            └── execution-date-time/
│                └──details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```

Metrics results are stored in the specified S3 output location: `s3://output_path/job_name/eval-result/results_[timestamp].json`.

TensorBoard results are stored in the S3 path `s3://output_path/job_name/tensorboard-results/eval/events.out.tfevents.[timestamp]`.

All inference outputs, except for `llm_judge` and `strong_reject`, are stored as parquet files under `s3://output_path/job_name/eval-result/details/model/`.

For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
+ `prompt` - The final prompt submitted to the model
+ `inference` - The raw inference output from the model
+ `gold` - The target response from the input dataset
+ `metadata` - The metadata string from the input dataset, if provided
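As a quick local sanity check on a downloaded `inference_output.jsonl`, a sketch like the following can summarize the outputs. The exact-match rate is just one illustrative metric, and downloading the file from S3 is assumed to have happened already:

```python
import json

def exact_match_rate(jsonl_path: str) -> float:
    """Fraction of gen_qa records whose inference exactly matches the gold answer."""
    total = matches = 0
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            total += 1
            if record["inference"].strip() == record["gold"].strip():
                matches += 1
    return matches / total if total else 0.0
```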

To visualize your evaluation metrics in TensorBoard, complete the following steps:

1. Navigate to SageMaker AI TensorBoard.

1. Select **S3 folders**.

1. Add your S3 folder path, for example `s3://output_path/job_name/tensorboard-results/eval`.

1. Wait for synchronization to complete.

The time series, scalars, and text visualizations are available.

We recommend the following best practices:
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

You can find SageMaker HyperPod job error logs in the CloudWatch log group `/aws/sagemaker/Clusters/cluster-id`.

## Log Probability Output Format


When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
"Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
"Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
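Picking the most likely candidate at a token position is then a matter of comparing `logprob_value` entries, as in this sketch (the helper name is illustrative):

```python
import math

def top_candidate(logprobs: dict) -> tuple:
    """Return (decoded_value, probability) of the most likely candidate token."""
    key = max(logprobs, key=lambda k: logprobs[k]["logprob_value"])
    entry = logprobs[key]
    # Convert from log probability back to probability for readability
    return entry["decoded_value"], math.exp(entry["logprob_value"])

# One token position in the structure shown above
position = {
    "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
    "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"},
}
decoded, prob = top_candidate(position)  # " the" has the higher log probability
```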

# RFT evaluation


**Note**  
Evaluation via remote reward functions in your own AWS environment is only available if you are an Amazon Nova Forge customer.

**Important**  
The `rl_env` configuration field is used exclusively for evaluation, not for training. During training, you configure reward functions using `reward_lambda_arn` (single-turn) or BYOO infrastructure with `rollout.delegate: true` (multi-turn).

**What is RFT Evaluation?**  
RFT Evaluation allows you to assess your model's performance using custom reward functions before, during, or after reinforcement learning training. Unlike standard evaluations that use pre-defined metrics, RFT Evaluation lets you define your own success criteria through a Lambda function that scores model outputs based on your specific requirements.

**Why Evaluate with RFT?**  
Evaluation is crucial to determine whether the RL fine-tuning process has:
+ Improved model alignment with your specific use case and human values
+ Maintained or improved model capabilities on key tasks
+ Avoided unintended side effects such as reduced factuality, increased verbosity, or degraded performance on other tasks
+ Met your custom success criteria as defined by your reward function

**When to Use RFT Evaluation**  
Use RFT Evaluation in these scenarios:
+ Before RFT Training: Establish baseline metrics on your evaluation dataset
+ During RFT Training: Monitor training progress with intermediate checkpoints
+ After RFT Training: Validate that the final model meets your requirements
+ Comparing Models: Evaluate multiple model versions using consistent reward criteria

**Note**  
Use RFT Evaluation when you need custom, domain-specific metrics. For general-purpose evaluation (accuracy, perplexity, BLEU), use standard evaluation methods.

**Topics**
+ [Data format requirements](#nova-hp-evaluate-rft-data-format)
+ [Preparing your evaluation recipe](#nova-hp-evaluate-rft-recipe)
+ [Preset reward functions](#nova-hp-evaluate-rft-preset)
+ [Creating your reward function](#nova-hp-evaluate-rft-create-function)
+ [IAM permissions](#nova-hp-evaluate-rft-iam)
+ [Executing the evaluation job](#nova-hp-evaluate-rft-execution)
+ [Understanding evaluation results](#nova-hp-evaluate-rft-results)

## Data format requirements


**Input data structure**  
RFT evaluation input data must follow the OpenAI Reinforcement Fine-Tuning format. Each example is a JSON object containing:
+ `messages`: Array of conversational turns with `system` and `user` roles
+ Optional additional metadata, such as `reference_answer`

**Data format example**  
The following example shows the required format:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Solve for x. Return only JSON like {\"x\": <number>}. Equation: 2x + 5 = 13"
        }
      ]
    }
  ],
  "reference_answer": {
    "x": 4
  }
}
```
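Records in this format can be generated programmatically; the following sketch (the helper name is an illustrative assumption) builds one single-turn record as a JSONL line:

```python
import json

def make_rft_record(user_text: str, reference_answer) -> str:
    """Build one single-turn RFT evaluation record as a JSONL line."""
    record = {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": user_text}],
            }
        ],
        "reference_answer": reference_answer,
    }
    # One JSON object per line, per the JSONL requirement
    return json.dumps(record)
```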

**Current limitations**  
The following limitations apply to RFT evaluation:
+ Text only: No multimodal inputs (images, audio, video) are supported
+ Single-turn conversations: Only supports single user message (no multi-turn dialogues)
+ JSON format: Input data must be in JSONL format (one JSON object per line)
+ Model outputs: Evaluation is performed on generated completions from the specified model

## Preparing your evaluation recipe


**Sample recipe configuration**  
The following example shows a complete RFT evaluation recipe:

```
run:
  name: nova-lite-rft-eval-job
  model_type: amazon.nova-lite-v1:0:300k
  model_name_or_path: s3://escrow_bucket/model_location    # [MODIFIABLE] S3 path to your model or model identifier
  replicas: 1                                             # [MODIFIABLE] For SageMaker Training jobs only; fixed for SageMaker HyperPod jobs
  data_s3_path: ""                                        # [REQUIRED FOR HYPERPOD] Leave empty for SageMaker Training jobs
  output_s3_path: ""                                      # [REQUIRED] Output artifact S3 path for evaluation results

evaluation:
  task: rft_eval                                          # [FIXED] Do not modify
  strategy: rft_eval                                      # [FIXED] Do not modify
  metric: all                                             # [FIXED] Do not modify

# Inference Configuration
inference:
  max_new_tokens: 8196                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                               # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                              # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                          # [MODIFIABLE] Sampling temperature (0 = deterministic)
  top_logprobs: 0

# Evaluation Environment Configuration (NOT used in training)
rl_env:
  reward_lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<reward-function-name>
```

## Preset reward functions


Two preset reward functions (`prime_code` and `prime_math`) are available as a Lambda layer for easy integration with your RFT Lambda functions.

**Overview**  
These preset functions provide out-of-the-box evaluation capabilities for:
+ **prime_code**: Code generation and correctness evaluation
+ **prime_math**: Mathematical reasoning and problem-solving evaluation

**Quick setup**  
To use preset reward functions:

1. Download the Lambda layer from the [nova-custom-eval-sdk releases](https://github.com/aws/nova-custom-eval-sdk/releases)

1. Publish Lambda layer using AWS CLI:

   ```
   aws lambda publish-layer-version \
       --layer-name preset-function-layer \
       --description "Preset reward function layer with dependencies" \
       --zip-file fileb://universal_reward_layer.zip \
       --compatible-runtimes python3.9 python3.10 python3.11 python3.12 \
       --compatible-architectures x86_64 arm64
   ```

1. Add the layer to your Lambda function in the AWS Console (select `preset-function-layer` from custom layers, and also add `AWSSDKPandas-Python312` for NumPy dependencies)

1. Import and use in your Lambda code:

   ```
   from prime_code import compute_score  # For code evaluation
   from prime_math import compute_score  # For math evaluation
   ```

**prime_code function**  
**Purpose**: Evaluates Python code generation tasks by executing code against test cases and measuring correctness.

**Example input dataset format from evaluation**:

```
{"messages":[{"role":"user","content":"Write a function that returns the sum of two numbers."}],"reference_answer":{"inputs":["3\n5","10\n-2","0\n0"],"outputs":["8","8","0"]}}
{"messages":[{"role":"user","content":"Write a function to check if a number is even."}],"reference_answer":{"inputs":["4","7","0","-2"],"outputs":["True","False","True","True"]}}
```

**Key features**:
+ Automatic code extraction from markdown code blocks
+ Function detection and call-based testing
+ Test case execution with timeout protection
+ Syntax validation and compilation checks
+ Detailed error reporting with tracebacks
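The execution-based scoring idea can be illustrated with a minimal stdin/stdout test runner. This is not the `prime_code` implementation, only a sketch of the approach, using the `inputs`/`outputs` shape from the dataset example above:

```python
import subprocess
import sys

def run_test_cases(program: str, inputs: list, outputs: list, timeout: float = 5.0) -> float:
    """Fraction of stdin/stdout test cases a candidate Python program passes."""
    passed = 0
    for stdin_text, expected in zip(inputs, outputs):
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,  # guard against infinite loops
            )
        except subprocess.TimeoutExpired:
            continue  # a hung test case counts as a failure
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(inputs) if inputs else 0.0
```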

**prime_math function**  
**Purpose**: Evaluates mathematical reasoning and problem-solving capabilities with symbolic math support.

**Input format**:

```
{"messages":[{"role":"user","content":"What is the derivative of x^2 + 3x?."}],"reference_answer":"2*x + 3"}
```

**Key features**:
+ Symbolic math evaluation using SymPy
+ Multiple answer formats (LaTeX, plain text, symbolic)
+ Mathematical equivalence checking
+ Expression normalization and simplification

**Best practices**  
Follow these best practices when using preset reward functions:
+ Use proper data types in test cases (integers vs strings, booleans vs "True")
+ Provide clear function signatures in code problems
+ Include edge cases in test inputs (zero, negative numbers, empty inputs)
+ Format math expressions consistently in reference answers
+ Test your reward function with sample data before deployment

## Creating your reward function


**Lambda ARN**  
The Lambda ARN must match the following naming pattern:

```
"arn:aws:lambda:*:*:function:*SageMaker*"
```

If the Lambda does not have this naming scheme, the job will fail with this error:

```
[ERROR] Unexpected error: lambda_arn must contain one of: ['SageMaker', 'sagemaker', 'Sagemaker'] when running on SMHP platform (Key: lambda_arn)
```

**Lambda function structure**  
Your Lambda function receives batches of model outputs and returns reward scores. Below is a sample implementation:

```
from typing import List, Any
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class MetricResult:
    """Individual metric result."""
    name: str
    value: float
    type: str


@dataclass
class RewardOutput:
    """Reward service output."""
    id: str
    aggregate_reward_score: float
    metrics_list: List[MetricResult]


def lambda_handler(event, context):
    """ Main lambda handler """
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """ Core grader function """
    scores: List[RewardOutput] = []

    for sample in samples:
        print("Sample: ", json.dumps(sample, indent=2))

        # Extract components
        idx = sample.get("id", "no id")
        if not idx or idx == "no id":
            print(f"ID is None/empty for sample: {sample}")

        ground_truth = sample.get("reference_answer")

        if "messages" not in sample:
            print(f"Messages is None/empty for id: {idx}")
            continue

        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            continue

        # Get model's response (last turn is assistant turn)
        last_message = sample["messages"][-1]

        if last_message["role"] != "nova_assistant":
            print(f"Last message is not from assistant for id: {idx}")
            continue

        if "content" not in last_message:
            print(f"Completion text is empty for id: {idx}")
            continue

        content = last_message["content"]
        if isinstance(content, str):
            model_text = content
        else:
            # Content arrives as a list of {"type": "text", "text": ...} blocks
            # (see the Lambda request format below)
            model_text = " ".join(
                part["text"] for part in content if part.get("type") == "text"
            )

        # --- Actual scoring logic (lexical overlap) ---
        ground_truth_text = _extract_ground_truth_text(ground_truth)

        # Calculate main score and individual metrics
        overlap_score = _lexical_overlap_score(model_text, ground_truth_text)

        # Report overlap as accuracy plus a separate fluency heuristic
        accuracy_score = overlap_score  # Use overlap as accuracy
        fluency_score = _calculate_fluency(model_text)

        # Create individual metrics
        metrics_list = [
            MetricResult(name="accuracy", value=accuracy_score, type="Metric"),
            MetricResult(name="fluency", value=fluency_score, type="Reward")
        ]

        ro = RewardOutput(
            id=idx,
            aggregate_reward_score=overlap_score,
            metrics_list=metrics_list
        )

        print(f"Response for id: {idx} is {ro}")
        scores.append(ro)

    # Convert to dict format
    result = []
    for score in scores:
        result.append({
            "id": score.id,
            "aggregate_reward_score": score.aggregate_reward_score,
            "metrics_list": [asdict(metric) for metric in score.metrics_list]
        })

    return result


def _extract_ground_truth_text(ground_truth: Any) -> str:
    """
    Turn the `ground_truth` field into a plain string.
    """
    if isinstance(ground_truth, str):
        return ground_truth

    if isinstance(ground_truth, dict):
        # Common patterns: { "explanation": "...", "answer": "..." }
        if "explanation" in ground_truth and isinstance(ground_truth["explanation"], str):
            return ground_truth["explanation"]
        if "answer" in ground_truth and isinstance(ground_truth["answer"], str):
            return ground_truth["answer"]
        # Fallback: stringify the whole dict
        return json.dumps(ground_truth, ensure_ascii=False)

    # Fallback: stringify anything else
    return str(ground_truth)


def _tokenize(text: str) -> List[str]:
    # Very simple tokenizer: lowercase + alphanumeric word chunks
    return re.findall(r"\w+", text.lower())


def _lexical_overlap_score(model_text: str, ground_truth_text: str) -> float:
    """
    Simple lexical overlap score in [0, 1]:
      score = |tokens(model) ∩ tokens(gt)| / |tokens(gt)|
    """
    gt_tokens = _tokenize(ground_truth_text)
    model_tokens = _tokenize(model_text)

    if not gt_tokens:
        return 0.0

    gt_set = set(gt_tokens)
    model_set = set(model_tokens)
    common = gt_set & model_set

    return len(common) / len(gt_set)


def _calculate_fluency(text: str) -> float:
    """
    Calculate a simple fluency score based on:
    - Average word length
    - Text length
    - Sentence structure

    Returns a score between 0 and 1.
    """
    # Simple implementation - could be enhanced with more sophisticated NLP
    words = _tokenize(text)

    if not words:
        return 0.0

    # Average word length normalized to [0,1] range
    # Assumption: average English word is ~5 chars, so normalize around that
    avg_word_len = sum(len(word) for word in words) / len(words)
    word_len_score = min(avg_word_len / 10, 1.0)

    # Text length score - favor reasonable length responses
    ideal_length = 100  # words
    length_score = min(len(words) / ideal_length, 1.0)

    # Simple sentence structure check (periods, question marks, etc.)
    sentence_count = len(re.findall(r'[.!?]+', text)) + 1
    sentence_ratio = min(sentence_count / (len(words) / 15), 1.0)

    # Combine scores
    fluency_score = (word_len_score + length_score + sentence_ratio) / 3

    return fluency_score
```
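
You can sanity-check the scoring logic locally before deploying. The sketch below duplicates the `_tokenize` and `_lexical_overlap_score` helpers from the handler above so that it runs without any Lambda scaffolding:

```python
import re

def _tokenize(text: str) -> list[str]:
    # Same tokenizer as the handler above: lowercase alphanumeric chunks
    return re.findall(r"\w+", text.lower())

def lexical_overlap(model_text: str, ground_truth_text: str) -> float:
    # score = |tokens(model) ∩ tokens(gt)| / |tokens(gt)|
    gt_tokens = set(_tokenize(ground_truth_text))
    if not gt_tokens:
        return 0.0
    return len(gt_tokens & set(_tokenize(model_text))) / len(gt_tokens)

# 3 of the 4 distinct ground-truth tokens appear in the model output
print(lexical_overlap("The cat sat on the mat", "a cat on a mat"))  # → 0.75
```

Checking a few hand-picked pairs like this makes it easier to interpret aggregate reward scores later.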

**Lambda request format**  
Your Lambda function receives data in this format:

```
[
  {
    "id": "sample-001",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Do you have a dedicated security team?"
          }
        ]
      },
      {
        "role": "nova_assistant",
        "content": [
          {
            "type": "text",
            "text": "As an AI developed by Company, I don't have a dedicated security team in the traditional sense. However, the development and deployment of AI systems like me involve extensive security measures, including data encryption, user privacy protection, and other safeguards to ensure safe and responsible use."
          }
        ]
      }
    ],
    "reference_answer": {
      "compliant": "No",
      "explanation": "As an AI developed by Company, I do not have a traditional security team. However, the deployment involves stringent safety measures, such as encryption and privacy safeguards."
    }
  }
]
```

**Note**  
The message structure includes the nested `content` array, matching the input data format. The last message with role `nova_assistant` contains the model's generated response.

**Lambda response format**  
Your Lambda function must return data in this format:

```
[
  {
    "id": "sample-001",
    "aggregate_reward_score": 0.75,
    "metrics_list": [
      {
        "name": "accuracy",
        "value": 0.85,
        "type": "Metric"
      },
      {
        "name": "fluency",
        "value": 0.90,
        "type": "Reward"
      }
    ]
  }
]
```

**Response fields**:
+ `id`: Must match the input sample ID
+ `aggregate_reward_score`: Overall score (typically 0.0 to 1.0)
+ `metrics_list`: Array of individual metrics with:
  + `name`: Metric identifier (e.g., "accuracy", "fluency")
  + `value`: Metric score (typically 0.0 to 1.0)
  + `type`: Either "Metric" (for reporting) or "Reward" (used in training)
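
A quick local check that your handler's output conforms to this contract can catch shape errors before a training run. The `validate_reward_response` helper below is illustrative, not part of any SDK:

```python
def validate_reward_response(request: list[dict], response: list[dict]) -> list[str]:
    """Return problems found in a Lambda reward response (per the format above)."""
    problems = []
    request_ids = {sample["id"] for sample in request}
    for item in response:
        if item.get("id") not in request_ids:
            problems.append(f"id not in request batch: {item.get('id')}")
        if not isinstance(item.get("aggregate_reward_score"), (int, float)):
            problems.append("aggregate_reward_score must be numeric")
        for metric in item.get("metrics_list", []):
            if metric.get("type") not in ("Metric", "Reward"):
                problems.append(f"invalid metric type: {metric.get('type')}")
    return problems

request = [{"id": "sample-001", "messages": []}]
response = [{"id": "sample-001", "aggregate_reward_score": 0.75,
             "metrics_list": [{"name": "accuracy", "value": 0.85, "type": "Metric"}]}]
print(validate_reward_response(request, response))  # → []
```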

## IAM permissions


**Required permissions**  
Your SageMaker AI execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker AI execution role:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"
    }
  ]
}
```

**Lambda execution role**  
Your Lambda function's execution role needs basic Lambda execution permissions:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

**Additional permissions**: If your Lambda function accesses other AWS services (e.g., Amazon S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.
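
If you generate the invoke policy programmatically (for example, in a setup script), a small helper can fill in the region, account, and function-name placeholders from the policy above. This sketch only builds the JSON document; attaching it to the role is left to your IAM tooling, and `invoke_policy` is a hypothetical helper name:

```python
import json

def invoke_policy(region: str, account_id: str, function_name: str) -> str:
    """Build the SageMaker execution-role policy shown above for one Lambda.

    Reminder: function_name must contain "SageMaker" per the ARN naming
    requirement described earlier in this section.
    """
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction"],
            "Resource": f"arn:aws:lambda:{region}:{account_id}:function:{function_name}",
        }],
    }, indent=2)

print(invoke_policy("us-east-1", "111122223333", "SageMakerRewardFn"))
```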

## Executing the evaluation job


1. **Prepare your data**
   + Format your evaluation data according to the data format requirements
   + Upload your JSONL file to Amazon S3: `s3://your-bucket/eval-data/eval_data.jsonl`

1. **Configure your recipe**

   Update the sample recipe with your configuration:
   + Set `model_name_or_path` to your model location
   + Set `lambda_arn` to your reward function ARN
   + Set `output_s3_path` to your desired output location
   + Adjust `inference` parameters as needed

   Save the recipe as `rft_eval_recipe.yaml`

1. **Run the evaluation**

   Execute the evaluation job using the provided notebook: [Nova model evaluation notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook)

1. **Monitor progress**

   Monitor your evaluation job through:
   + SageMaker AI Console: Check job status and logs
   + CloudWatch Logs: View detailed execution logs
   + Lambda Logs: Debug reward function issues

## Understanding evaluation results


**Output format**  
The evaluation job outputs results to your specified Amazon S3 location in JSONL format. Each line contains the evaluation results for one sample:

```
{
  "id": "sample-001",
  "aggregate_reward_score": 0.75,
  "metrics_list": [
    {
      "name": "accuracy",
      "value": 0.85,
      "type": "Metric"
    },
    {
      "name": "fluency",
      "value": 0.90,
      "type": "Reward"
    }
  ]
}
```

**Note**  
The RFT evaluation job output is identical to the Lambda response format. The evaluation service passes your Lambda function's response through without modification, ensuring consistency between your reward calculations and the final results.

**Interpreting results**  
**Aggregate Reward Score**:
+ Range: Typically 0.0 (worst) to 1.0 (best), but depends on your implementation
+ Purpose: Single number summarizing overall performance
+ Usage: Compare models, track improvement over training

**Individual Metrics**:
+ Metric Type: Informational metrics for analysis
+ Reward Type: Metrics used during RFT training
+ Interpretation: Higher values generally indicate better performance (unless you design inverse metrics)

**Performance benchmarks**  
What constitutes "good" performance depends on your use case:


| Score Range | Interpretation | Action | 
| --- |--- |--- |
| 0.8 - 1.0 | Excellent | Model ready for deployment | 
| 0.6 - 0.8 | Good | Minor improvements may be beneficial | 
| 0.4 - 0.6 | Fair | Significant improvement needed | 
| 0.0 - 0.4 | Poor | Review training data and reward function | 

**Important**  
These are general guidelines. Define your own thresholds based on business requirements, baseline model performance, domain-specific constraints, and cost-benefit analysis of further training.
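
As a starting point for analysis, you can aggregate the per-sample results and map the mean onto the guideline buckets above. This is an illustrative sketch; `summarize_results` is not a provided utility:

```python
import json

def summarize_results(jsonl_lines: list[str]) -> tuple[float, str]:
    """Mean aggregate score plus the guideline bucket from the table above."""
    scores = [json.loads(line)["aggregate_reward_score"] for line in jsonl_lines]
    mean = sum(scores) / len(scores)
    if mean >= 0.8:
        bucket = "Excellent"
    elif mean >= 0.6:
        bucket = "Good"
    elif mean >= 0.4:
        bucket = "Fair"
    else:
        bucket = "Poor"
    return mean, bucket

lines = [
    '{"id": "sample-001", "aggregate_reward_score": 0.75}',
    '{"id": "sample-002", "aggregate_reward_score": 0.65}',
]
mean, bucket = summarize_results(lines)
print(bucket)  # → Good
```

In practice you would read the lines from the output JSONL file in S3 and substitute your own thresholds.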

# Monitoring HyperPod jobs with MLflow
MLflow monitoring

You can use MLflow to track and monitor your training jobs on SageMaker HyperPod. Follow these steps to set up MLflow and connect it to your training recipes.

***Create the MLflow App***

Example AWS CLI command

```
aws sagemaker-mlflow create-mlflow-app \
    --name <app-name> \
    --artifact-store-uri <s3-bucket-name> \
    --role-arn <role-arn> \
    --region <region-name>
```

Example output

```
{
    "Arn": "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ"
}
```

***Generate presigned URL***

Example AWS CLI command

```
aws sagemaker-mlflow create-presigned-mlflow-app-url \
    --arn <app-arn> \
    --region <region-name> \
    --output text
```

Example output

```
https://app-LGZEOZ2UY4NZ.mlflow.sagemaker.us-east-1.app.aws/auth?authToken=eyJhbGciOiJIUzI1NiJ9.eyJhdXRoVG9rZW5JZCI6IkxETVBPUyIsImZhc0NyZWRlbnRpYWxzIjoiQWdWNGhDM1VvZ0VYSUVsT2lZOVlLNmxjRHVxWm1BMnNhZ3JDWEd3aFpOSmdXbzBBWHdBQkFCVmhkM010WTNKNWNIUnZMWEIxWW14cFl5MXJaWGtBUkVFd09IQmtVbU5IUzJJMU1VTnVaVEl3UVhkUE5uVm9Ra2xHTkZsNVRqTTNXRVJuTTNsalowNHhRVFZvZERneVdrMWlkRlZXVWpGTWMyWlRUV1JQWmpSS2R6MDlBQUVBQjJGM2N5MXJiWE1BUzJGeWJqcGhkM002YTIxek9uVnpMV1ZoYzNRdE1Ub3pNVFF4TkRZek1EWTBPREk2YTJWNUx6Y3dOMkpoTmpjeExUUXpZamd0TkRFeU5DMWhaVFUzTFRrMFlqTXdZbUptT1RJNU13QzRBUUlCQUhnQjRVMDBTK3ErVE51d1gydlFlaGtxQnVneWQ3YnNrb0pWdWQ2NmZjVENVd0ZzRTV4VHRGVllHUXdxUWZoeXE2RkJBQUFBZmpCOEJna3Foa2lHOXcwQkJ3YWdiekJ0QWdFQU1HZ0dDU3FHU0liM0RRRUhBVEFlQmdsZ2hrZ0JaUU1FQVM0d0VRUU1yOEh4MXhwczFBbmEzL1JKQWdFUWdEdTI0K1M5c2VOUUNFV0hJRXJwdmYxa25MZTJteitlT29pTEZYNTJaeHZsY3AyZHFQL09tY3RJajFqTWFuRjMxZkJyY004MmpTWFVmUHRhTWdJQUFCQUE3L1pGT05DRi8rWnVPOVlCVnhoaVppSEFSLy8zR1I0TmR3QWVxcDdneHNkd2lwTDJsVWdhU3ZGNVRCbW9uMUJnLy8vLy93QUFBQUVBQUFBQUFBQUFBQUFBQUFFQUFBUTdBMHN6dUhGbEs1NHdZbmZmWEFlYkhlNmN5OWpYOGV3T2x1NWhzUWhGWFllRXNVaENaQlBXdlQrVWp5WFY0ZHZRNE8xVDJmNGdTRUFOMmtGSUx0YitQa0tmM0ZUQkJxUFNUQWZ3S1oyeHN6a1lDZXdwRlNpalFVTGtxemhXbXBVcmVDakJCOHNGT3hQL2hjK0JQalY3bUhOL29qcnVOejFhUHhjNSt6bHFuak9CMHljYy8zL2JuSHA3NVFjRE8xd2NMbFJBdU5KZ2RMNUJMOWw1YVVPM0FFMlhBYVF3YWY1bkpwTmZidHowWUtGaWZHMm94SDJSNUxWSjNkbG40aGVRbVk4OTZhdXdsellQV253N2lTTDkvTWNidDAzdVZGN0JpUnRwYmZMN09JQm8wZlpYSS9wK1pUNWVUS2wzM2tQajBIU3F6NisvamliY0FXMWV4VTE4N1QwNHpicTNRcFhYMkhqcDEvQnFnMVdabkZoaEwrekZIaUV0Qjd4U1RaZkZsS2xRUUhNK0ZkTDNkOHIyRWhCMjFya2FBUElIQVBFUk5Pd1lnNmFzM2pVaFRwZWtuZVhxSDl3QzAyWU15R0djaTVzUEx6ejh3ZTExZVduanVTai9DZVJpZFQ1akNRcjdGMUdKWjBVREZFbnpNakFuL3Y3ajA5c2FMczZnemlCc2FLQXZZOWpib0JEYkdKdGZ0N2JjVjl4eUp4amptaW56TGtoVG5pV2dxV3g5MFZPUHlWNWpGZVk1QTFrMmw3bDArUjZRTFNleHg4d1FrK0FqVGJuLzFsczNHUTBndUtESmZKTWVGUVczVEVrdkp5VlpjOC9xUlpIODhybEpKOW1FSVdOd1BMU21yY1l6TmZwVTlVOGdoUDBPUWZvQ3FvcW1WaUhEYldaT294bGpmb295cS8yTDFKNGM3NTJUaVpFd1hnaG9haFBYdGFjRnA2NTVUYjY5eGxTN25FaXZjTTlzUjdTT3RE
MEMrVHIyd0cxNEJ3Zm9NZTdKOFhQeVRtcmQ0QmNKOEdOYnVZTHNRNU9DcFlsV3pVNCtEcStEWUI4WHk1UWFzaDF0dzJ6dGVjVVQyc0hsZmwzUVlrQ0d3Z1hWam5Ia2hKVitFRDIrR3Fpc3BkYjRSTC83RytCRzRHTWNaUE02Q3VtTFJkMnZLbnozN3dUWkxwNzdZNTdMQlJySm9Tak9idWdNUWdhOElLNnpWL2VtcFlSbXJsVjZ5VjZ6S1h5aXFKWFk3TTBXd3dSRzd5Q0xYUFRtTGt3WGE5cXF4NkcvZDY1RS83V3RWMVUrNFIxMlZIUmVUMVJmeWw2SnBmL2FXWFVCbFQ2ampUR0M5TU1uTk5OVTQwZHRCUTArZ001S1d2WGhvMmdmbnhVcU1OdnFHblRFTWdZMG5ZL1FaM0RWNFozWUNqdkFOVWVsS1NCdkxFbnY4SEx0WU9uajIrTkRValZOV1h5T1c4WFowMFFWeXU0ZU5LaUpLQ1hJbnI1N3RrWHE3WXl3b0lZV0hKeHQwWis2MFNQMjBZZktYYlhHK1luZ3F6NjFqMkhIM1RQUmt6dW5rMkxLbzFnK1ZDZnhVWFByeFFmNUVyTm9aT2RFUHhjaklKZ1FxRzJ2eWJjbFRNZ0M5ZXc1QURVcE9KL1RrNCt2dkhJMDNjM1g0UXcrT3lmZHFUUzJWb3N4Y0hJdG5iSkZmdXliZi9lRlZWRlM2L3lURkRRckhtQ1RZYlB3VXlRNWZpR20zWkRhNDBQUTY1RGJSKzZSbzl0S3c0eWFlaXdDVzYwZzFiNkNjNUhnQm5GclMyYytFbkNEUFcrVXRXTEF1azlISXZ6QnR3MytuMjdRb1cvSWZmamJucjVCSXk3MDZRTVR4SzhuMHQ3WUZuMTBGTjVEWHZiZzBvTnZuUFFVYld1TjhFbE11NUdpenZxamJmeVZRWXdBSERCcDkzTENsUUJuTUdVQ01GWkNHUGRPazJ2ZzJoUmtxcWQ3SmtDaEpiTmszSVlyanBPL0h2Z2NZQ2RjK2daM3lGRjMyTllBMVRYN1FXUkJYZ0l4QU5xU21ZTHMyeU9uekRFenBtMUtnL0tvYmNqRTJvSDJkZHcxNnFqT0hRSkhkVWRhVzlZL0NQYTRTbWxpN2pPbGdRPT0iLCJjaXBoZXJUZXh0IjoiQVFJQkFIZ0I0VTAwUytxK1ROdXdYMnZRZWhrcUJ1Z3lkN2Jza29KVnVkNjZmY1RDVXdHeDExRlBFUG5xU1ZFbE5YVUNrQnRBQUFBQW9qQ0Jud1lKS29aSWh2Y05BUWNHb0lHUk1JR09BZ0VBTUlHSUJna3Foa2lHOXcwQkJ3RXdIZ1lKWUlaSUFXVURCQUV1TUJFRURHemdQNnJFSWNEb2dWSTl1d0lCRUlCYitXekkvbVpuZkdkTnNYV0VCM3Y4NDF1SVJUNjBLcmt2OTY2Q1JCYmdsdXo1N1lMTnZUTkk4MEdkVXdpYVA5NlZwK0VhL3R6aGgxbTl5dzhjcWdCYU1pOVQrTVQxdzdmZW5xaXFpUnRRMmhvN0tlS2NkMmNmK3YvOHVnPT0iLCJzdWIiOiJhcm46YXdzOnNhZ2VtYWtlcjp1cy1lYXN0LTE6MDYwNzk1OTE1MzUzOm1sZmxvdy1hcHAvYXBwLUxHWkVPWjJVWTROWiIsImlhdCI6MTc2NDM2NDYxNSwiZXhwIjoxNzY0MzY0OTE1fQ.HNvZOfqft4m7pUS52MlDwoi1BA8Vsj3cOfa_CvlT4uw
```

***Open the presigned URL and view the app***

Open the presigned URL from the previous step in your browser. Generate a new URL if the previous one has expired.

View 

![Example Amazon Nova image.](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/screenshot-nova-model-1.png)


***Pass to recipe under run block of your SageMaker HyperPod recipe***

Recipe

```
run:
    mlflow_tracking_uri: arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ
```

View

![Example Amazon Nova image.](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/screenshot-nova-model-2.png)


# Iterative Training
Iterative Training

## Overview


Iterative training is the process of repeatedly fine-tuning a model through multiple training cycles, potentially across different training methods: train, evaluate, analyze errors, adjust data, objectives, or hyperparameters, and start each new round from the previous checkpoint. This approach allows you to systematically target model failure modes, incorporate curated examples that address specific weaknesses, and adapt to changing requirements over time.

**Benefits over single-pass training:**
+ **Targeted improvement**: Address specific failure patterns discovered through evaluation
+ **Adaptive refinement**: Respond to distribution shifts or evolving product requirements
+ **Risk mitigation**: Validate improvements incrementally rather than committing to a single long training run
+ **Data efficiency**: Focus data collection efforts on areas where the model underperforms
+ **Curriculum training**: Multiple rounds of training with increasingly higher-quality data

## How it works


### Checkpoint location and access


After each training job completes, a manifest file is generated in the output location specified by the `output_path` parameter in your training configuration.

**To access your checkpoint**
+ Navigate to your specified `output_path` in S3
+ Download and extract the `output.tar.gz` file
+ Open the `manifest.json` file inside
+ Locate the `checkpoint_s3_bucket` parameter, which contains the S3 URI of your trained model

**Example manifest.json structure**

```
{  
  "checkpoint_s3_bucket": "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>/stepID",  
  ...  
}
```
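
The four steps above can be automated with a short helper. This sketch assumes `manifest.json` sits inside the downloaded `output.tar.gz`, as described; the function name is illustrative:

```python
import json
import tarfile

def checkpoint_from_output(tar_path: str) -> str:
    """Return the checkpoint_s3_bucket URI from manifest.json in output.tar.gz."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("manifest.json"):
                manifest = json.load(tar.extractfile(member))
                return manifest["checkpoint_s3_bucket"]
    raise FileNotFoundError("manifest.json not found in archive")
```

The returned URI can be pasted directly into `model_name_or_path` for the next iteration.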

### Understanding escrow buckets


Since Amazon Nova weights are proprietary, trained model checkpoints are stored in **escrow S3 buckets** within AWS-managed accounts rather than being copied to your account. These escrow buckets:
+ Contain your customized model weights securely
+ Can be referenced by other AWS services (Inference, Evaluation, and subsequent training jobs)
+ Are accessible only to your AWS account via IAM permissions
+ Incur standard S3 storage charges in your account (see Cost considerations)

You can use the escrow bucket path as the `model_name_or_path` in your next training run to continue iterative training.

### Using checkpoints for iterative training


Configure your next training job to use the previous checkpoint as the base model:

```
run:  
  name: "my-iterative-training-job"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<previous-job-name>"  
  data_s3_path: s3://<bucket>/<data-file>.jsonl  
  replicas: 4
```

## When to use iterative training


### Ideal use cases


Use iterative training when you have:
+ **Feedback loops** – Ability to collect real-world failure cases and systematically address them
+ **Dynamic environments** – Evolving documentation, APIs, or support topics requiring periodic model updates
+ **Robust evaluation** – Strong benchmarks and evaluation frameworks (see examples below) to measure improvements confidently
+ **ML operations capability** – Resources to manage multiple training cycles and version control

**Examples of robust evaluation frameworks**
+ Automated benchmark suites with pass/fail thresholds
+ Human evaluation protocols with inter-rater reliability metrics
+ Red-team testing scenarios covering edge cases and adversarial inputs
+ A/B testing infrastructure to measure production impact

### Common patterns


**SFT → RFT Pipeline**: A frequently used iterative pattern involves:
+ **SFT first** – Teach the model how to solve problems through demonstration examples
+ **RFT second** – Optimize performance across the broader problem space using reward signals

This sequence is essential when a model initially performs poorly: applying RFT to a near-zero-accuracy model will not improve performance unless basic problem-solving capabilities are first established through SFT.

### When not to use iterative training


Avoid iterative training for:
+ **Stable, well-defined tasks** – Stationary data with consistent requirements already achieving near-ceiling performance
+ **Simple classification problems** – Narrow tasks where single-pass training suffices
+ **Resource constraints** – Lacking dedicated ML operations capabilities to manage multiple training cycles
+ **Marginal gains** – When overhead doesn't justify minimal performance improvements

## Example workflow: SFT → RFT


This example demonstrates a common iterative training pattern for reasoning models.

### Step 1: Initial SFT training


Configure and launch your SFT training job with your dataset:

```
run:  
  name: "initial-sft-training"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "nova-lite-2/prod"  
  data_s3_path: s3://<bucket>/sft-training-data.jsonl  
  validation_data_s3_path: s3://<bucket>/sft-validation-data.jsonl
```

**Rationale**: SFT provides additional demonstrations that shape model outputs into your desired format and voice, establishing foundational capabilities.

**After training completes**
+ Note the `output_path` configured in your training job
+ Download `output.tar.gz` from that location
+ Extract and locate `manifest.json`
+ Copy the `checkpoint_s3_bucket` value

### Step 2: RFT training on SFT checkpoint


Create a new RFT training job using the SFT checkpoint:

```
run:  
  name: "rft-on-sft-checkpoint"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<initial-sft-training>"  
  data_s3_path: s3://<bucket>/rft-training-data.jsonl  
  reward_lambda_arn: <your-reward-function-arn>
```

**Rationale**: RFT training builds on the SFT foundation, allowing the model to develop more complex reasoning patterns optimized by your reward function.

### Step 3: Evaluate and iterate


Run evaluation on the RFT checkpoint to assess performance:

```
run:  
  name: "evaluate-rft-checkpoint"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<rft-on-sft-checkpoint>"  
  data_s3_path: s3://<bucket>/evaluation-data.jsonl
```

If target metrics are not satisfied, continue iterating with adjusted data or hyperparameters.

**Important**  
The training technique (LoRA vs. Full-Rank) must remain consistent across all iterations:
+ If you use SFT with **LoRA**, you must use RFT with **LoRA**
+ If you use SFT with **Full-Rank**, you must use RFT with **Full-Rank**
+ You cannot switch between LoRA and Full-Rank mid-pipeline

**Important**  
If a KMS key is used for encryption in the Amazon-owned output S3 bucket, that same KMS key must be used for all future iterations.

## Monitoring progress across iterations


You can track metrics via MLflow by [setting up MLflow](nova-model-monitor.md#nova-customization-mlflow-setup) for your jobs.

### Create an MLflow app


**Using Studio UI**: If you create a training job through the Studio UI, a default MLflow app is created automatically and selected by default under Advanced Options.

**Using CLI**: If you use the CLI, you must create an MLflow app and pass it as an input to the training job API request.

```
mlflow_app_name="<enter your MLflow app name>"  
role_arn="<enter your role ARN>"   
bucket_name="<enter your bucket name>"   
region="<enter your region>"  
  
mlflow_app_arn=$(aws sagemaker create-mlflow-app \  
  --name $mlflow_app_name \  
  --artifact-store-uri "s3://$bucket_name" \  
  --role-arn $role_arn \  
  --region $region)
```

### Access the MLflow app


**Using CLI**: Create a presigned URL to access the MLflow app UI:

```
aws sagemaker create-presigned-mlflow-app-url \  
  --arn $mlflow_app_arn \  
  --region $region \  
  --output text
```

**Using Studio UI**: The Studio UI displays key metrics stored in MLflow and provides a link to the MLflow app UI.

### Key metrics to track


Monitor these metrics across iterations to assess improvement and track job progress:

**For SFT**
+ Training loss curves
+ Number of samples consumed and time to process samples
+ Performance accuracy on held-out test sets
+ Format compliance (for example, valid JSON output rate)
+ Perplexity on domain-specific evaluation data

**For RFT**
+ Average reward scores over training
+ Reward distribution (percentage of high-reward responses)
+ Validation reward trends (watch for overfitting)
+ Task-specific success rates (for example, code execution pass rate, math problem accuracy)

**General**
+ Benchmark performance deltas between iterations
+ Human evaluation scores on representative samples
+ Production metrics (if deploying iteratively)

### Determining when to stop


Stop iterating when:
+ **Performance plateaus** – Additional training no longer meaningfully improves target metrics
+ **Technique switching helps** – If one technique plateaus, try switching (for example, SFT → RFT → SFT) to break through performance ceilings
+ **Target metrics achieved** – Your success criteria are met
+ **Regression detected** – New iterations degrade performance (see rollback procedures below)

For detailed evaluation procedures, refer to the **Evaluation** section.

## Best practices


### Start small and scale gradually


Begin with minimal datasets and single training epochs to validate your approach before scaling up. This builds confidence and helps identify issues early.

### Establish clear success metrics


Define quantitative and qualitative indicators before starting:

**Example success metrics by use case**
+ **Question answering** – Exact match accuracy, F1 score, human preference ratings
+ **Code generation** – Unit test pass rate, compilation success, execution time
+ **Reasoning tasks** – Step accuracy, final answer correctness, reward scores
+ **Content generation** – Coherence scores, factual accuracy, style adherence

### Implement automated evaluation


Set up automated evaluation pipelines to track performance after each round, enabling rapid iteration and objective comparison.

### Maintain rigorous version control


Document for each iteration:
+ Dataset versions and modifications
+ Model checkpoint locations
+ Hyperparameter changes
+ Performance metrics and deltas
+ Qualitative observations

This builds institutional knowledge and enables debugging.

### Focus on data quality over quantity


Analyze failure cases from previous rounds and add targeted, high-quality examples rather than simply increasing dataset size.

### Plan iteration budget


Plan for **3-5 iterations** as a typical range:
+ **1-2 iterations** – Often sufficient for simple improvements or final polishing
+ **3-5 iterations** – Appropriate for complex tasks requiring multiple refinement cycles
+ **5+ iterations** – May indicate diminishing returns or the need for a different approach

Adjust based on computational budget and performance improvement rates.

### Implement rollback capabilities


If an iteration introduces regressions:
+ **Identify the regression** – Compare evaluation metrics across checkpoints
+ **Return to previous checkpoint** – Use the earlier checkpoint's S3 path as your `model_name_or_path`
+ **Adjust training approach** – Modify data, hyperparameters, or technique before retrying
+ **Document the failure** – Record what caused regression to avoid repeating

**Example rollback**

```
run:  
  name: "rollback-to-iteration-2"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  # Use iteration 2 checkpoint instead of failed iteration 3  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<iteration-2-job-name>"
```

## Cost considerations


### Checkpoint storage

+ **Location** – Checkpoints stored in escrow buckets incur standard S3 storage charges billed to your AWS account
+ **Retention** – Checkpoints are retained indefinitely unless explicitly deleted
+ **Management** – Implement lifecycle policies to archive or delete old checkpoints you no longer need

**Cost optimization tips**
+ Delete intermediate checkpoints after validating newer iterations
+ Archive checkpoints to S3 Glacier for long-term retention at lower cost
+ Set retention policies based on your compliance and experimentation needs
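A lifecycle policy covering both tips above can be sketched with boto3. The prefix, day counts, and the assumption that your role may modify the escrow bucket's lifecycle configuration are all placeholders to adapt.

```
# Sketch: build lifecycle rules that first archive checkpoints to Glacier,
# then delete them. Day counts and the prefix are illustrative.

def checkpoint_lifecycle_rules(prefix: str, glacier_after_days: int = 30,
                               expire_after_days: int = 365) -> list:
    """Build S3 lifecycle rules that archive, then expire, old checkpoints."""
    return [{
        "ID": "archive-and-expire-checkpoints",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }]

rules = checkpoint_lifecycle_rules(prefix="checkpoints/")
print(rules[0]["ID"])

# Apply to your bucket (requires boto3 and permission to modify the bucket):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="customer-escrow-<account-number>-smtj-<unique-identifier>",
#     LifecycleConfiguration={"Rules": rules},
# )
```

Set the expiration window longer than your compliance retention period before enabling a rule like this.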

## Limitations


### Model family consistency


When iteratively training, you must use the **same model type** throughout all iterations.

**Initial training**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "nova-lite-2/prod"
```

**Subsequent iterations must use the same `model_type`**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  # Must match original  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

### Training technique consistency


The training technique must remain consistent across iterations:
+ **LoRA-trained models** can only be iteratively trained with LoRA
+ **Full-Rank-trained models** can only be iteratively trained with Full-Rank

**How LoRA adapters work in iterative training**
+ Each LoRA training iteration produces new adapter weights
+ New adapters replace (not stack with) previous adapters
+ The base model remains frozen; only adapters are updated

### Technique compatibility matrix



| Initial training | Can iterate with | 
| --- | --- | 
| SFT (Full-Rank) | SFT (Full-Rank), RFT (Full-Rank) | 
| SFT (LoRA) | SFT (LoRA), RFT (LoRA) | 
| RFT (Full-Rank) | RFT (Full-Rank) | 
| RFT (LoRA) | RFT (LoRA) | 

**Verifying compatibility before starting a job**
+ Check your previous training recipe to identify the model type and training technique (LoRA vs. Full-Rank)
+ Ensure your new recipe matches both the model type and technique
+ Review the manifest.json to confirm the checkpoint path is correct
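A small helper for the manifest check might look like the following. The manifest's exact S3 location and schema depend on your job's output configuration, so the path shown is a placeholder.

```
# Sketch: split the checkpoint URI so you can fetch and inspect manifest.json
# before launching the next iteration. The URI below is a placeholder.

def parse_s3_uri(uri: str) -> tuple:
    """Split an s3:// URI into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

bucket, key = parse_s3_uri(
    "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>/manifest.json"
)
print(bucket, key)

# Download and inspect (requires boto3 and read access to the escrow bucket):
# import boto3, json
# body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
# print(json.dumps(json.loads(body), indent=2))
```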

## Troubleshooting


### Error: "Incompatible model training techniques detected"


**Cause**: The training technique (LoRA vs. Full-Rank) doesn't match the checkpoint's technique.

**Resolution**: Ensure your recipe uses the same training technique as the original model:
+ If the checkpoint was trained with LoRA, use LoRA in your new recipe
+ If the checkpoint was trained with Full-Rank, use Full-Rank in your new recipe

### Error: "Base model for the job extracted from model\$1name\$1or\$1path does not match model\$1type"


**Cause**: The model type specified in `model_type` doesn't match the actual model in the checkpoint.

**Resolution**: Verify that:
+ The `model_type` in your recipe matches the original model type
+ The checkpoint S3 path in `model_name_or_path` is correct
+ You're using the path from the correct manifest.json file

**Example of correct configuration**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  # Must match checkpoint's model  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

### Error: "Model configuration not found"


**Cause**: The S3 path in `model_name_or_path` is invalid or inaccessible.

**Resolution**:
+ Verify the S3 path is correctly copied from the manifest.json file
+ Ensure your IAM role has permissions to access the escrow bucket
+ Confirm the previous training job completed successfully
+ Check for typos in the path

### Performance regression after iteration


**Symptoms**: Evaluation metrics decline after a new training iteration.

**Resolution**:
+ **Rollback** – Use the previous checkpoint as your base model
+ **Analyze** – Review training logs and data quality for the failed iteration
+ **Adjust** – Modify hyperparameters (reduce learning rate), improve data quality, or reduce training epochs
+ **Retry** – Launch a new iteration with adjustments

# SageMaker Inference
SageMaker Inference

Custom Amazon Nova models are available on SageMaker inference. With Amazon Nova on SageMaker, you can get predictions, or inferences, from your trained custom Amazon Nova models. SageMaker provides a broad selection of ML infrastructure and model deployment options to help meet your ML inference needs. With SageMaker inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden.

SageMaker provides various inference options, such as real-time endpoints for low-latency inference and asynchronous endpoints for batches of requests. By choosing the inference option appropriate for your use case, you can deploy and serve your models efficiently. For more information on SageMaker inference, see [Deploy models for inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html).

**Important**  
Only full-rank custom models and LoRA-merged models are supported on SageMaker inference. For unmerged LoRA models and base models, use Amazon Bedrock.

## Features


The following features are available for Amazon Nova models on SageMaker inference:

**Model Capabilities**
+ Text generation

**Deployment and Scaling**
+ Real-time endpoints with custom instance selection
+ Auto Scaling – Automatically adjust capacity based on traffic patterns to optimize costs and GPU utilization. For more information, see [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com//sagemaker/latest/dg/endpoint-auto-scaling.html).
+ Streaming API support for real-time token generation
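For example, attaching a target-tracking policy to a deployed variant uses Application Auto Scaling. The endpoint name, capacity bounds, and invocation target below are placeholders to tune for your workload.

```
# Sketch: build the parameters for registering a variant with Application
# Auto Scaling and attaching a target-tracking policy on invocations per
# instance. All numeric values are illustrative.

def scaling_config(endpoint_name: str, variant: str = "primary",
                   min_capacity: int = 1, max_capacity: int = 4,
                   target_invocations: float = 10.0) -> dict:
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    return {
        "target": {
            "ServiceNamespace": "sagemaker",
            "ResourceId": resource_id,
            "ScalableDimension": dimension,
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
        "policy": {
            "PolicyName": f"{endpoint_name}-target-tracking",
            "ServiceNamespace": "sagemaker",
            "ResourceId": resource_id,
            "ScalableDimension": dimension,
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": target_invocations,
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
                },
            },
        },
    }

cfg = scaling_config("my-nova-deployment-Endpoint")  # placeholder endpoint name

# Apply (requires boto3 and an InService endpoint):
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**cfg["target"])
# aas.put_scaling_policy(**cfg["policy"])
```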

**Monitoring and Optimization**
+ Amazon CloudWatch integration for monitoring and alerts
+ Availability Zone-aware latency optimization through VPC configuration

**Development Tools**
+ AWS CLI support – For more information, see [AWS CLI Command Reference for SageMaker](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/).
+ Notebook integration via SDK support

## Supported models and instances


When creating your SageMaker inference endpoints, you can set two environment variables to configure your deployment: `CONTEXT_LENGTH` and `MAX_CONCURRENCY`.
+ `CONTEXT_LENGTH` – Maximum total token length (input + output) per request
+ `MAX_CONCURRENCY` – Maximum number of concurrent requests the endpoint will serve

The following table lists the supported Amazon Nova models, instance types, and supported configurations. The `MAX_CONCURRENCY` values represent the maximum supported concurrency for each `CONTEXT_LENGTH` setting:


****  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-model-sagemaker-inference.html)

**Note**  
For instances where FP8 quantization is required, it will be enabled by default.  
The `MAX_CONCURRENCY` values shown are upper bounds for each `CONTEXT_LENGTH` setting. You can use lower context lengths with the same concurrency, but exceeding these values will cause SageMaker endpoint creation to fail.  
For example, on Amazon Nova Micro with a ml.g5.12xlarge:  
`CONTEXT_LENGTH=2000`, `MAX_CONCURRENCY=12` → Valid
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=12` → Rejected (concurrency limit is 6 at context length 8000)
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=4` → Valid
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=6` → Valid
`CONTEXT_LENGTH=10000` → Rejected (max context length is 8000 on this instance)
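You can encode such checks in a small helper before creating an endpoint. The limits table below contains only the Amazon Nova Micro on ml.g5.12xlarge values from the example above; extend it with the full table from the documentation for your model and instance type.

```
# Sketch: validate a (CONTEXT_LENGTH, MAX_CONCURRENCY) pair against the
# documented limits before endpoint creation. Only the example values from
# the note above are filled in.

LIMITS = {
    # (model, instance) -> {context_length: max concurrency}
    ("micro", "ml.g5.12xlarge"): {2000: 12, 8000: 6},
}

def validate(model: str, instance: str, context_length: int, concurrency: int) -> bool:
    table = LIMITS.get((model, instance))
    if table is None:
        raise KeyError(f"no limits known for {model} on {instance}")
    # Smallest documented context length that still covers the request
    eligible = [c for c in sorted(table) if context_length <= c]
    if not eligible:
        return False  # context length exceeds the instance maximum
    return concurrency <= table[eligible[0]]

print(validate("micro", "ml.g5.12xlarge", 2000, 12))   # True
print(validate("micro", "ml.g5.12xlarge", 8000, 12))   # False
print(validate("micro", "ml.g5.12xlarge", 10000, 1))   # False
```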

## Supported AWS Regions


The following table lists the AWS Regions where Amazon Nova models are available on SageMaker inference:


****  

| Region Name | Region Code | Availability | 
| --- | --- | --- | 
| US East (N. Virginia) | us-east-1 | Available | 
| US West (Oregon) | us-west-2 | Available | 

## Supported Container Images


The following table lists the container image URIs for Amazon Nova models on SageMaker inference by region.


****  

| Region | Container Image URIs | 
| --- | --- | 
| us-east-1 | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest | 
| us-west-2 | 176779409107.dkr.ecr.us-west-2.amazonaws.com/nova-inference-repo:SM-Inference-latest | 

## Best Practices


For best practices on deploying and managing models on SageMaker, see [Best Practices for SageMaker](https://docs.aws.amazon.com//sagemaker/latest/dg/best-practices.html).

## Support


For issues and support with Amazon Nova models on SageMaker inference, contact AWS Support through the Console or your AWS account manager.

**Topics**
+ [Features](#nova-sagemaker-inference-features)
+ [Supported models and instances](#nova-sagemaker-inference-supported)
+ [Supported AWS Regions](#nova-sagemaker-inference-regions)
+ [Supported Container Images](#nova-sagemaker-inference-container-images)
+ [Best Practices](#nova-sagemaker-inference-best-practices)
+ [Support](#nova-sagemaker-inference-support)
+ [Getting Started](nova-sagemaker-inference-getting-started.md)
+ [API Reference](nova-sagemaker-inference-api-reference.md)
+ [Evaluate Models Hosted on SageMaker Inference](nova-eval-on-sagemaker-inference.md)
+ [Deployment of Amazon Nova Forge Models in Amazon SageMaker Inference abuse detection](nova-sagemaker-inference-abuse-detection.md)

# Getting Started
Getting started

This guide shows you how to deploy customized Amazon Nova models on SageMaker real-time endpoints, configure inference parameters, and invoke your models for testing.

## Prerequisites


The following are prerequisites to deploy Amazon Nova models on SageMaker inference:
+ Create an AWS account - If you don't have one already, see [Creating an AWS account](https://docs.aws.amazon.com//sagemaker/latest/dg/gs-set-up.html#sign-up-for-aws).
+ Required IAM permissions - Ensure your IAM user or role has the following managed policies attached:
  + `AmazonSageMakerFullAccess`
  + `AmazonS3FullAccess`
+ Required SDKs/CLI versions - The following SDK versions have been tested and validated with Amazon Nova models on SageMaker inference:
  + SageMaker Python SDK v3.0.0+ (`sagemaker>=3.0.0`) for the resource-based API approach
  + Boto3 version 1.35.0+ (`boto3>=1.35.0`) for direct API calls. The examples in this guide use this approach.
+ Service quota increase - Request an Amazon SageMaker service quota increase for the ML instance type you plan to use for your SageMaker Inference endpoint (for example, `ml.p5.48xlarge for endpoint usage`). For a list of supported instance types, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported). To request an increase, see [Requesting a quota increase](https://docs.aws.amazon.com//servicequotas/latest/userguide/request-quota-increase.html). For information about SageMaker instance quotas, see [SageMaker endpoints and quotas](https://docs.aws.amazon.com//general/latest/gr/sagemaker.html).

## Step 1: Configure AWS credentials


Configure your AWS credentials using one of the following methods:

**Option 1: AWS CLI (Recommended)**

```
aws configure
```

Enter your AWS access key, secret key, and default region when prompted.

**Option 2: AWS credentials file**

Create or edit `~/.aws/credentials`:

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

**Option 3: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
```

**Note**  
For more information about AWS credentials, see [Configuration and credential file settings](https://docs.aws.amazon.com//cli/latest/userguide/cli-configure-files.html).

**Initialize AWS clients**

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials:

```
import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")
```

If the authentication is successful, you should see output confirming your AWS account ID.

## Step 2: Create a SageMaker execution role


A SageMaker execution role is an IAM role that grants SageMaker permissions to access AWS resources on your behalf, such as Amazon S3 buckets for model artifacts and CloudWatch for logging.

**Creating the execution role**

**Note**  
Creating IAM roles requires `iam:CreateRole` and `iam:AttachRolePolicy` permissions. Ensure your IAM user or role has these permissions before proceeding.

The following code creates an IAM role with the necessary permissions for deploying Amazon Nova customized models:

```
import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
```

**Using an existing execution role (Optional)**

If you already have a SageMaker execution role, you can use it instead:

```
# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"
```

To find existing SageMaker roles in your account:

```
iam = boto3.client('iam', region_name=REGION)
response = iam.list_roles()
sagemaker_roles = [role for role in response['Roles'] if 'SageMaker' in role['RoleName']]
for role in sagemaker_roles:
    print(f"{role['RoleName']}: {role['Arn']}")
```

**Important**  
The execution role must have a trust relationship with `sagemaker.amazonaws.com` and permissions to access Amazon S3 and SageMaker resources.

For more information about SageMaker execution roles, see [SageMaker Roles](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-roles.html).

## Step 3: Configure model parameters


Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For a list of supported instance types and the supported `CONTEXT_LENGTH` and `MAX_CONCURRENCY` values for each, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported).

**Required parameters**
+ `IMAGE`: The Docker container image URI for Amazon Nova inference container. This will be provided by AWS.
+ `CONTEXT_LENGTH`: Model context length.
+ `MAX_CONCURRENCY`: Maximum number of sequences per iteration; sets the limit on how many individual user requests (prompts) can be processed concurrently within a single batch on the GPU. Range: integer greater than 0.

**Optional generation parameters**
+ `DEFAULT_TEMPERATURE`: Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, higher = more random).
+ `DEFAULT_TOP_P`: Nucleus sampling threshold for token selection. Range: 1e-10 to 1.0.
+ `DEFAULT_TOP_K`: Limits token selection to top K most likely tokens. Range: integer -1 or greater (-1 = no limit).
+ `DEFAULT_MAX_NEW_TOKENS`: Maximum number of tokens to generate in the response (that is, maximum output tokens). Range: integer 1 or greater.
+ `DEFAULT_LOGPROBS`: Number of log probabilities to return per token. Range: integer 1 to 20.

**Configure your deployment**

```
# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Model Parameters
CONTEXT_LENGTH = "16000"       # Maximum total context length
MAX_CONCURRENCY = "2"          # Maximum concurrent sequences

# Optional: Default generation parameters (uncomment to use)
DEFAULT_TEMPERATURE = "0.0"   # Deterministic output
DEFAULT_TOP_P = "1.0"         # Consider all tokens
# DEFAULT_TOP_K = "50"        # Uncomment to limit to top 50 tokens
# DEFAULT_MAX_NEW_TOKENS = "2048"  # Uncomment to set max output tokens
# DEFAULT_LOGPROBS = "1"      # Uncomment to enable log probabilities

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
}

# Add optional parameters if defined
if 'DEFAULT_TEMPERATURE' in globals():
    environment['DEFAULT_TEMPERATURE'] = DEFAULT_TEMPERATURE
if 'DEFAULT_TOP_P' in globals():
    environment['DEFAULT_TOP_P'] = DEFAULT_TOP_P
if 'DEFAULT_TOP_K' in globals():
    environment['DEFAULT_TOP_K'] = DEFAULT_TOP_K
if 'DEFAULT_MAX_NEW_TOKENS' in globals():
    environment['DEFAULT_MAX_NEW_TOKENS'] = DEFAULT_MAX_NEW_TOKENS
if 'DEFAULT_LOGPROBS' in globals():
    environment['DEFAULT_LOGPROBS'] = DEFAULT_LOGPROBS

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")
```

**Configure deployment-specific parameters**

Now configure the specific parameters for your Amazon Nova model deployment, including model artifacts location and instance type selection.

**Set deployment identifier**

```
# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"
```

**Specify model artifacts location**

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This should be the output location from your model training or fine-tuning job.

```
# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
```

**Select model variant and instance type**

```
# Configure model variant and instance type
TESTCASE = {
    "model": "lite2",              # Options: micro, lite, lite2
    "instance": "ml.p5.48xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")
```

**Naming conventions**

The code automatically generates consistent names for AWS resources:
+ Model Name: `{JOB_NAME}-{model}-{instance-type}`
+ Endpoint Config: `{MODEL_NAME}-Config`
+ Endpoint Name: `{MODEL_NAME}-Endpoint`

## Step 4: Create SageMaker model and endpoint configuration


In this step, you'll create two essential resources: a SageMaker model object that references your Amazon Nova model artifacts, and an endpoint configuration that defines how the model will be deployed.

**SageMaker Model**: A model object that packages the inference container image, model artifacts location, and environment configuration. This is a reusable resource that can be deployed to multiple endpoints.

**Endpoint Configuration**: Defines the infrastructure settings for deployment, including instance type, instance count, and model variants. This allows you to manage deployment settings separately from the model itself.

**Create the SageMaker model**

The following code creates a SageMaker model that references your Amazon Nova model artifacts:

```
try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")
```

Key parameters:
+ `ModelName`: Unique identifier for your model
+ `Image`: Docker container image URI for Amazon Nova inference
+ `ModelDataSource`: Amazon S3 location of your model artifacts
+ `Environment`: Environment variables configured in Step 3
+ `ExecutionRoleArn`: IAM role from Step 2
+ `EnableNetworkIsolation`: Set to True for enhanced security (prevents container from making outbound network calls)

**Create the endpoint configuration**

Next, create an endpoint configuration that defines your deployment infrastructure:

```
# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")
```

Key parameters:
+ `VariantName`: Identifier for this model variant (use 'primary' for single-model deployments)
+ `ModelName`: References the model created above
+ `InitialInstanceCount`: Number of instances to deploy (start with 1, scale later if needed)
+ `InstanceType`: ML instance type selected in Step 3

**Verify resource creation**

You can verify that your resources were created successfully:

```
# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")
```

## Step 5: Deploy the endpoint


The next step is to deploy your Amazon Nova model by creating a SageMaker real-time endpoint. This endpoint will host your model and provide a secure HTTPS endpoint for making inference requests.

Endpoint creation typically takes 15-30 minutes as AWS provisions the infrastructure, downloads your model artifacts, and initializes the inference container.

**Create the endpoint**

```
import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")
```

**Monitor endpoint creation**

The following code polls the endpoint status until deployment is complete:

```
# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds
```

**Verify endpoint is ready**

Once the endpoint is InService, you can verify its configuration:

```
# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break
```

**Troubleshooting endpoint creation failures**

Common failure reasons:
+ **Insufficient capacity**: The requested instance type is not available in your region
  + Solution: Try a different instance type or request a quota increase
+ **IAM permissions**: The execution role lacks necessary permissions
  + Solution: Verify the role has access to Amazon S3 model artifacts and necessary SageMaker permissions
+ **Model artifacts not found**: The Amazon S3 URI is incorrect or inaccessible
  + Solution: Verify the Amazon S3 URI and check bucket permissions, make sure you're in the correct region
+ **Resource limits**: Account limits exceeded for endpoints or instances
  + Solution: Request a service quota increase through Service Quotas or AWS Support

**Note**  
If you need to delete a failed endpoint and start over:  

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```
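If you also want to remove the endpoint configuration and model so the same names can be reused, a fuller cleanup sketch (assuming the `sagemaker` client and resource names from the earlier steps) is:

```
# Sketch: delete the endpoint, its configuration, and the model, skipping
# any resource that no longer exists so the cleanup is safe to re-run.

def cleanup(sagemaker, endpoint_name: str, config_name: str, model_name: str) -> list:
    """Delete endpoint resources in dependency order; return what was deleted."""
    deleted = []
    for kind, call, kwargs in [
        ("endpoint", "delete_endpoint", {"EndpointName": endpoint_name}),
        ("endpoint-config", "delete_endpoint_config", {"EndpointConfigName": config_name}),
        ("model", "delete_model", {"ModelName": model_name}),
    ]:
        try:
            getattr(sagemaker, call)(**kwargs)
            deleted.append(kind)
        except Exception as e:
            print(f"Skipping {kind}: {e}")
    return deleted

# Example:
# cleanup(sagemaker, ENDPOINT_NAME, ENDPOINT_CONFIG_NAME, MODEL_NAME)
```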

## Step 6: Invoke the endpoint


Once your endpoint is InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real-time with streaming/non-streaming modes) and asynchronous endpoints (Amazon S3-based for batch processing).

**Set up the runtime client**

Create a SageMaker Runtime client with appropriate timeout settings:

```
import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)
```

**Create a universal inference function**

The following function handles both streaming and non-streaming requests:

```
def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Body=body
            )
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            response = runtime_client.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Accept='application/json',
                Body=body
            )
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")
```

**Example 1: Non-streaming chat completion**

Use the chat format for conversational interactions:

```
# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "reasoning_effort": "low",  # Options: "low", "high"
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)
```

**Sample response:**

```
{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
```

**Example 2: Simple text completion**

Use the completion format for simple text generation:

```
# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)
```

**Sample response:**

```
{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
```

**Example 3: Streaming chat completion**

```
# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "reasoning_effort": "high",  # For more detailed reasoning
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)
```

**Sample streaming output:**

```
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
```

**Example 4: Multimodal chat completion**

Use multimodal format for image and text inputs:

```
# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
```

**Sample response:**

```
{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}
```

## Step 7: Clean up resources (Optional)


To avoid incurring unnecessary charges, delete the AWS resources you created during this tutorial. SageMaker endpoints incur charges while they're running, even if you're not actively making inference requests.

**Important**  
Deleting resources is permanent and cannot be undone. Ensure you no longer need these resources before proceeding.

**Delete the endpoint**

```
import boto3

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")
```

**Note**  
The endpoint deletion is asynchronous. You can monitor the deletion status:  

```
import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break
```

**Delete the endpoint configuration**

After the endpoint is deleted, remove the endpoint configuration:

```
try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")
```

**Delete the model**

Remove the SageMaker model object:

```
try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
```

# API Reference
API reference

Amazon Nova models on SageMaker use the standard SageMaker Runtime API for inference. For complete API documentation, see [Test a deployed model](https://docs.aws.amazon.com//sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

## Endpoint invocation


Amazon Nova models on SageMaker support two invocation methods:
+ **Synchronous invocation**: Use the [InvokeEndpoint](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API for real-time, non-streaming inference requests.
+ **Streaming invocation**: Use the [InvokeEndpointWithResponseStream](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html) API for real-time streaming inference requests.

## Request format


Amazon Nova models support two request formats:

**Chat completion format**

Use this format for conversational interactions:

```
{
  "messages": [
    {"role": "user", "content": "string"}
  ],
  "max_tokens": integer,
  "max_completion_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": boolean,
  "top_logprobs": integer,
  "reasoning_effort": "low" | "high",
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Text completion format**

Use this format for simple text generation:

```
{
  "prompt": "string",
  "max_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": integer,
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Multimodal chat completion format**

Use this format for image and text inputs:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": integer,
  "temperature": float,
  "top_p": float,
  "stream": boolean
}
```

**Request parameters**
+ `messages` (array): For chat completion format. Array of message objects with `role` and `content` fields. Content can be a string for text-only or an array for multimodal inputs.
+ `prompt` (string): For text completion format. The input text to generate from.
+ `max_tokens` (integer): Maximum number of tokens to generate in the response. Range: 1 or greater.
+ `max_completion_tokens` (integer): Alternative to `max_tokens` for chat completions. Maximum number of completion tokens to generate.
+ `temperature` (float): Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, 2.0 = maximum randomness).
+ `top_p` (float): Nucleus sampling threshold. Range: 1e-10 to 1.0.
+ `top_k` (integer): Limits token selection to top K most likely tokens. Range: -1 or greater (-1 = no limit).
+ `stream` (boolean): Whether to stream the response. Set to `true` for streaming, `false` for non-streaming.
+ `logprobs` (boolean/integer): For chat completions, use boolean. For text completions, use integer for number of log probabilities to return. Range: 1 to 20.
+ `top_logprobs` (integer): Number of most likely tokens to return log probabilities for (chat completions only).
+ `reasoning_effort` (string): Level of reasoning effort. Options: "low", "high" (chat completions for Nova 2 Lite custom models only).
+ `allowed_token_ids` (array): List of token IDs that are allowed to be generated. Restricts output to specified tokens.
+ `truncate_prompt_tokens` (integer): Truncate the prompt to this many tokens if it exceeds the limit.
+ `stream_options` (object): Options for streaming responses. Contains `include_usage` boolean to include token usage in streaming responses.
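
As a minimal sketch of these parameters in practice, the following helper assembles a chat completion request body using only the most common fields; the endpoint name and region in the commented invocation are placeholders:

```python
import json


def build_chat_request(user_text, max_tokens=100, temperature=0.7, stream=False):
    """Assemble a minimal chat completion request body."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }


body = json.dumps(build_chat_request("Hello! How are you?"))

# To send the request (placeholder endpoint name and region):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
# response = runtime.invoke_endpoint(
#     EndpointName="my-nova-endpoint",
#     ContentType="application/json",
#     Body=body,
# )
```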

## Response format


The response format depends on the invocation method and request type:

**Chat completion response (non-streaming)**

For synchronous chat completion requests:

```
{
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": [9906, 0, 358, 2157, 1049, 11, 1309, 345, 369, 6464, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "prompt_token_ids": [9906, 0, 358]
}
```

**Text completion response (non-streaming)**

For synchronous text completion requests:

```
{
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris, the capital and most populous city of France.",
      "logprobs": {
        "tokens": ["Paris", ",", " the", " capital"],
        "token_logprobs": [-0.31725305, -0.07918124, -0.12345678, -0.23456789],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403,
            "Rome": -2.1234567
          },
          {
            ",": -0.07918124,
            " is": -1.2345678
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_token_ids": [464, 6864, 315, 4881, 374],
      "token_ids": [3915, 11, 279, 6864, 323, 1455, 95551, 3363, 315, 4881, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}
```

**Chat completion streaming response**

For streaming chat completion requests, responses are sent as Server-Sent Events (SSE):

```
data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Hello",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              }
            ]
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null,
  "prompt_token_ids": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "! I'm"
      },
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}

data: [DONE]
```

**Text completion streaming response**

For streaming text completion requests:

```
data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris",
      "logprobs": {
        "tokens": ["Paris"],
        "token_logprobs": [-0.31725305],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": ", the capital",
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "",
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16
  }
}

data: [DONE]
```

**Response fields explanation**
+ `id`: Unique identifier for the completion
+ `object`: Type of object returned ("chat.completion", "text_completion", "chat.completion.chunk")
+ `created`: Unix timestamp of when the completion was created
+ `model`: Model used for the completion
+ `choices`: Array of completion choices
+ `usage`: Token usage information including prompt, completion, and total tokens
+ `logprobs`: Log probability information for tokens (when requested)
+ `finish_reason`: Reason why the model stopped generating ("stop", "length", "content_filter")
+ `delta`: Incremental content in streaming responses
+ `reasoning`: Reasoning content when `reasoning_effort` is used
+ `token_ids`: Array of token IDs for the generated text
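
To illustrate how these fields fit together, here is a small helper (illustrative only) that unpacks a non-streaming chat completion response into the pieces most applications need:

```python
def extract_completion(response):
    """Pull the assistant text, finish reason, and token usage out of a
    non-streaming chat completion response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# A trimmed-down sample response in the shape shown above:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
result = extract_completion(sample)
```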

For complete API documentation, see [InvokeEndpoint API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) and [InvokeEndpointWithResponseStream API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html).

# Evaluate Models Hosted on SageMaker Inference
Evaluate models

This guide explains how to evaluate your customized Amazon Nova models deployed on SageMaker inference endpoints using [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai), an open-source evaluation framework.

**Note**  
For a hands-on walkthrough, see the [SageMaker Inspect AI quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

## Overview


You can evaluate your customized Amazon Nova models deployed on SageMaker endpoints using standardized benchmarks from the AI research community. This approach enables you to:
+ Evaluate customized Amazon Nova models (fine-tuned, distilled, or otherwise adapted) at scale
+ Run evaluations with parallel inference across multiple endpoint instances
+ Compare model performance using benchmarks like MMLU, TruthfulQA, and HumanEval
+ Integrate with your existing SageMaker infrastructure

## Supported models


The SageMaker inference provider works with:
+ Amazon Nova models (Nova Micro, Nova Lite, Nova Lite 2)
+ Models deployed via vLLM or OpenAI-compatible inference servers
+ Any endpoint that supports the OpenAI Chat Completions API format

## Prerequisites


Before you begin, ensure you have:
+ An AWS account with permissions to create and invoke SageMaker endpoints
+ AWS credentials configured via AWS CLI, environment variables, or IAM role
+ Python 3.9 or higher

**Required IAM permissions**

Your IAM user or role needs the following permissions:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:InvokeEndpoint",
        "sagemaker:DescribeEndpoint"
      ],
      "Resource": "arn:aws:sagemaker:*:*:endpoint/*"
    }
  ]
}
```

## Step 1: Deploy a SageMaker endpoint


Before running evaluations, you need a SageMaker inference endpoint running your model.

For instructions on creating a SageMaker inference endpoint with Amazon Nova models, see [Getting Started](nova-sagemaker-inference-getting-started.md).

Once your endpoint is in `InService` status, note the endpoint name for use in the evaluation commands.

## Step 2: Install evaluation dependencies


Create a Python virtual environment and install the required packages.

```
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install uv for faster package installation
pip install uv

# Install Inspect AI and evaluation benchmarks
uv pip install inspect-ai inspect-evals

# Install AWS dependencies
uv pip install aioboto3 boto3 botocore openai
```

## Step 3: Configure AWS credentials


Choose one of the following authentication methods:

**Option 1: AWS CLI (Recommended)**

```
aws configure
```

Enter your AWS Access Key ID, Secret Access Key, and default region when prompted.

**Option 2: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```

**Option 3: IAM role**

If running on Amazon EC2 or SageMaker notebooks, the instance's IAM role is used automatically.

**Verify credentials**

```
import boto3

sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"User/Role: {identity['Arn']}")
```

## Step 4: Install the SageMaker provider


The SageMaker provider enables Inspect AI to communicate with your SageMaker endpoints. The provider installation process is streamlined in the [quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

## Step 5: Download evaluation benchmarks


Clone the Inspect Evals repository to access standard benchmarks:

```
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
```

This repository includes benchmarks such as:
+ MMLU and MMLU-Pro (knowledge and reasoning)
+ TruthfulQA (truthfulness)
+ HumanEval (code generation)
+ GSM8K (mathematical reasoning)

## Step 6: Run evaluations


Run an evaluation using your SageMaker endpoint:

```
cd inspect_evals/src/inspect_evals/

inspect eval mmlu_pro/mmlu_pro.py \
  --model sagemaker/my-nova-endpoint \
  -M region_name=us-west-2 \
  --max-connections 256 \
  --max-retries 100 \
  --display plain
```

**Key parameters**


| Parameter | Default | Description | 
| --- | --- | --- | 
| --max-connections | 10 | Number of parallel requests to the endpoint. Scale with instance count (e.g., 10 instances × 25 = 250). | 
| --max-retries | 3 | Retry attempts for failed requests. Use 50-100 for large evaluations. | 
| -M region_name | us-east-1 | AWS region where your endpoint is deployed. | 
| -M read_timeout | 600 | Request timeout in seconds. | 
| -M connect_timeout | 60 | Connection timeout in seconds. | 

**Tuning recommendations**

For a multi-instance endpoint:

```
# 10-instance endpoint example
--max-connections 250   # ~25 connections per instance
--max-retries 100       # Handle transient errors
```

Setting `--max-connections` too high may overwhelm the endpoint and cause throttling. Setting it too low underutilizes capacity.
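
The sizing guidance above is simple arithmetic; a helper like the following (illustrative only, using the ~25-connections-per-instance heuristic from the table) makes the calculation explicit:

```python
def recommended_max_connections(instance_count, per_instance=25):
    """Scale --max-connections with the number of endpoint instances."""
    return instance_count * per_instance


print(recommended_max_connections(10))  # 250
```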

## Step 7: View results


Launch the Inspect AI viewer to analyze evaluation results:

```
inspect view
```

The viewer displays:
+ Overall scores and metrics
+ Per-sample results with model responses
+ Error analysis and failure patterns

## Managing endpoints


**Update an endpoint**

To update an existing endpoint with a new model or configuration:

```
import boto3

sagemaker = boto3.client('sagemaker', region_name=REGION)

# Create new model and endpoint configuration
# Then update the endpoint
sagemaker.update_endpoint(
    EndpointName=EXISTING_ENDPOINT_NAME,
    EndpointConfigName=NEW_ENDPOINT_CONFIG_NAME
)
```

**Delete an endpoint**

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

## Onboarding custom benchmarks


You can add new benchmarks to Inspect AI using the following workflow:

1. Study the benchmark's dataset format and evaluation metrics

1. Review similar implementations in `inspect_evals/`

1. Create a task file that converts dataset records to Inspect AI samples

1. Implement appropriate solvers and scorers

1. Validate with a small test run

Example task structure:

```
from inspect_ai import Task, task
from inspect_ai.dataset import hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def my_benchmark():
    return Task(
        dataset=hf_dataset("dataset_name", split="test"),
        solver=multiple_choice(),
        scorer=choice()
    )
```

## Troubleshooting


**Common issues**

**Endpoint throttling or timeouts**
+ Reduce `--max-connections`
+ Increase `--max-retries`
+ Check endpoint CloudWatch metrics for capacity issues

**Authentication errors**
+ Verify AWS credentials are configured correctly
+ Check IAM permissions include `sagemaker:InvokeEndpoint`

**Model errors**
+ Verify the endpoint is in `InService` status
+ Check that the model supports the OpenAI Chat Completions API format

## Related resources

+ [Inspect AI Documentation](https://inspect.ai-safety-institute.org.uk/)
+ [Inspect Evals Repository](https://github.com/UKGovernmentBEIS/inspect_evals)
+ [SageMaker Developer Guide](https://docs.aws.amazon.com//sagemaker/latest/dg/whatis.html)
+ [Deploy Models for Inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html)
+ [Configuring the AWS CLI](https://docs.aws.amazon.com//cli/latest/userguide/cli-chap-configure.html)

# Deployment of Amazon Nova Forge Models in Amazon SageMaker Inference abuse detection
Abuse detection for Amazon Nova Forge

AWS is committed to the responsible use of AI. To help prevent potential misuse, when you deploy Amazon Nova Forge Models in Amazon SageMaker Inference, SageMaker Inference implements automated abuse detection mechanisms to identify potential violations of AWS's [Acceptable Use Policy](https://aws.amazon.com/aup/) (AUP) and Service Terms, including the [Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/).

Our abuse detection mechanisms are fully automated, so there is no human review of, or access to, user inputs or model outputs.

Automated abuse detection includes:
+ **Categorize content** – We use classifiers to detect harmful content (such as content that incites violence) in user inputs and model outputs. A classifier is an algorithm that processes model inputs and outputs, and assigns a type of harm and a level of confidence. We may run these classifiers on Amazon Nova Forge Model usage. The classification process is automated and does not involve human review of user inputs or model outputs.
+ **Identify patterns** – We use classifier metrics to identify potential violations and recurring behavior. We may compile and share anonymized classifier metrics. Amazon SageMaker Inference does not store user input or model output.
+ **Detecting and blocking child sexual abuse material (CSAM)** – You are responsible for the content you (and your end users) upload to Amazon SageMaker Inference and must ensure this content is free from illegal images. To help stop the dissemination of CSAM, when deploying an Amazon Nova Forge Model in Amazon SageMaker Inference, SageMaker Inference may use automated abuse detection mechanisms (such as hash matching technology or classifiers) to detect apparent CSAM. If Amazon SageMaker Inference detects apparent CSAM in your image inputs, Amazon SageMaker Inference will block the request and you will receive an automated error message. Amazon SageMaker Inference may also file a report with the National Center for Missing and Exploited Children (NCMEC) or a relevant authority. We take CSAM seriously and will continue to update our detection, blocking, and reporting mechanisms. You might be required by applicable laws to take additional actions, and you are responsible for those actions.

Once our automated abuse detection mechanisms identify potential violations, we may request information about your use of Amazon SageMaker Inference and compliance with our terms of service. In the event that you are non-responsive, unwilling, or unable to comply with these terms or policies, AWS may suspend your access to Amazon SageMaker Inference. You may also be billed for the failed inference job if our automated tests detect model responses being inconsistent with our terms and policies.

Contact AWS Support if you have additional questions. For more information, see the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/ai/faqs/).

# Amazon Bedrock inference
Amazon Bedrock inference

Once you’ve trained and tested your Amazon Nova model, you can deploy it to Amazon Bedrock for production-scale inference. The deployment process involves creating an Amazon Bedrock model with the CreateCustomModel API, exporting your model artifacts to it from a managed Amazon S3 bucket, and then, once the model is `ACTIVE`, configuring an endpoint with on-demand or provisioned throughput inference.

After you create a custom model in SageMaker, you can use the CreateCustomModel API to deploy it to Amazon Bedrock from the SageMaker escrow bucket to run inference. You can then use CreateCustomModelDeployment to create an on-demand inference endpoint, or set up provisioned throughput inference, for a parameter-efficient fine-tuned (PEFT) model. For a full-rank custom model, you can set up provisioned throughput inference.

You can also use the Amazon Nova Forge SDK to deploy customized Amazon Nova models. The Amazon Nova Forge SDK provides a streamlined experience for extracting the relevant information from a training job or S3 model checkpoint and publishing it to Amazon Bedrock. For more information, see [Amazon Nova Forge SDK](nova-forge-sdk.md).

For detailed steps to set up Amazon Bedrock inference for a custom model, see [Deploying customized Amazon Nova models to Amazon Bedrock](deploy-custom-model.md).

The following section gives more detail about on-demand inference for custom models.

# Deploy a custom model for on-demand inference


After you successfully create a custom model with a model customization job (fine-tuning, distillation, or continued pre-training), you can set up on-demand inference for the model.

To set up on-demand inference for a custom model, you deploy the model with a custom model deployment. After you deploy your custom model, you use the deployment's Amazon Resource Name (ARN) as the `modelId` parameter in your `InvokeModel` or `Converse` API operations. You can use the deployed model for on-demand inference with Amazon Bedrock features such as playgrounds, Agents, and Knowledge Bases. 
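
As a sketch of that flow, the following builds the arguments for a `Converse` call that uses the deployment's ARN as the `modelId`; the ARN shown is a placeholder, and the commented invocation requires the deployment to be active:

```python
def build_converse_args(deployment_arn, user_text):
    """Assemble keyword arguments for a bedrock-runtime Converse call,
    using the custom model deployment's ARN as the modelId."""
    return {
        "modelId": deployment_arn,
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
        ],
    }


args = build_converse_args(
    # Placeholder ARN; use the ARN of your custom model deployment.
    "arn:aws:bedrock:us-east-1:111122223333:custom-model-deployment/example",
    "Hello!",
)

# To invoke (requires boto3 credentials and an active deployment):
# import boto3
# bedrock_runtime = boto3.client("bedrock-runtime")
# response = bedrock_runtime.converse(**args)
# print(response["output"]["message"]["content"][0]["text"])
```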

**Topics**
+ [Supported models](#custom-model-inference-supported-models)
+ [Deploy a custom model](deploying-custom-model.md)
+ [Use a deployment for on-demand inference](use-custom-model-on-demand.md)
+ [Delete a custom model deployment](delete-custom-model-deployment.md)

## Supported models


You can set up on-demand inference for the following models:
+ Amazon Nova Canvas
+ Amazon Nova Lite
+ Amazon Nova Micro
+ Amazon Nova Pro

# Deploy a custom model


You can deploy a custom model with the Amazon Bedrock console, AWS Command Line Interface, or AWS SDKs. For information about using the deployment for inference, see [Use a deployment for on-demand inference](https://docs.aws.amazon.com/bedrock/latest/userguide/use-custom-model-on-demand.html). 

**Topics**
+ [Deploy a custom model (console)](#deploy-custom-model-console)
+ [Deploy a custom model (AWS Command Line Interface)](#deploy-custom-model-cli)
+ [Deploy a custom model (AWS SDKs)](#deploy-custom-model-sdk)

## Deploy a custom model (console)


You deploy a custom model from the **Custom models** page as follows. You can also deploy a model with the same fields from the **Custom model on-demand** page. To find that page, choose **Custom model on-demand** under **Inference and Assessment** in the navigation pane.

**To deploy a custom model**

1. Sign in to the AWS Management Console using an [IAM role with Amazon Bedrock permissions](https://docs.aws.amazon.com//bedrock/latest/userguide/getting-started.html), and open the Amazon Bedrock console at [https://console.aws.amazon.com/nova/](https://console.aws.amazon.com/nova/).

1. From the left navigation pane, choose **Custom models** under **Foundation models**.

1. In the **Models** tab, choose the radio button for the model you want to deploy.

1. Choose **Set up inference** and choose **Deploy for on-demand**.

1. In **Deployment details**, provide the following information:
   + **Deployment Name** (required) – Enter a unique name for your deployment.
   + **Description** (optional) – Enter a description for your deployment.
   + **Tags** (optional) – Add tags for cost allocation and resource management.

1. Choose **Create**. When the status shows `Completed`, your custom model is ready for on-demand inference. For more information about using the custom model, see [Use a deployment for on-demand inference](https://docs.aws.amazon.com/bedrock/latest/userguide/use-custom-model-on-demand.html).

## Deploy a custom model (AWS Command Line Interface)


To deploy a custom model for on-demand inference using the AWS Command Line Interface, use the `create-custom-model-deployment` command with your custom model's Amazon Resource Name (ARN). This command uses the [CreateCustomModelDeployment](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateCustomModelDeployment.html) API operation. It returns the deployment's ARN that you can use as the `modelId` when making inference requests. For information about using the deployment for inference, see [Use a deployment for on-demand inference](https://docs.aws.amazon.com/bedrock/latest/userguide/use-custom-model-on-demand.html).

```
aws bedrock create-custom-model-deployment \
--model-deployment-name "Unique name" \
--model-arn "Custom Model ARN" \
--description "Deployment description" \
--tags '[
    {
        "key": "Environment",
        "value": "Production"
    },
    {
        "key": "Team",
        "value": "ML-Engineering"
    },
    {
        "key": "Project",
        "value": "CustomerSupport"
    }
]' \
--client-request-token "unique-deployment-token" \
--region region
```

## Deploy a custom model (AWS SDKs)


To deploy a custom model for on-demand inference, use the [CreateCustomModelDeployment](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateCustomModelDeployment.html) API operation with your custom model's Amazon Resource Name (ARN). The response returns the deployment's ARN that you can use as the `modelId` when making inference requests. For information about using the deployment for inference, see [Use a deployment for on-demand inference](https://docs.aws.amazon.com/bedrock/latest/userguide/use-custom-model-on-demand.html).

The following code shows how to use the SDK for Python (Boto3) to deploy a custom model. 

```
import uuid

def create_custom_model_deployment(bedrock_client):
    """Create a custom model deployment
    Args:
        bedrock_client: A boto3 Bedrock client for making API calls
 
    Returns:
        str: The ARN of the created custom model deployment
 
    Raises:
        Exception: If there is an error creating the deployment
    """
 
    try:
        response = bedrock_client.create_custom_model_deployment(
            modelDeploymentName="Unique deployment name",
            modelArn="Custom Model ARN",
            description="Deployment description",
            tags=[
                {'key': 'Environment', 'value': 'Production'},
                {'key': 'Team', 'value': 'ML-Engineering'},
                {'key': 'Project', 'value': 'CustomerSupport'}
            ],
            clientRequestToken=f"deployment-{uuid.uuid4()}"
        )
 
        deployment_arn = response['customModelDeploymentArn']
        print(f"Deployment created: {deployment_arn}")
        return deployment_arn
 
    except Exception as e:
        print(f"Error creating deployment: {str(e)}")
        raise
```
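Deployment creation is asynchronous, so you typically poll the deployment until it's ready before sending inference requests. The following sketch assumes the GetCustomModelDeployment operation returns a top-level `status` field that transitions from `Creating` to `Active` (or `Failed`); confirm the exact field name and values against the API reference.

```python
import time

def wait_for_deployment(bedrock_client, deployment_arn, poll_seconds=30, max_attempts=40):
    """Poll a custom model deployment until it finishes creating.

    Assumes the response carries a top-level 'status' field with values
    such as 'Creating', 'Active', or 'Failed' (verify against the API docs).
    """
    for _ in range(max_attempts):
        response = bedrock_client.get_custom_model_deployment(
            customModelDeploymentIdentifier=deployment_arn
        )
        status = response["status"]
        if status in ("Active", "Failed"):
            # Terminal state reached; return it to the caller.
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Deployment {deployment_arn} did not finish in time")
```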

# Use a deployment for on-demand inference


After you deploy your custom model for on-demand inference, you can use it to generate responses by making inference requests. For `InvokeModel` or `Converse` operations, you use the deployment Amazon Resource Name (ARN) as the `modelId`.

For information about making inference requests, see the following topics:
+ [Submit prompts and generate responses with model inference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference.html)
+ [Prerequisites for running model inference](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-prereq.html)
+ [Submit prompts and generate responses using the API](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-api.html)
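As a minimal sketch, the following helper shows a single-turn `Converse` request through Boto3's `bedrock-runtime` client, passing the deployment ARN (not a foundation model ID) as `modelId`. The helper name and the inference parameters are illustrative; the message shape follows the standard Converse API.

```python
def converse_with_deployment(runtime_client, deployment_arn, prompt):
    """Send a single-turn Converse request to a custom model deployment.

    The deployment ARN is passed as modelId, per the on-demand
    inference model for custom deployments.
    """
    response = runtime_client.converse(
        modelId=deployment_arn,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    # The assistant reply is the first text block of the output message.
    return response["output"]["message"]["content"][0]["text"]
```

In a real application you would create the client with `boto3.client("bedrock-runtime", region_name=...)` and pass it in.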

# Delete a custom model deployment


After you are finished using your model for on-demand inference, you can delete the deployment. After you delete a deployment, you can no longer use it for on-demand inference, but the underlying custom model isn't deleted.

You can delete a custom model deployment with the Amazon Bedrock console, AWS Command Line Interface, or AWS SDKs.

**Important**  
Deleting a custom model deployment is irreversible. Make sure you no longer need the deployment before proceeding with the deletion. If you need to use the custom model for on-demand inference again, you must create a new deployment.

**Topics**
+ [Delete a custom model deployment (console)](#delete-deployment-console)
+ [Delete a custom model deployment (AWS Command Line Interface)](#delete-deployment-cli)
+ [Delete a custom model deployment (AWS SDKs)](#delete-deployment-sdk)

## Delete a custom model deployment (console)


**To delete a custom model deployment**

1. In the navigation pane, under **Inference and Assessment**, choose **Custom model on-demand**.

1. Choose the custom model deployment you want to delete.

1. Choose **Delete**.

1. In the confirmation dialog, enter the deployment name to confirm the deletion.

1. Choose **Delete** to confirm.

The deployment status changes to `Deleting` while the deletion is in progress. When the deletion completes, the deployment is removed from the list.

## Delete a custom model deployment (AWS Command Line Interface)


To delete a custom model deployment using the AWS Command Line Interface, use the `delete-custom-model-deployment` command with your deployment identifier.

```
aws bedrock delete-custom-model-deployment \
--custom-model-deployment-identifier "deployment-arn-or-name" \
--region region
```

## Delete a custom model deployment (AWS SDKs)


To delete a custom model deployment programmatically, use the [DeleteCustomModelDeployment](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_DeleteCustomModelDeployment.html) API operation with the deployment's Amazon Resource Name (ARN) or name. The following code shows how to use the SDK for Python (Boto3) to delete a custom model deployment. 

```
def delete_custom_model_deployment(bedrock_client):
    """Delete a custom model deployment
 
    Args:
        bedrock_client: A boto3 Bedrock client for making API calls
 
    Returns:
        dict: The response from the delete operation
 
    Raises:
        Exception: If there is an error deleting the deployment
    """
 
    try:
        response = bedrock_client.delete_custom_model_deployment(
            customModelDeploymentIdentifier="Deployment identifier"
        )
 
        print("Deployment deletion initiated")
        return response
 
    except Exception as e:
        print(f"Error deleting deployment: {str(e)}")
        raise
```

## On-demand inference on Custom Models
On-demand inference

On-demand (OD) inference lets you run inference on your custom Amazon Nova models without maintaining provisioned throughput endpoints, which helps you optimize costs and scale efficiently. With on-demand inference, you are charged based on usage, measured in input and output tokens.

### Compatibility requirements


The following compatibility requirements apply:
+ OD inference is supported for Amazon Nova Pro, Lite, and Micro custom understanding models. OD inference is not supported for Amazon Nova custom content generation models.
+ OD inference is supported for Amazon Nova custom understanding models trained after July 16, 2025. Custom models trained before July 16, 2025 are not compatible with OD inference. 
+ Amazon Bedrock customization: OD inference is supported for models customized with Amazon Bedrock customization and for student models that were distilled from a teacher model with Amazon Bedrock.
+ SageMaker AI customization: For models customized in SageMaker AI, OD inference is supported only for Parameter-efficient fine-tuned (PEFT) models when the model is hosted on Amazon Bedrock. This includes Direct Preference Optimization plus PEFT. OD inference is not supported for Full-rank fine-tuned models.

### Model training and inference


When you train a new custom Amazon Nova Pro, Lite, or Micro model on Amazon Bedrock or SageMaker AI using PEFT after July 16, 2025, the model will automatically be compatible with both provisioned and on-demand inference options. You can select your preferred inference method when you deploy your model.

To use OD inference with a model trained after July 16, 2025, complete the following steps:

1. Create a new fine-tuning job with either the [Amazon Bedrock customization API](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-bedrock.html) or the [SageMaker AI customization API](https://docs.aws.amazon.com/nova/latest/userguide/custom-fine-tune-models-sagemaker-tj.html).

1. Deploy the newly trained model to Amazon Bedrock using the [CreateCustomModel API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateCustomModel.html).

1. Deploy for on-demand inference using the [CreateCustomModelDeployment API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateCustomModelDeployment.html).
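Steps 2 and 3 above can be sketched in Boto3 as follows. The `modelSourceConfig` shape (`s3DataSource`/`s3Uri`) and the helper name are assumptions to verify against the CreateCustomModel API reference.

```python
import uuid

def deploy_checkpoint_on_demand(bedrock_client, model_name, checkpoint_s3_uri, role_arn):
    """Import a PEFT-trained Nova checkpoint into Amazon Bedrock, then
    create an on-demand deployment for it.

    The modelSourceConfig shape is an assumption; confirm it against
    the CreateCustomModel API reference before use.
    """
    # Step 2: register the SageMaker-trained checkpoint as a Bedrock custom model.
    model = bedrock_client.create_custom_model(
        modelName=model_name,
        modelSourceConfig={"s3DataSource": {"s3Uri": checkpoint_s3_uri}},
        roleArn=role_arn,
        clientRequestToken=f"import-{uuid.uuid4()}",
    )

    # Step 3: create an on-demand deployment for the new custom model.
    deployment = bedrock_client.create_custom_model_deployment(
        modelDeploymentName=f"{model_name}-od",
        modelArn=model["modelArn"],
        clientRequestToken=f"deploy-{uuid.uuid4()}",
    )
    return deployment["customModelDeploymentArn"]
```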

### Rate limits


The following requests per minute (RPM) and tokens per minute (TPM) limits apply to on-demand inference requests:


| Base Model for Custom Model | RPM per Custom Model Deployment | TPM per Custom Model Deployment | 
| --- |--- |--- |
| Nova 2 Lite | 2,000 | 4,000,000 | 

To learn more about the quotas available for Amazon Nova, see [Quotas for Amazon Nova](quotas.md).
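When a deployment's RPM or TPM quota is exceeded, requests are throttled. As an illustrative (not prescriptive) pattern, the following sketch retries a Converse call with exponential backoff and jitter when a `ThrottlingException` is raised; other errors are re-raised immediately.

```python
import random
import time

def converse_with_backoff(runtime_client, deployment_arn, messages, max_retries=5):
    """Retry a Converse call with exponential backoff when throttled.

    Retries only on ThrottlingException; any other error is re-raised.
    """
    for attempt in range(max_retries):
        try:
            return runtime_client.converse(modelId=deployment_arn, messages=messages)
        except Exception as err:
            # Boto3 ClientError instances carry the error code in err.response.
            code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
            if code != "ThrottlingException" or attempt == max_retries - 1:
                raise
            # Back off exponentially with jitter, capped at 30 seconds.
            time.sleep(min(2 ** attempt + random.random(), 30))
```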

### Latency


You can expect a latency difference of 20-55% in Time To First Token (TTFT) between invoking the base model and invoking the adapter. The exact latency value varies by model size and is in line with industry standards.

# Limitations of customizing Amazon Nova models
Limitations

Amazon Nova customization doesn't support the following capabilities on SageMaker.
+ **SSH into the instance to find the metrics**

  Due to security controls in place, you can't SSH into the master node (the `algo-1` training instance) to check memory or NVIDIA GPU statistics or to validate the training steps.
+ **Support for SageMaker trainer SDK**

  The SageMaker trainer SDK is currently not available for starting training jobs. Today, you can start training jobs only through the **estimator API**.
+ **Warm pools are not accessible to SageMaker training jobs**

  Due to security controls in place, SageMaker warm pools can't be used to keep instances warm until their time to live expires.
+ **Custom model merging**

  Merging multiple models is not currently supported. This means that creating multiple LoRA adapters and performing a multi-merge operation with the base model is not available.
+ **Supported observability tool**

  [TensorBoard](https://www.tensorflow.org/tensorboard) is the only supported observability tool for viewing metrics from SageMaker training jobs. [MLflow](https://mlflow.org/) and [WandB](https://wandb.ai/site/) are not currently supported. For more information about using TensorBoard in SageMaker, see [TensorBoard in SageMaker](https://docs.aws.amazon.com//sagemaker/latest/dg/tensorboard-on-sagemaker.html). 