

# Training for Amazon Nova models
<a name="nova-hp-training"></a>

Training Amazon Nova models on SageMaker HyperPod supports multiple techniques including Continued Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Fine-Tuning (RFT). Each technique serves different customization needs and can be applied to different Amazon Nova model versions.

**Topics**
+ [Continued pre-training (CPT)](nova-cpt.md)
+ [Supervised fine-tuning (SFT)](nova-fine-tune.md)
+ [Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod](nova-hp-rft.md)

# Continued pre-training (CPT)
<a name="nova-cpt"></a>

Continued pre-training (CPT) is a training technique that extends the pre-training phase of a foundation model by exposing it to additional unlabeled text from specific domains or corpora. Unlike supervised fine-tuning, which requires labeled input-output pairs, CPT trains on raw documents to help the model acquire deeper knowledge of new domains, learn domain-specific terminology and writing patterns, and adapt to particular content types or subject areas.

This approach is particularly valuable when you have large volumes (tens of billions of tokens) of domain-specific text data, such as legal documents, medical literature, technical documentation, or proprietary business content, and you want the model to develop native fluency in that domain. Generally, after the CPT stage, the model needs to undergo additional instruction tuning stages to enable the model to use the newly acquired knowledge and complete useful tasks.

**Supported models**  
CPT is available for the following Amazon Nova models:
+ Nova 1.0 (Micro, Lite, Pro)
+ Nova 2.0 (Lite)

**When to use Nova 1.0 versus Nova 2.0**  
The Amazon Nova family of models offers multiple price-performance operating points to optimize between accuracy, speed, and cost.

Choose Nova 2.0 when you need the following:
+ Advanced reasoning capabilities for complex analytical tasks
+ Superior performance on coding, math, and scientific problem-solving
+ Longer context length support
+ Better multilingual performance

**Note**  
The larger model is not always better. Consider the cost-performance tradeoff and your specific business requirements when selecting between Nova 1.0 and Nova 2.0 models.

# CPT on Nova 2.0
<a name="nova-cpt-2"></a>

Amazon Nova Lite 2.0 is a reasoning model trained on larger and more diverse datasets than Nova Lite 1.0. Despite being a larger model, Nova Lite 2.0 delivers faster inference than Nova Lite 1.0 while offering enhanced reasoning capabilities, longer context lengths, and improved multilingual performance.

CPT on Nova 2.0 allows you to extend these advanced capabilities with your domain-specific data, enabling the model to develop deep expertise in specialized areas while maintaining its superior reasoning and analytical abilities.

## Sample CPT recipe
<a name="nova-cpt-2-sample-recipe"></a>

The following is a sample recipe for CPT. You can find this recipe and others in the [recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/nova) repository.

```
# Note:
# This recipe can run on p5.48xlarge
# Run config
run:
  name: "my-cpt-run"                           # A descriptive name for your training job
  model_type: "amazon.nova-2-lite-v1:0:256k"   # Model variant specification, do not change
  model_name_or_path: "nova-lite-2/prod"        # Base model path, do not change
  replicas: 8                                   # Number of compute instances for training, allowed values are 4, 8, 16, 32
  data_s3_path: ""                              # Customer data paths
  validation_data_s3_path: ""                   # Customer validation data paths
  output_s3_path: ""                            # Output artifact path,  job-specific configuration - not compatible with standard SageMaker Training Jobs
  mlflow_tracking_uri: ""                       # Required for MLFlow
  mlflow_experiment_name: "my-cpt-experiment"   # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-cpt-run"                 # Optional for MLFlow. Note: leave this field non-empty

## Training specific configs
training_config:
  task_type: cpt
  max_length: 8192                              # Maximum context window size (tokens)
  global_batch_size: 256                        # Global batch size, allowed values are 32, 64, 128, 256.

  trainer:
    max_steps: 10                               # The number of training steps to run total
    val_check_interval: 10                      # The number of steps between running validation. Integer count or float percentage
    limit_val_batches: 2                        # Batches of the validation set to use each trigger

  model:
    hidden_dropout: 0.0                         # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0                      # Dropout for attention weights, must be between 0.0 and 1.0

  optim:
    optimizer: adam
    lr: 1e-5                                    # Learning rate
    name: distributed_fused_adam                # Optimizer algorithm, do not change
    adam_w_mode: true                           # Enable AdamW mode
    eps: 1e-06                                  # Epsilon for numerical stability
    weight_decay: 0.0                           # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                             # Beta1 for Adam optimizer
    adam_beta2: 0.95                            # Beta2 for Adam optimizer
    sched:
      warmup_steps: 10                          # Learning rate warmup steps
      constant_steps: 0                         # Steps at constant learning rate
      min_lr: 1e-6                              # Minimum learning rate, must be lower than lr
```
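The allowed values called out in the recipe comments can be checked before a job is submitted. The following sketch is illustrative only; the field names mirror the sample recipe above, and it is not an official validator:

```python
# Hypothetical pre-submission check for the CPT recipe values above.
# The allowed sets come from the comments in the sample recipe.
ALLOWED_REPLICAS = {4, 8, 16, 32}
ALLOWED_BATCH_SIZES = {32, 64, 128, 256}

def validate_cpt_recipe(run: dict, training: dict) -> list:
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = []
    if run.get("replicas") not in ALLOWED_REPLICAS:
        problems.append(f"replicas must be one of {sorted(ALLOWED_REPLICAS)}")
    if training.get("global_batch_size") not in ALLOWED_BATCH_SIZES:
        problems.append(f"global_batch_size must be one of {sorted(ALLOWED_BATCH_SIZES)}")
    optim = training.get("optim", {})
    lr = optim.get("lr")
    min_lr = optim.get("sched", {}).get("min_lr")
    if lr is not None and min_lr is not None and not min_lr < lr:
        problems.append("sched.min_lr must be lower than optim.lr")
    model = training.get("model", {})
    for key in ("hidden_dropout", "attention_dropout"):
        if not 0.0 <= model.get(key, 0.0) <= 1.0:
            problems.append(f"{key} must be between 0.0 and 1.0")
    return problems

run = {"replicas": 8}
training = {
    "global_batch_size": 256,
    "model": {"hidden_dropout": 0.0, "attention_dropout": 0.0},
    "optim": {"lr": 1e-5, "sched": {"min_lr": 1e-6}},
}
print(validate_cpt_recipe(run, training))  # an empty list means the values are consistent
```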

## Data preparation for CPT on 2.0
<a name="nova-cpt-2-data-prep"></a>

**Data format requirements**  
Training and validation datasets must be JSONL files following the format shown below, where each line contains a JSON object representing a conversation with the required fields and structure. Here is an example:

```
{"text": "AWS stands for Amazon Web Services"}
{"text": "Amazon SageMaker is a fully managed machine learning service"}
{"text": "Amazon Bedrock is a fully managed service for foundation models"}
```

Text entries should contain naturally flowing, high-quality content that represents the target domain.
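For instance, a set of domain documents can be written into this layout with a few lines of Python. The file name and document strings below are placeholders:

```python
import json

# Illustrative only: write domain documents into the {"text": ...} JSONL layout
# expected by CPT, one JSON object per line.
documents = [
    "AWS stands for Amazon Web Services",
    "Amazon SageMaker is a fully managed machine learning service",
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```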

Test that the data can be converted into [Arrow format](https://huggingface.co/docs/datasets/en/about_arrow). The Python script below can help with this. Ensure that `datasets` version 2.18.0 or later is used:

```
from pathlib import Path

from datasets import load_dataset, load_from_disk

input_path = Path("<Your jsonl file>")
output_path = Path("<Your output directory>")

# Convert the JSONL file into Arrow format on disk
dataset = load_dataset("json", data_files=str(input_path), split="train")
dataset.save_to_disk(str(output_path), max_shard_size="1GB")

# Reload the saved dataset to verify the conversion succeeded
try:
    test_dataset = load_from_disk(str(output_path))
    print(f"Dataset loaded successfully ✅! Contains {len(test_dataset)} samples")
except Exception as e:
    print(e)
```

The script should print the same number of samples as there were lines in the JSONL file.

When using data mixing, run the first job with `max_steps=2`. This short run warms up data-access optimizations in the cluster and validates that all of the data mixes are available.

**How to prepare data for CPT**  
Training data is the most important factor in the success of continued pre-training. While CPT data is often described as "unlabeled," the reality is more nuanced. How data is structured, formatted, and presented determines whether the model acquires the knowledge and skills required for the business use case.

### Preparing structured business datasets for CPT
<a name="nova-cpt-2-structured-data"></a>

This is a common challenge for companies and organizations building foundation models specialized in their domain. Most businesses possess rich repositories of structured data: product catalogs, user profiles, transaction logs, form submissions, API calls, and operational metadata. At first glance, this looks very different from the unstructured web text typically used in standard pre-training.

To effectively learn from structured business data, think carefully about downstream tasks and design the data presentation to force the model to learn the right predictive relationships.

To unlock the full potential of continuous pre-training, consider:
+ What tasks the model should perform at inference time
+ What information is present in the raw data
+ How to structure that data so the model learns to extract and manipulate the information correctly

Simply dumping structured data into training won't teach the model to reason about it. Actively shape the data presentation to guide what the model learns.

The following sections review literature demonstrating the importance of data augmentation and provide example augmentation strategies for structured business data, offering useful ideas on how to prepare and organize a business dataset for CPT.

**Structured data for CPT in the literature**  
CPT can pack domain facts into a model but often fails to make those facts retrievable and manipulable when inputs or tasks shift. Controlled experiments show that without diverse augmentation during pre-training, models memorize facts in brittle ways that remain hard to extract even after later instruction tuning, and the authors recommend injecting instruction-like signals early in training. For semi-structured data, randomized serialization and other augmentations reduce schema overfitting, which is why CPT should be interleaved with instruction-style tasks rather than run first with instruction fine-tuning (IFT) afterward. Finance-focused work further finds that jointly mixing CPT and instruction data at batch time improves generalization and reduces forgetting compared to the sequential recipe. The Qwen technical report converges on the same pattern by integrating high-quality instruction data into pre-training itself, which boosts in-context learning and preserves instruction following while the model acquires new domain knowledge.

Data augmentation for semi-structured corpora is a key lever. Synthetic graph-aware CPT expands small domain sets into entity-linked corpora that explicitly teach relationships, and it compounds with retrieval at inference time. Joint CPT-plus-instruction mixing outperforms sequential pipelines in finance, and balancing domain data with general data lowers degradation on general skills. Very-large-scale domain CPT can also retain broad ability and even allow trade-offs through model merging, yet it still points to instruction tuning as an essential next step, reinforcing the value of introducing instruction signals during CPT.

**Injecting diversity through randomization and shuffling**  
A general strategy that helps the model learn effectively from structured and semi-structured datasets is to shuffle the order of fields in each record, and even to randomly drop some keys.

Shuffling the fields forces the model to read what each value means instead of where it appears, and to learn the relationships between all the fields. For example, for a video game listed on the Amazon store, when "Title," "Platform," "Price," "Condition," and "Edition" arrive in different permutations, the model can't rely on "the third slot is platform"; it must bind labels to values and learn the bilateral relationships among attributes: title ⇄ platform, platform ⇄ price, condition ⇄ price. It can then, for example, infer a likely platform from a game name and an observed price, or estimate a plausible price range given a title and platform.

Randomly dropping keys during serialization acts like feature dropout: it prevents co-adaptation on any one field and forces the model to recover missing information from the remaining evidence. If "Platform" is absent, the model must pick it up from the title string or compatibility text; if "Price" is hidden, it has to triangulate from platform, edition, and condition. This builds symmetry (A→B and B→A), robustness to messy real-world listings, and schema invariance when fields are missing, renamed, or reordered.

A shopping-style example makes this concrete. Serialize the same item multiple ways, for example "Title: 'Elden Ring' | Platform: PlayStation 5 | Condition: Used - Like New | Price: \$134.99" and a permutation like "Price: \$134.99 | Title: 'Elden Ring' | Condition: Used - Like New | Platform: PlayStation 5", and on some passes drop "Platform" while leaving "Compatible with PS5" in the description. Train complementary objectives such as predicting platform from (title, price) and predicting a price bucket from (title, platform). Because the order and even the presence of keys vary, the only stable strategy is to learn the true relationships between attributes rather than memorize a template.
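The shuffling-and-dropping strategy above can be sketched in a few lines of Python. This is illustrative only; the record fields, the "Key: value" separator, and the drop probability are arbitrary choices, not a prescribed format:

```python
import json
import random

# Serialize the same structured record several ways: shuffle field order and
# randomly drop non-required keys, as described in the text above.
def serialize_variants(record, n_variants=3, drop_prob=0.3, required=("Title",), seed=0):
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        fields = [(k, v) for k, v in record.items()
                  if k in required or rng.random() > drop_prob]
        rng.shuffle(fields)
        variants.append(" | ".join(f"{k}: {v}" for k, v in fields))
    return variants

record = {
    "Title": "'Elden Ring'",
    "Platform": "PlayStation 5",
    "Condition": "Used - Like New",
    "Price": "$134.99",
}
for text in serialize_variants(record):
    print(json.dumps({"text": text}))  # one CPT training line per variant
```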

### The way data is presented matters
<a name="nova-cpt-2-data-presentation"></a>

LLMs learn by predicting the next token from what they have already seen. So the order of fields and events shown during training decides what the model can learn. If the training format matches the real task, the loss lands on the exact decision tokens. If fields are tossed together without structure, the model learns shortcuts or memorizes popularity and then fails when asked to choose among options.

Show the situation first, then the options, then the decision. If the model should also learn about outcomes or explanations, put them after the decision.

### Packing samples for CPT
<a name="nova-cpt-2-packing"></a>

**What is packing?**  
Packing means filling each sequence window in the training data with multiple whole examples so the window is dense with real tokens, not padding.

**Why it matters**  
During training, a maximum context length is set, for example 8,192 tokens. Batches are shaped as [batch size × context length]. If a training example is shorter than the context length, the remaining positions are padded. Padding still runs through the attention and MLP kernels even when the loss is masked, so compute is spent on tokens that carry no learning signal.
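A quick back-of-envelope calculation shows how much compute padding can waste without packing. The sample lengths below are made up for illustration; the context length matches the 8,192-token example:

```python
# Fraction of batch positions that are padding when each short example
# occupies its own context-length window.
context_length = 8192
sample_lengths = [1200, 800, 3000, 450]  # tokens per example, one example per row

total_positions = len(sample_lengths) * context_length
real_tokens = sum(sample_lengths)
padding_fraction = 1 - real_tokens / total_positions
print(f"{padding_fraction:.1%} of positions are padding")  # roughly 83% for these lengths
```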

**How to do packing?**  
To pack multiple samples, concatenate training samples with a ` [DOC] ` separator between them (note the space before and after `[DOC]`) such that the combined length of the samples stays under the desired context length.

An example packed document would look like this:

```
{"text": "training sample 1 [DOC] training sample 2 [DOC] training sample 3"}
```
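A minimal greedy packer for this convention might look like the following sketch. It uses whitespace word counts as a rough stand-in for the real tokenizer, so treat it as illustrative only:

```python
# Greedily pack whole samples into windows joined by " [DOC] ", keeping each
# window under a token budget. Whitespace splitting approximates tokenization.
SEP = " [DOC] "

def pack_samples(samples, max_tokens):
    packed, current, current_tokens = [], [], 0
    for text in samples:
        n = len(text.split())           # crude token estimate
        sep_cost = 1 if current else 0  # separator counts toward the budget
        if current and current_tokens + sep_cost + n > max_tokens:
            packed.append(SEP.join(current))  # flush the full window
            current, current_tokens = [], 0
            sep_cost = 0
        current.append(text)
        current_tokens += sep_cost + n
    if current:
        packed.append(SEP.join(current))
    return packed

samples = ["training sample one", "training sample two", "a much longer training sample three"]
for doc in pack_samples(samples, max_tokens=8):
    print({"text": doc})
```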

### CPT Tuning Parameters
<a name="nova-cpt-2-tuning-parameters"></a>

The parameters that are available for tuning with CPT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod`, or the S3 path of a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and AWS Region.
+ **mlflow_tracking_uri**: The ARN of the MLFlow App to use for MLFlow logging.
+ **mlflow_experiment_name**: The MLFlow experiment name.
+ **mlflow_run_name**: The MLFlow run name.

**Training Configuration**  

+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 8192 tokens for CPT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the max_length parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward and backward pass across all devices and workers.

  This value is the product of the per-device batch size and the number of devices. It affects training stability and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.

**Trainer Settings**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` samples.

**Model Settings**  

+ **hidden_dropout**: The probability of dropping hidden state outputs. Increase this value (by approximately 0.0-0.2) to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
+ **attention_dropout**: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
+ **name**: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
+ **weight_decay**: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
+ **warmup_steps**: The number of steps over which to gradually increase the learning rate. This improves training stability. Valid values are between 1-20, inclusive.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than the learning rate.
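To see how the schedule parameters interact, here is an illustrative warmup-then-decay schedule using the sample recipe's values. The linear decay shape is an assumption for illustration; the actual trainer's schedule may differ:

```python
# Illustrative learning-rate schedule: linear warmup, optional constant phase,
# then linear decay down to min_lr (the decay shape is an assumption).
def learning_rate(step, lr=1e-5, min_lr=1e-6, warmup_steps=10,
                  constant_steps=0, max_steps=100):
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps        # warmup phase
    if step < warmup_steps + constant_steps:
        return lr                                    # constant phase
    decay_steps = max(max_steps - warmup_steps - constant_steps, 1)
    progress = min((step - warmup_steps - constant_steps) / decay_steps, 1.0)
    return lr - (lr - min_lr) * progress             # decay toward min_lr

print(learning_rate(0))    # first warmup step: a fraction of lr
print(learning_rate(10))   # warmup finished: full lr
print(learning_rate(100))  # end of training: min_lr
```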

# Supervised fine-tuning (SFT)
<a name="nova-fine-tune"></a>

The SFT training process consists of two main stages:
+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. This configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (epochs, learning rate, batch size)
  + Optional components (distributed training parameters, etc)

## Nova Model Comparison and Selection
<a name="nova-model-comparison"></a>

Amazon Nova 2.0 is a model trained on a larger and more diverse dataset than Amazon Nova 1.0. Key improvements include:
+ **Enhanced reasoning abilities** with explicit reasoning mode support
+ **Broader multilingual performance** across additional languages
+ **Improved performance on complex tasks** including coding and tool use
+ **Extended context handling** with better accuracy and stability at longer context lengths

## When to Use Nova 1.0 vs. Nova 2.0
<a name="nova-model-selection"></a>

Choose Amazon Nova 2.0 when:
+ Superior performance with advanced reasoning capabilities is needed
+ Multilingual support or complex task handling is required
+ Better results on coding, tool calling, or analytical tasks are needed

# SFT on Nova 2.0
<a name="nova-sft-2-fine-tune"></a>

Amazon Nova Lite 2.0 brings enhanced capabilities for supervised fine-tuning, including advanced reasoning mode, improved multimodal understanding, and extended context handling. SFT on Nova 2.0 enables you to adapt these powerful capabilities to your specific use cases while maintaining the model's superior performance on complex tasks.

Key features of SFT on Nova 2.0 include:
+ **Reasoning mode support**: Train models to generate explicit reasoning traces before final answers for enhanced analytical capabilities.
+ **Advanced multimodal training**: Fine-tune on document understanding (PDF), video understanding, and image-based tasks with improved accuracy.
+ **Tool calling capabilities**: Train models to effectively use external tools and function calling for complex workflows.
+ **Extended context support**: Leverage longer context windows with better stability and accuracy for document-intensive applications.

**Note**  
For more information on which container images or example recipes to use, go to [Amazon Nova recipes](nova-model-recipes.md).

**Topics**
+ [Reasoning Mode Selection (Nova 2.0 Only)](#nova-sft-2-reasoning-mode)
+ [Tool calling data format](#nova-sft-2-tool-calling)
+ [Document understanding data format](#nova-sft-2-document-understanding)
+ [Video Understanding for SFT](#nova-sft-2-video-understanding)
+ [Data Upload Instructions](#nova-sft-2-data-upload)
+ [Creating a Fine-Tuning Job](#nova-sft-2-creating-job)
+ [SFT Tuning Parameters](#nova-sft-2-tuning-parameters)
+ [Hyperparameter Guidance](#nova-sft-2-hyperparameters)

## Sample SFT recipe
<a name="nova-sft-2-sample-recipe"></a>

Below is a sample recipe for SFT. You can find this recipe and others in the [recipes](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova) repository.

```
run:
  name: my-full-rank-sft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://my-bucket-name/train.jsonl  # HyperPod job-specific; not compatible with standard SageMaker Training Jobs
  replicas: 4                                     # Number of compute instances for training, allowed values are 4, 8, 16, 32
  output_s3_path: s3://my-bucket-name/outputs/    # Output artifact path (HyperPod job-specific; not compatible with standard SageMaker Training Jobs)
  mlflow_tracking_uri: ""                         # Required for MLFlow
  mlflow_experiment_name: "my-full-rank-sft-experiment"  # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-full-rank-sft-run"         # Optional for MLFlow. Note: leave this field non-empty

training_config:
  max_steps: 100                    # Maximum training steps. Minimum is 4.
  save_steps: ${oc.select:training_config.max_steps}  # Save a checkpoint every this many training steps
  save_top_k: 5                     # Keep the top K best checkpoints. Supported only for HyperPod jobs. Minimum is 1.
  max_length: 32768                 # Sequence length (options: 8192, 16384, 32768 [default], 65536)
  global_batch_size: 32             # Global batch size (options: 32, 64, 128)
  reasoning_enabled: true           # Set to true if data has reasoningContent; otherwise false

  lr_scheduler:
    warmup_steps: 15                # Learning rate warmup steps. Recommend 15% of max_steps
    min_lr: 1e-6                    # Minimum learning rate, must be between 0.0 and 1.0

  optim_config:                     # Optimizer settings
    lr: 1e-5                        # Learning rate, must be between 0.0 and 1.0
    weight_decay: 0.0               # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                  # Exponential decay rate for first-moment estimates
    adam_beta2: 0.95                 # Exponential decay rate for second-moment estimates

  peft:                             # Parameter-efficient fine-tuning (LoRA)
    peft_scheme: "null"             # Set to "null" to disable LoRA (full-rank fine-tuning)
```

## Reasoning Mode Selection (Nova 2.0 Only)
<a name="nova-sft-2-reasoning-mode"></a>

Amazon Nova 2.0 supports reasoning mode for enhanced analytical capabilities:
+ **Reasoning Mode (enabled)**:
  + Set `reasoning_enabled: true` in the training configuration
  + Model trains to generate reasoning traces before final answers
  + Improves performance on complex reasoning tasks
+ **Non-Reasoning Mode (disabled)**:
  + Set `reasoning_enabled: false` or omit the parameter (default)
  + Standard SFT without explicit reasoning
  + Suitable for tasks that don't benefit from step-by-step reasoning

**Note**  
When reasoning is enabled, it operates at high reasoning effort. There is no low reasoning option for SFT.
Multimodal reasoning content is not supported for SFT. Reasoning mode applies to text-only inputs.

### Using reasoning mode with non-reasoning datasets
<a name="nova-sft-2-reasoning-non-reasoning-data"></a>

Training Amazon Nova on a non-reasoning dataset with `reasoning_enabled: true` is permitted. However, doing so may cause the model to lose its reasoning capabilities, as Amazon Nova primarily learns to generate the responses presented in the data without applying reasoning.

If you are training Amazon Nova on a non-reasoning dataset but still want to use reasoning during inference:

1. Disable reasoning during training (`reasoning_enabled: false`)

1. Enable reasoning later during inference

While this approach allows reasoning at inference time, it does not guarantee improved performance compared to inference without reasoning.

**Best practice:** Enable reasoning for both training and inference when using reasoning datasets, and disable it for both when using non-reasoning datasets.

**Note**  
For more information on which container images or example recipes to use, go to [Amazon Nova recipes](nova-model-recipes.md).

## Tool calling data format
<a name="nova-sft-2-tool-calling"></a>

SFT supports training models to use tools (function calling). Below is a sample input format for tool calling:

**Sample input:**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are an expert in composing function calls."
    }
  ],
  "toolConfig": {
    "tools": [
      {
        "toolSpec": {
          "name": "getItemCost",
          "description": "Retrieve the cost of an item from the catalog",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "item_name": {
                  "type": "string",
                  "description": "The name of the item to retrieve cost for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to retrieve cost for"
                }
              },
              "required": [
                "item_id"
              ]
            }
          }
        }
      },
      {
        "toolSpec": {
          "name": "getItemAvailability",
          "description": "Retrieve whether an item is available in a given location",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "zipcode": {
                  "type": "string",
                  "description": "The zipcode of the location to check in"
                },
                "quantity": {
                  "type": "integer",
                  "description": "The number of items to check availability for"
                },
                "item_id": {
                  "type": "string",
                  "description": "The ASIN of item to check availability for"
                }
              },
              "required": [
                "item_id", "zipcode"
              ]
            }
          }
        }
      }
    ]
  },
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "I need to check whether there are twenty pieces of the following item available. Here is the item ASIN on Amazon: id-123. Please check for the zipcode 94086"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "reasoningContent": {
            "reasoningText": {
              "text": "The user wants to check how many pieces of the item with ASIN id-123 are available in the zipcode 94086"
            }
          }
        },
        {
          "toolUse": {
            "toolUseId": "getItemAvailability_0",
            "name": "getItemAvailability",
            "input": {
              "zipcode": "94086",
              "quantity": 20,
              "item_id": "id-123"
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "toolResult": {
            "toolUseId": "getItemAvailability_0",
            "content": [
              {
                "text": "[{\"name\": \"getItemAvailability\", \"results\": {\"availability\": true}}]"
              }
            ]
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Yes, there are twenty pieces of item id-123 available at 94086. Would you like to place an order or know the total cost?"
        }
      ]
    }
  ]
}
```

Important considerations for tool calling data:
+ ToolUse must appear in assistant turns only
+ ToolResult must appear in user turns only
+ ToolResult should be text or JSON only; other modalities are not currently supported for Amazon Nova models
+ The inputSchema within the toolSpec must be a valid JSON Schema object
+ Each ToolResult must reference a valid toolUseId from a preceding assistant ToolUse, with each toolUseId used exactly once per conversation
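These pairing rules can be checked programmatically before training. The following sketch walks the `messages` array of one sample (as in the example above); it is a simplified illustration, not an official validator:

```python
# Check the toolUse/toolResult rules: toolUse only in assistant turns,
# toolResult only in user turns, and each toolUseId answered exactly once.
def check_tool_calls(messages):
    problems, pending = [], set()
    for i, msg in enumerate(messages):
        for block in msg.get("content", []):
            if "toolUse" in block:
                if msg["role"] != "assistant":
                    problems.append(f"message {i}: toolUse outside assistant turn")
                pending.add(block["toolUse"]["toolUseId"])
            if "toolResult" in block:
                if msg["role"] != "user":
                    problems.append(f"message {i}: toolResult outside user turn")
                tid = block["toolResult"]["toolUseId"]
                if tid not in pending:
                    problems.append(f"message {i}: toolResult references unknown or reused id {tid}")
                pending.discard(tid)  # each toolUseId may be answered once
    return problems

messages = [
    {"role": "assistant", "content": [{"toolUse": {"toolUseId": "getItemAvailability_0",
                                                   "name": "getItemAvailability",
                                                   "input": {"item_id": "id-123", "zipcode": "94086"}}}]},
    {"role": "user", "content": [{"toolResult": {"toolUseId": "getItemAvailability_0",
                                                 "content": [{"text": "[{\"availability\": true}]"}]}}]},
]
print(check_tool_calls(messages))  # an empty list means the pairing rules hold
```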

**Note**  
For more information on which container images or example recipes to use, go to [Amazon Nova recipes](nova-model-recipes.md).

## Document understanding data format
<a name="nova-sft-2-document-understanding"></a>

SFT supports training models on document understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "document": {
            "format": "pdf",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/documents/customer_service_debugging.pdf",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the document to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for document understanding:
+ Only PDF files are supported
+ Maximum document size is 10 MB
+ A sample can contain documents and text, but cannot mix documents with other modalities (such as images or video)

**Note**  
For more information on which container images or example recipes to use, go to [Amazon Nova recipes](nova-model-recipes.md).

## Video Understanding for SFT
<a name="nova-sft-2-video-understanding"></a>

SFT supports fine-tuning models for video understanding tasks. Below is a sample input format:

**Sample input**

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What are the ways in which a customer can experience issues during checkout on Amazon?"
        },
        {
          "video": {
            "format": "mp4",
            "source": {
              "s3Location": {
                "uri": "s3://my-bucket-name/path/to/videos/customer_service_debugging.mp4",
                "bucketOwner": "123456789012"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Customers can experience issues with 1. Data entry, 2. Payment methods, 3. Connectivity while placing the order. Which one would you like to dive into?"
        }
      ],
      "reasoning_content": [
        {
          "text": "I need to find the relevant section in the video to answer the question.",
          "type": "text"
        }
      ]
    }
  ]
}
```

Important considerations for video understanding:
+ Videos can be a maximum of 50 MB
+ Videos can be up to 15 minutes long
+ Only one video is allowed per sample; multiple videos in the same sample are not supported
+ A sample can contain video and text, but cannot mix video with other modalities (such as images or documents)

**Note**  
For more information about which container images or example recipes to use, see [Amazon Nova recipes](nova-model-recipes.md).

## Data Upload Instructions
<a name="nova-sft-2-data-upload"></a>

Upload training and validation datasets to an S3 bucket. Specify these locations in the recipe's `run` block:

```
## Run config
run:
  ...
  data_s3_path: "s3://<bucket-name>/<training-directory>/<training-file>.jsonl"
```

**Note**: Replace `<bucket-name>`, `<training-directory>`, and `<training-file>` with your actual S3 bucket, prefix, and file names.

**Note**: Validation datasets are not currently supported for SFT with Amazon Nova 2.0. If a validation dataset is provided, it will be ignored.
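
Before uploading, it can be worth confirming that the file really is one JSON object per line with the expected schema; a minimal check (the `schemaVersion` value matches the samples shown earlier):

```python
import json

def validate_training_file(path: str) -> list[str]:
    """Return a list of problems found in a JSONL training file."""
    problems = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_no}: not valid JSON")
                continue
            if record.get("schemaVersion") != "bedrock-conversation-2024":
                problems.append(f"line {line_no}: unexpected schemaVersion")
            if not record.get("messages"):
                problems.append(f"line {line_no}: missing messages")
    return problems
```

An empty return value means the file passed these basic checks; it does not guarantee the per-modality constraints described earlier.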

## Creating a Fine-Tuning Job
<a name="nova-sft-2-creating-job"></a>

Define the base model using the `model_type` and `model_name_or_path` fields in the `run` block:

```
## Run config
run:
  ...
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  ...
```

## SFT Tuning Parameters
<a name="nova-sft-2-tuning-parameters"></a>

The parameters that are available for tuning with SFT include:

**Run Configuration**  

+ **name**: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
+ **model_type**: The Amazon Nova model variant to use. The available option is `amazon.nova-2-lite-v1:0:256k`.
+ **model_name_or_path**: The path to the base model to use for your training. The available options are `nova-lite-2/prod` or the S3 path for a post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
+ **replicas**: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Lite 2.0 supports 4, 8, 16, or 32 replicas.
+ **data_s3_path**: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster.
+ **validation_data_s3_path**: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster.
+ **output_s3_path**: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and Region.
+ **mlflow_tracking_uri**: The ARN of the MLflow app to use for MLflow logging.
+ **mlflow_experiment_name**: The MLflow experiment name.
+ **mlflow_run_name**: The MLflow run name.

**Training Configuration**  

+ **max_steps**: The number of training steps to run. Each step trains the model on `global_batch_size` samples.
+ **save_steps**: The frequency (in steps) at which to save model checkpoints during training.
+ **save_top_k**: The maximum number of best checkpoints to retain based on validation metrics.
+ **max_length**: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32,768 tokens for SFT.

  Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **global_batch_size**: The total number of training samples processed together in one forward or backward pass across all devices and workers.

  This value is the product of the per-device batch size and the number of devices. It affects training stability and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.
+ **reasoning_enabled**: Boolean flag to enable reasoning capabilities during training.
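
Because each step consumes a global batch of samples, you can back out how many passes over your dataset a given step budget implies; a quick sanity check (the function name is illustrative):

```python
def implied_epochs(max_steps: int, global_batch_size: int, dataset_size: int) -> float:
    """Number of passes over the dataset that a step budget implies."""
    return (max_steps * global_batch_size) / dataset_size

# 100 steps at a global batch size of 256 over 12,800 samples is 2 full epochs
print(implied_epochs(100, 256, 12_800))  # 2.0
```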

**Learning Rate Scheduler**  

+ **warmup_steps**: The number of steps over which the learning rate gradually increases. This improves training stability.
+ **min_lr**: The minimum learning rate at the end of decay. Valid values are between 0 and 1, inclusive, and must be less than the learning rate.

**Optimizer Configuration**  

+ **lr**: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0 and 1, inclusive.
+ **weight_decay**: The L2 regularization strength. Higher values (0.01 to 0.1) increase regularization.
+ **adam_beta1**: The exponential decay rate for the first moment estimates in the Adam optimizer. The default is 0.9.
+ **adam_beta2**: The exponential decay rate for the second moment estimates in the Adam optimizer. The default is 0.95.

**PEFT Configuration**  

+ **peft_scheme**: The parameter-efficient fine-tuning scheme to use. Options are `'null'` for full-rank fine-tuning or `lora` for LoRA-based fine-tuning.

**LoRA Tuning (when peft_scheme is 'lora')**  

+ **alpha**: The LoRA scaling parameter. Controls the magnitude of the low-rank adaptation. Typical values range from 8 to 128.
+ **lora_plus_lr_ratio**: The learning rate ratio for LoRA+ optimization. This multiplier adjusts the learning rate specifically for LoRA parameters.

## Hyperparameter Guidance
<a name="nova-sft-2-hyperparameters"></a>

Use the following recommended hyperparameters based on the training approach:

**Full Rank Training**
+ **Epochs**: 1
+ **Learning rate (lr)**: 1e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**LoRA (Low-Rank Adaptation)**
+ **Epochs**: 2
+ **Learning rate (lr)**: 5e-5
+ **Minimum learning rate (min_lr)**: 1e-6

**Note**: Adjust these values based on dataset size and validation performance. Monitor training metrics to prevent overfitting.

# Reinforcement Fine-Tuning (RFT) on SageMaker HyperPod
<a name="nova-hp-rft"></a>

Reinforcement Fine-Tuning (RFT) is a machine learning technique that improves model performance through feedback signals—measurable scores or rewards indicating response quality—rather than direct supervision with exact correct answers. Unlike traditional supervised fine-tuning that learns from input-output pairs, RFT uses reward functions to evaluate model responses and iteratively optimizes the model to maximize these rewards.

This approach is particularly effective for tasks where defining the exact correct output is challenging, but you can reliably measure response quality. RFT enables models to learn complex behaviors and preferences through trial and feedback, making it ideal for applications requiring nuanced decision-making, creative problem-solving, or adherence to specific quality criteria that can be programmatically evaluated.

**When to use RFT**  
Use RFT when you can define clear, measurable success criteria but struggle to provide exact correct outputs for training. It's ideal for tasks where quality is subjective or multifaceted—such as creative writing, code optimization, or complex reasoning—where multiple valid solutions exist but some are clearly better than others.

RFT works best when you have the following:
+ A reliable reward function that can evaluate model outputs programmatically
+ Need to align model behavior with specific preferences or constraints
+ Situations where traditional supervised fine-tuning falls short because collecting high-quality labeled examples is expensive or impractical

Consider RFT for applications requiring iterative improvement, personalization, or adherence to complex business rules that can be encoded as reward signals.

**What RFT is best suited for**  
RFT excels in domains where output quality can be objectively measured but optimal responses are difficult to define upfront:
+ **Mathematical problem-solving**: Verifiable correctness with multiple solution paths
+ **Code generation and optimization**: Testable execution results and performance metrics
+ **Scientific reasoning tasks**: Logical consistency and factual accuracy
+ **Structured data analysis**: Programmatically verifiable outputs
+ **Multi-step reasoning**: Tasks requiring step-by-step logical progression
+ **Tool usage and API calls**: Success measurable by execution results
+ **Complex workflows**: Adherence to specific constraints and business rules

RFT works exceptionally well when you need to balance multiple competing objectives like accuracy, efficiency, and style.

**When to use reasoning mode for RFT training**  
Amazon Nova 2.0 supports reasoning mode during RFT training. The following modes are available:
+ **none**: No reasoning (omit the `reasoning_effort` field)
+ **low**: Minimal reasoning overhead
+ **high**: Maximum reasoning capability (default when `reasoning_effort` is specified)

**Note**  
There is no medium option for RFT. If the `reasoning_effort` field is absent from your configuration, reasoning is disabled.

Use high reasoning for the following:
+ Complex analytical tasks
+ Mathematical problem-solving
+ Multi-step logical deduction
+ Tasks where step-by-step thinking adds value

Use none (omit `reasoning_effort`) or low reasoning for the following:
+ Simple factual queries
+ Direct classifications
+ Speed and cost optimization
+ Straightforward question-answering

**Important**  
Higher reasoning modes increase training time, inference latency, and cost, but they also increase the model's capability on complex reasoning tasks.

**Supported models**  
RFT on SageMaker HyperPod supports Amazon Nova Lite 2.0 (`amazon.nova-2-lite-v1:0:256k`).

**Major steps**  
The RFT process involves four key phases:
+ **Implementing an evaluator**: Create a reward function to programmatically score model responses based on your quality criteria.
+ **Uploading prompts**: Prepare and upload training data in the specified conversational format with reference data for evaluation.
+ **Starting a job**: Launch the reinforcement fine-tuning process with your configured parameters.
+ **Monitoring**: Track training progress through metrics dashboards to ensure the model learns effectively.

Each step builds on the previous one, with the evaluator serving as the foundation that guides the entire training process by providing consistent feedback signals.

**Topics**
+ [RFT on Nova 2.0](nova-hp-rft-nova2.md)

# RFT on Nova 2.0
<a name="nova-hp-rft-nova2"></a>

RFT training data follows the OpenAI conversational format. Each training example is a JSON object containing messages, reference answers, and optional tool definitions. This section provides guidance on preparing effective training data for RFT on Nova 2.0.

**Topics**
+ [Data format and structure](#nova-hp-rft-data-format)
+ [Field descriptions](#nova-hp-rft-field-descriptions)
+ [Hyperparameter guidance](#nova-hp-rft-monitoring-hyperparams)
+ [Additional properties](#nova-hp-rft-additional-properties)
+ [Dataset size recommendations](#nova-hp-rft-dataset-size)
+ [Characteristics of effective training data](#nova-hp-rft-effective-data)
+ [Monitoring RFT training](nova-hp-rft-monitoring.md)

## Data format and structure
<a name="nova-hp-rft-data-format"></a>

Each training example is a JSON object containing the following:
+ **messages**: An array of conversational turns using system, user, and optionally assistant roles
+ **reference_answer**: Expected output or evaluation criteria for reward calculation
+ **tools** (optional): Array of function definitions available to the model
+ **id** (optional): Unique identifier for tracking and deduplication

Each example must be a single line in your JSONL file: one JSON object per line, with no embedded line breaks.
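
A reliable way to satisfy the one-object-per-line requirement is to serialize each example with `json.dumps`, which never emits newlines (the example content below is illustrative):

```python
import json

# Hypothetical examples; real records follow the structure described in this section.
examples = [
    {
        "id": "math-002",
        "messages": [
            {"role": "system", "content": "You are a math tutor"},
            {"role": "user", "content": "Solve: x + 1 = 3"},
        ],
        "reference_answer": {"solution": "x = 2"},
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")  # one compact JSON object per line
```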

### Example 1: Chemistry problem
<a name="nova-hp-rft-example-chemistry"></a>

The following example shows a chemistry problem with reference answer containing ground truth values:

```
{  
  "id": "chem-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful chemistry assistant"  
    },  
    {  
      "role": "user",  
      "content": "Predict hydrogen bond donors and acceptors for this SMILES: CCN(CC)CCC(=O)c1sc(N)nc1C"  
    }  
  ],  
  "reference_answer": {  
    "donor_bond_counts": 2,  
    "acceptor_bond_counts": 4,  
    "explanation": "Calculated using Lipinski's rule of five: N-H groups (2 donors), N and O atoms with lone pairs (4 acceptors)"  
  }  
}
```

**Note**  
The `reference_answer` field contains ground truth values calculated using domain-specific rules. Your reward function compares the model's predicted values against these reference values to calculate a reward score.
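
As a sketch, scoring logic for this example might award partial credit per matching field. The service-side handler interface and the parsing of the model's text into a dict are not shown; the function name and the 0-to-1 scale are assumptions:

```python
def score_chemistry_answer(predicted: dict, reference_answer: dict) -> float:
    """Fraction of the ground-truth count fields the model predicted exactly."""
    fields = ("donor_bond_counts", "acceptor_bond_counts")
    matches = sum(predicted.get(field) == reference_answer[field] for field in fields)
    return matches / len(fields)

reference = {"donor_bond_counts": 2, "acceptor_bond_counts": 4}
print(score_chemistry_answer({"donor_bond_counts": 2, "acceptor_bond_counts": 4}, reference))  # 1.0
print(score_chemistry_answer({"donor_bond_counts": 2, "acceptor_bond_counts": 3}, reference))  # 0.5
```

Partial credit gives the optimizer a smoother learning signal than an all-or-nothing score.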

### Example 2: Math problem
<a name="nova-hp-rft-example-math"></a>

The following example shows a math problem with solution steps:

```
{  
  "id": "math-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  }  
}
```

### Example 3: Tool usage
<a name="nova-hp-rft-example-tool"></a>

The following example shows tool usage with expected behavior:

```
{  
  "id": "tool-001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a helpful game master assistant"  
    },  
    {  
      "role": "user",  
      "content": "Generate a strength stat for a warrior character. Apply a +2 racial bonus modifier."  
    }  
  ],  
  "tools": [  
    {  
      "type": "function",  
      "function": {  
        "name": "StatRollAPI",  
        "description": "Generates character stats by rolling 4d6, dropping the lowest die result, and applying a modifier.",  
        "parameters": {  
          "type": "object",  
          "properties": {  
            "modifier": {  
              "description": "An integer representing the modifier to apply to the total of the stat roll.",  
              "type": "integer"  
            }  
          },  
          "required": ["modifier"]  
        }  
      }  
    }  
  ],  
  "reference_answer": {  
    "tool_called": "StatRollAPI",  
    "tool_parameters": {  
      "modifier": 2  
    },  
    "expected_behavior": "Call StatRollAPI with modifier=2 and return the calculated stat value"  
  }  
}
```

## Field descriptions
<a name="nova-hp-rft-field-descriptions"></a>


| Field | Description | Additional notes | Required | 
| --- |--- |--- |--- |
| id | Unique identifier for this RFT example | String (for example, "sample-001"). Useful for tracking and deduplication. | No | 
| messages | Ordered list of chat messages that define the prompt and context | Array of objects. Model sees them in order. Typically starts with a system message, then user. | Yes | 
| messages[].role | Who is speaking in the message | Common values: "system", "user" (sometimes "assistant" in other contexts) | No | 
| messages[].content | The text content of the message | Plain string. For system it's instructions, for user it's the task or input. | No | 
| tools | Tool specifications available to the model during this example | Array. Each item defines a tool's interface and metadata. Types may include "function" or "internal". | No | 
| reference_answer | The expected model output for this example | String or object depending on task. Used as target for evaluation or training. | No | 

**Note**  
Any additional custom fields (for example, task_id, difficulty_level, context_data) are not validated and will be passed to your reward function as metadata.

## Hyperparameter guidance
<a name="nova-hp-rft-monitoring-hyperparams"></a>

Use the following recommended hyperparameters based on your training approach:

**General:**
+ Epochs: 1
+ Learning rate (lr): 1e-7
+ Number of generations: 8
+ Max new tokens: 8192
+ Batch size: 256

**LoRA (Low-Rank Adaptation):**
+ LoRA Rank: 32

**Note**  
Adjust these values based on your dataset size and validation performance. Monitor training metrics to prevent overfitting.

## Additional properties
<a name="nova-hp-rft-additional-properties"></a>

The "additionalProperties": true setting allows you to include custom fields beyond the core schema requirements, providing flexibility to add any data your reward function needs for proper evaluation.

### Common additional fields
<a name="nova-hp-rft-common-fields"></a>

You can include the following types of additional fields:

**Metadata:**
+ task_id: Unique identifier for tracking
+ difficulty_level: Problem complexity indicator
+ domain: Subject area or category
+ expected_reasoning_steps: Number of steps in solution

**Evaluation criteria:**
+ evaluation_criteria: Specific grading rubrics
+ custom_scoring_weights: Relative importance of different aspects
+ context_data: Background information for the problem
+ external_references: Links to relevant documentation or resources

### Example with additional properties
<a name="nova-hp-rft-additional-example"></a>

The following example includes custom metadata fields:

```
{  
  "id": "algebra_001",  
  "messages": [  
    {  
      "role": "system",  
      "content": "You are a math tutor"  
    },  
    {  
      "role": "user",  
      "content": "Solve: 2x + 5 = 13"  
    }  
  ],  
  "reference_answer": {  
    "solution": "x = 4",  
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]  
  },  
  "task_id": "algebra_001",  
  "difficulty_level": "easy",  
  "domain": "algebra",  
  "expected_reasoning_steps": 3  
}
```

## Dataset size recommendations
<a name="nova-hp-rft-dataset-size"></a>

### Starting point
<a name="nova-hp-rft-starting-point"></a>

Begin with the following minimum dataset sizes:
+ Minimum 100 training examples
+ Minimum 100 evaluation examples

Prioritize high-quality input data and a reliable reward function that executes consistently on model responses.

### Evaluation-first approach
<a name="nova-hp-rft-evaluation-first"></a>

Before investing in large-scale RFT training, evaluate your model's baseline performance:
+ **High performance (greater than 95% reward)**: RFT may be unnecessary—your model already performs well
+ **Very poor performance (0% reward)**: Switch to SFT first to establish basic capabilities
+ **Moderate performance**: RFT is likely appropriate

This evaluation-first approach ensures your reward function is bug-free and determines if RFT is the right method for your use case. Starting small allows you to get comfortable with the RFT workflow, identify and fix issues early, validate your approach before scaling up, and test reward function reliability. Once validated, you can expand to larger datasets to further improve performance.
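
Expressed as code, the thresholds above look like the following (assuming rewards are normalized to a 0-to-1 scale and averaged over the evaluation set; the function name is illustrative):

```python
def next_step_for(baseline_mean_reward: float) -> str:
    """Map a baseline evaluation score to the evaluation-first recommendation."""
    if baseline_mean_reward > 0.95:
        return "skip RFT: the model already performs well"
    if baseline_mean_reward == 0.0:
        return "run SFT first to establish basic capabilities"
    return "RFT is likely appropriate"
```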

## Characteristics of effective training data
<a name="nova-hp-rft-effective-data"></a>

### Clarity and consistency
<a name="nova-hp-rft-clarity"></a>

Good RFT examples require clear, unambiguous input data that enables accurate reward calculation across different model outputs. Avoid noise in your data, including:
+ Inconsistent formatting
+ Contradictory labels or instructions
+ Ambiguous prompts
+ Conflicting reference answers

Any ambiguity will mislead the training process and cause the model to learn unintended behaviors.

### Diversity
<a name="nova-hp-rft-diversity"></a>

Your dataset should capture the full diversity of production use cases to ensure robust real-world performance. Include:
+ Various problem types and difficulty levels
+ Different input formats and edge cases
+ Representative samples from all expected scenarios

This diversity helps prevent overfitting and ensures the model handles unfamiliar inputs gracefully.

### Reward function considerations
<a name="nova-hp-rft-reward-considerations"></a>

Design your reward function for efficient training:
+ Execute within seconds (not minutes)
+ Parallelize effectively with Lambda
+ Return consistent, reliable scores
+ Handle different types of model outputs gracefully

Fast, scalable reward functions enable rapid iteration and cost-effective experimentation at scale.
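
In practice, handling different model outputs gracefully often means wrapping the scoring logic so a malformed response degrades to a default score instead of failing the rollout. A minimal sketch (the wrapper name, time budget, and default score are illustrative assumptions):

```python
import time

def safe_reward(reward_fn, response: str, reference, budget_s: float = 10.0, default: float = 0.0) -> float:
    """Run a reward function defensively: exceptions fall back to a default
    score, and wall-clock time is logged so slow evaluators stand out."""
    start = time.monotonic()
    try:
        score = float(reward_fn(response, reference))
    except Exception as exc:  # malformed model output, parse errors, and so on
        print(f"reward error: {exc!r}; returning default {default}")
        return default
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        print(f"warning: reward took {elapsed:.1f}s (over the {budget_s}s budget)")
    return score
```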

# Monitoring RFT training
<a name="nova-hp-rft-monitoring"></a>

Monitor key metrics during training to ensure effective learning and identify potential issues early.

**Topics**
+ [Key metrics to track](#nova-hp-rft-monitoring-metrics)
+ [Evaluation after RFT](#nova-hp-rft-monitoring-evaluation)
+ [Using fine-tuned models](#nova-hp-rft-monitoring-checkpoints)
+ [Limitations and best practices](#nova-hp-rft-monitoring-limitations)
+ [Troubleshooting](#nova-hp-rft-monitoring-troubleshooting)

## Key metrics to track
<a name="nova-hp-rft-monitoring-metrics"></a>

Monitor the following metrics using MLflow during training:

**Reward metrics:**
+ **Average reward score**: Overall quality of model responses (should increase over time)
+ **Reward distribution**: Percentage of responses receiving high, medium, and low rewards
+ **Training vs. validation rewards**: Compare to detect overfitting

**Training metrics:**
+ **Policy updates**: Number of successful weight updates
+ **Rollout completion rate**: Percentage of samples successfully evaluated

**Concerning patterns:**
+ Rewards plateauing (indicates poor learning)
+ Validation rewards dropping while training rewards increase (overfitting)
+ Reward variance increasing significantly over time (instability)
+ High percentage of reward function errors (implementation issues)

**When to stop training:**
+ Target performance metrics are achieved
+ Rewards plateau and no longer improve
+ Validation performance degrades (overfitting detected)
+ Maximum training budget is reached

## Evaluation after RFT
<a name="nova-hp-rft-monitoring-evaluation"></a>

After training completes, evaluate your fine-tuned model to assess performance improvements:
+ **Run RFT evaluation job**: Use the checkpoint from your RFT training as the model
+ **Compare to baseline**: Evaluate both base model and fine-tuned model on the same test set
+ **Analyze metrics**: Review task-specific metrics (accuracy, reward scores, etc.)
+ **Conduct qualitative review**: Manually inspect sample outputs for quality

For detailed evaluation procedures, see the Evaluation section.

## Using fine-tuned models
<a name="nova-hp-rft-monitoring-checkpoints"></a>

**Accessing checkpoints:**

After training completes, locate your checkpoint:

1. Navigate to your `output_path` in S3

1. Download and extract `output.tar.gz`

1. Open `manifest.json`

1. Copy the `checkpoint_s3_bucket` value
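
These steps can be scripted; a sketch assuming `manifest.json` sits at the top level of the downloaded archive and exposes `checkpoint_s3_bucket` as described above:

```python
import json
import tarfile

def checkpoint_path_from_archive(tar_path: str) -> str:
    """Read manifest.json out of the downloaded output.tar.gz and return
    the checkpoint S3 location recorded by the training job."""
    with tarfile.open(tar_path, "r:gz") as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("manifest.json"))
        manifest = json.load(tar.extractfile(member))
    return manifest["checkpoint_s3_bucket"]
```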

**Deploying for inference:**

Use the checkpoint S3 path for inference or further training:

```
run:
    model_type: amazon.nova-2-lite-v1:0:256k
    model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

For deployment and inference instructions, refer to the Inference section.

## Limitations and best practices
<a name="nova-hp-rft-monitoring-limitations"></a>

**Current limitations:**

**Beta restrictions:**
+ You need to create a new RIG (restricted instance group) for RFT. This limitation will be resolved by GA.
+ Instance type requirements: Only P5 instances are supported (minimum 8x p5.48xlarge). Coming soon: support for smaller instance types (ETA: mid-January 2025).

**Functional limitations:**
+ 15-minute Lambda timeout: Reward functions must complete within 15 minutes
+ Single-turn only: Multi-turn conversations are not supported
+ Validation datasets: Not supported during training. Use separate evaluation jobs to assess training progress.

**Training considerations:**
+ Low reward scenarios: RFT may struggle when less than 5% of examples receive positive rewards; consider SFT first
+ Data requirements: Needs sufficient diversity to learn effectively
+ Computational cost: More expensive than supervised fine-tuning

**Nova Forge removes some of these limitations:**
+ Supports multi-turn conversations
+ Allows reward functions exceeding 15-minute timeouts
+ Provides advanced algorithms and tuning options
+ Designed for complex enterprise use cases, specifically tuned to build frontier models

**Best practices:**

**Start small and scale:**
+ Begin with minimal datasets (100-200 examples) and few training epochs
+ Validate your approach before scaling up
+ Gradually increase dataset size and training steps based on results

**Baseline with SFT first:**
+ If reward scores are consistently low (e.g., always 0), perform SFT before RFT
+ RFT requires reasonable baseline performance to improve effectively

**Design efficient reward functions:**
+ Execute in seconds, not minutes
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement proper error handling
+ Test thoroughly before training
+ Leverage Lambda's parallel scaling capabilities

**Monitor training actively:**
+ Track average reward scores over time
+ Watch reward distribution across samples
+ Compare training vs. validation rewards
+ Look for concerning patterns (plateaus, overfitting, instability)

**Iterate based on results:**
+ If rewards don't improve after several iterations, adjust reward function design
+ Increase dataset diversity to provide clearer learning signals
+ Consider switching to SFT if rewards remain near zero
+ Experiment with different hyperparameters (learning rate, batch size)

**Optimize data quality:**
+ Ensure diverse, representative examples
+ Include edge cases and difficult samples
+ Verify reward function correctly scores all example types
+ Remove or fix samples that confuse the reward function

## Troubleshooting
<a name="nova-hp-rft-monitoring-troubleshooting"></a>

**Reward function errors:**

Symptoms: High error rate in reward function calls during training


| Issue | Symptoms | Resolution | 
| --- |--- |--- |
| Lambda timeout | Frequent timeouts after 15 minutes | Optimize function performance; consider Nova Forge for complex evaluations | 
| Insufficient concurrency | Lambda throttling errors | Increase `lambda_concurrency_limit` or request a quota increase | 
| Invalid return format | Training fails with format errors | Verify return structure matches required interface format | 
| Unhandled exceptions | Intermittent errors | Add comprehensive error handling and logging | 
| External API failures | Inconsistent scoring | Implement retry logic and fallback strategies | 
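
For the external API failures row, retry logic with a fallback score can be sketched as follows (the function name and backoff values are illustrative):

```python
import time

def call_with_retries(call, attempts: int = 3, backoff_s: float = 1.0, fallback=None):
    """Retry a flaky external call with exponential backoff; return a
    fallback value instead of raising so one API hiccup does not produce
    an inconsistent reward score."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(backoff_s * 2 ** attempt)
    return fallback
```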

**Poor training performance:**

Symptoms: Rewards not improving or plateauing at low values

Resolutions:
+ **Verify reward function correctness**: Test with known good/bad examples
+ **Check baseline performance**: Evaluate base model; if near-zero accuracy, do SFT first
+ **Increase data diversity**: Add more varied examples covering different scenarios
+ **Adjust hyperparameters**: Try different learning rates or batch sizes
+ **Review reward signal quality**: Ensure rewards differentiate between good and bad responses

**Overfitting:**

Symptoms: Training rewards increase while validation rewards decrease

Resolutions:
+ **Reduce training steps**: Stop training earlier
+ **Increase dataset size**: Add more training examples
+ **Add regularization**: Adjust `weight_decay` or `entropy_coeff`
+ **Increase data diversity**: Ensure training set represents full distribution