

# Amazon Nova Forge

Amazon Nova Forge is a first-of-its-kind service that offers organizations the easiest and most cost-effective way to build their own frontier models using Nova.

Amazon Nova Forge introduces the concept of "open training" models, which give organizations access to a variety of early model checkpoints and the ability to blend proprietary data with Amazon-curated data sets at every stage of model training. This allows the models to maximize learning from proprietary data while minimizing risk of forgetting foundational skills like reasoning.

Nova Forge provides the following key capabilities:
+ Access checkpoints across all phases of model development, and leverage new Nova models before they are widely available
+ Blend your proprietary data with Amazon Nova-curated training data
+ Perform reinforcement learning with reward functions in your environment
+ Use push-button recipes that are optimized to build with Nova through visual workflows or a command line interface
+ Use the Responsible AI Toolkit to align models to Amazon Nova's responsible AI guidelines during the training process and implement runtime controls to moderate model responses during inference.

## Prerequisites


**Topics**
+ [Subscribe to Nova Forge](#nova-forge-prereq-access)
+ [Other prerequisites](#nova-forge-prereq-other)

### Subscribe to Nova Forge


To request access to the Amazon Nova Forge service, add the following tag to your console IAM role: key `forge-subscription` with value `true`. After you've added this tag to your role, go to the SageMaker AI console, choose Model training and customization, and then choose Nova Forge. On this page you'll find details about the service, its pricing, and its capabilities. You can request a subscription and then manage it from this page.

1. The role must have permission to call the `ListAttachedRolePolicies` API, and the response must include either the `AdministratorAccess` or the `AmazonSageMakerFullAccess` policy.

1. The sign-in role must have permission to call the `ListRoleTags` API, and the response tags must include `tag.key=forge-subscription`.
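These two checks can be sketched in code. A minimal sketch that operates on responses in the shape returned by the IAM `ListRoleTags` and `ListAttachedRolePolicies` APIs (the sample responses below are illustrative, not real account data):

```python
# Minimal sketch of the two subscription prerequisites, operating on
# responses in the shape returned by IAM ListRoleTags / ListAttachedRolePolicies.
# The sample responses below are illustrative, not real account data.

REQUIRED_POLICIES = {"AdministratorAccess", "AmazonSageMakerFullAccess"}

def has_required_policy(list_attached_role_policies_response):
    """True if at least one of the required managed policies is attached."""
    names = {p["PolicyName"]
             for p in list_attached_role_policies_response["AttachedPolicies"]}
    return bool(names & REQUIRED_POLICIES)

def has_forge_subscription_tag(list_role_tags_response):
    """True if the forge-subscription tag is present on the role."""
    return any(t["Key"] == "forge-subscription"
               for t in list_role_tags_response["Tags"])

# Illustrative responses:
policies = {"AttachedPolicies": [{"PolicyName": "AmazonSageMakerFullAccess"}]}
tags = {"Tags": [{"Key": "forge-subscription", "Value": "true"}]}

print(has_required_policy(policies) and has_forge_subscription_tag(tags))  # True
```

If either check returns `False`, revisit the role's attached policies and tags before requesting the subscription.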

### Other prerequisites


Also ensure the following prerequisites are complete:

1. [General prerequisites](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model.html#nova-model-general-prerequisites)

1. Additional steps for users: add a Restricted Instance Group (RIG) to your SageMaker HyperPod cluster by following the steps [here](https://docs.aws.amazon.com//sagemaker/latest/dg/nova-hp-cluster.html)

# Nova Forge access and setup

To get onboarded to Nova Forge, follow this 2-step process:
+ Step 1: Subscribe to Nova Forge
+ Step 2: Set up HyperPod infrastructure

## Getting the Nova Forge documents


To get the Nova Forge documents, follow these steps:

```
mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI

# Download the Nova Forge documents and CLI package from S3
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive

# Install the package in editable mode
pip install -e .
```

## Step 1: Subscribe to Nova Forge


### Quick Summary


1. Verify that you have administrator access to the Amazon Web Services account.

1. Navigate to the SageMaker AI console and request access to Nova Forge.

1. Wait for the Nova team to email a confirmation after your subscription request is approved.

1. Tag your execution role with the `forge-subscription` tag. This tag is required to access Nova Forge features and checkpoints. Add the following tag to your execution role:
   + Key: `forge-subscription`
   + Value: `true`

### Detailed Guide


To subscribe to Nova Forge and effectively use the customization service, an Amazon Web Services customer must have admin access to their Amazon Web Services account or have their administrator grant them admin access. This document outlines the steps required to:
+ Secure admin access
+ Set up policies to subscribe to Nova Forge
+ Access customization recipes
+ Configure customization
+ Monitor the workflow
+ Evaluate the customized model checkpoint

### Option A


Flow 1: The account user must reach out to the account admin to request the following:
+ Add the `forge-subscription` tag to the account through IAM (see Appendix A for steps).
+ Add the `ListRoleTags` and `ListAttachedRolePolicies` permissions through IAM (see Appendix B for steps).

![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/Onboarding-option-a.png)


### Option B


Flow: The account user must reach out to the account admin to request admin access to the account.
+ Once admin access is granted, follow the steps in Flow 2.

### Flow 2. Amazon Web Services account with admin access

+ Add the `forge-subscription` tag to the account through IAM (see the steps in Appendix A).

### Appendix A. Add forge-subscription policy to Amazon Web Services account


1. Go to the Amazon Web Services IAM Dashboard. Click on Roles on the left. Search for admin and click on the admin role  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/add-forge-sub-policy.png)

1. Select <AssumedRoleToUse> (e.g., libsAdminAccess). Click on the Tags tab.  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/add-forge-sub-policy-2.png)

1. Click on Manage tags. Add a new tag. Type "forge-subscription" under Key and click on Save changes  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/add-forge-sub-tag-policy.png)

1. Ensure that you see forge-subscription as a key in Tags section  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/forge-tag-policy-verify.png)

### Appendix B. Add ListRoleTags and ListAttachedRolePolicies policies to Amazon Web Services account for Non-Admin Role by Admin


1. Go to the Amazon Web Services IAM Dashboard. Click on Roles on the left. Search for <AssumedRoleToUse> (e.g., ForgeAccessRole) and click on the <AssumedRoleToUse> (e.g., ForgeAccessRole) role  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/forge-list-tags-policy.png)

1. Click on the <AssumedRoleToUse> (e.g., ForgeAccessRole) role and select Tags. Add a new tag with key "forge-subscription"  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/forge-tag-appendix.png)

1. Under Permissions, add a new permission: Add Permissions → Create inline policy → Add the following policy

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "VisualEditor0",
               "Effect": "Allow",
               "Action": [
                   "iam:ListRoleTags",
                   "iam:ListAttachedRolePolicies"
               ],
               "Resource": "*"
           }
       ]
   }
   ```  
![\[alt text not found\]](http://docs.aws.amazon.com/nova/latest/nova2-userguide/images/forge-add-tag-polices-example.png)

## Step 2. Set up HyperPod infrastructure


Set up the necessary infrastructure by following the [workshop instructions](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US) for configuring the environment with Forge-enabled features.

## Content moderation settings


With Nova Forge access, customizable content moderation settings (CCMS) are available for the Amazon Nova Lite 1.0 and Pro 1.0 models. CCMS allows adjustment of content moderation controls to align with specific business requirements while maintaining essential responsible AI safeguards. To determine whether your use case is appropriate for CCMS, contact an AWS Account Manager.

For additional information on configuring and using CCMS with custom models, see the [Responsible AI Toolkit and Content Moderation section](nova-responsible-ai-toolkit.md).

# Continued Pre-Training and Mid-Training


**Note**  
Detailed documentation is provided after you subscribe.

Nova Forge CPT offers advanced capabilities beyond standard CPT, including access to intermediate checkpoints and data mixing with Nova's pre-training corpus. These features enable more efficient domain adaptation and better preservation of the model's general capabilities.

## What are intermediate checkpoints and why are they needed?


Intermediate checkpoints are snapshots of the Amazon Nova model saved at different stages of pre-training, before the model reaches its final production-ready state. During model development, Amazon Nova undergoes multiple training phases: initial pre-training with a constant learning rate, learning rate ramp-down, context extension training, and finally instruction-following alignment and safety training.

For CPT, intermediate checkpoints are often preferable to the final Prod checkpoint because they are more plastic and receptive to domain adaptation. The Prod checkpoint has undergone extensive instruction-following alignment and safety training, which optimizes the model for general conversational use but can make it resistant to learning new domain-specific patterns during CPT. In contrast, partially and fully pre-trained text-only checkpoints retain the model's pre-training characteristics. They haven't been heavily steered toward specific behaviors, making them more efficient starting points for domain adaptation.

When performing large-scale CPT (>10B tokens), starting from intermediate checkpoints typically results in faster convergence, better training stability, and more effective domain knowledge acquisition. However, for small-scale CPT (<10B tokens), or when instruction-following capabilities need to be preserved, the Prod checkpoint may be more appropriate as it allows domain adaptation while maintaining the model's conversational abilities.

Multiple intermediate checkpoints are necessary for CPT because they offer different levels of model plasticity that affect how efficiently the model can absorb new domain knowledge. The final Prod checkpoint has undergone extensive instruction-following alignment and safety training, which optimizes it for general conversational use but makes it resistant to learning new domain-specific patterns. In other words, it has been hardened through post-training. In contrast, earlier checkpoints retain the model's pre-training characteristics and haven't been heavily steered toward specific behaviors, making them more plastic and receptive to domain adaptation.

To achieve the best training efficiency, multiple intermediate checkpoints are provided.

## What checkpoints are available?


**Nova 2.0**  
There are three Amazon Nova Lite 2.0 checkpoints.
+ PRE-TRAINED - [`nova-lite-2/pretraining-text-RD`]: This is the checkpoint after the constant learning rate and ramp-down stages of Amazon Nova pre-training where the model is trained on trillions of tokens.
+ MID-TRAINED - [`nova-lite-2/pretraining-text-CE`]: This checkpoint allows intermediate volumes of unstructured data to be introduced with a more conservative learning rate than pre-training, absorbing domain-specific knowledge while avoiding catastrophic forgetting.
+ POST-TRAINED - [`nova-lite-2/prod`]: This is the fully aligned final checkpoint of the model that has gone through all the pretraining and post-training steps.

The following table elaborates on the different conditions for pre- and mid-training.


| Data Type | Perform | With Checkpoint | 
| --- |--- |--- |
| Large-scale unstructured raw domain data (documents, logs, articles, code, etc.) | Continued Pre-Training | Pre-Trained | 
| Large-scale unstructured raw domain data (documents, logs, articles, code, etc.) | Mid-Training | Pre-Trained | 
| Smaller volumes of unstructured raw data. Structured reasoning traces / CoT data | Mid-Training | Mid-Trained | 
| Structured demonstrations (high-quality input-output pairs, curated task instructions, multi-turn dialogues) | Full Fine-Tuning | Mid-Trained | 
| Structured demonstrations (high-quality input-output pairs, curated task instructions, multi-turn dialogues) | Parameter Efficient Fine-Tuning | Post-Trained | 

## Which checkpoint to use?


Partially pre-trained text-only and fully pre-trained text-only checkpoints typically converge faster and require fewer training steps for domain adaptation. However, they have no instruction tuning and would need to undergo post-training steps to be able to perform useful tasks and follow instructions. The GA checkpoint may require more steps to adapt but provides a safer starting point for small-scale experiments and will maintain some of its post-training capabilities even after CPT training.

In general, with large training datasets (>10B tokens), start from partially or fully pre-trained text-only checkpoints for more efficient and stable training, as the model's knowledge base will be substantially modified. With small datasets (<10B tokens), use the GA checkpoint to preserve instruction-following capabilities while adapting to the domain.
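The guidance above can be captured in a small helper. This is only a sketch: the 10B-token threshold and checkpoint names follow this section, but treat the output as a starting point rather than a hard rule.

```python
def recommended_checkpoint(training_tokens, preserve_instruction_following=False):
    """Suggest a starting checkpoint per the >10B / <10B token rule of thumb."""
    TEN_B = 10_000_000_000
    if training_tokens > TEN_B and not preserve_instruction_following:
        # Large-scale CPT: pre-trained checkpoints converge faster and more stably
        return "nova-lite-2/pretraining-text-RD"
    # Small-scale CPT, or when instruction-following must be preserved
    return "nova-lite-2/prod"

print(recommended_checkpoint(50_000_000_000))  # nova-lite-2/pretraining-text-RD
print(recommended_checkpoint(2_000_000_000))   # nova-lite-2/prod
```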

## How to use data mixing for Nova 2.0?


When performing CPT with new domain data, it is highly beneficial to mix the new data with some of the data used previously in the pre-training stage of the model. Mixing old data with new domain data solves two problems:
+ Forgetting control: Prevents catastrophic forgetting by preserving existing skills and knowledge of the model. Without data mixing, training exclusively on narrow domain data causes the model to overwrite general capabilities. For example, a model trained only on legal documents might lose its ability to code or do math. Mixing the general domain datasets preserves these general skills while acquiring the new domain.
+ Optimization stability: Maintains training stability by anchoring the model's internal representations. During CPT, the model's learned features are modified and data mixing provides gradients from diverse sources that guide this adaptation smoothly. Without it, training on narrow distributions can cause gradient instability, where the model's representations shift too drastically, leading to training divergence, loss spikes, or collapse of existing capabilities. This is the stability-plasticity tradeoff: the model should be plastic enough to learn new domain knowledge, but stable enough not to break what it already knows.

**Nova CPT Data Mixing Capabilities**  
Access to Amazon Nova pre-training data and checkpoints is one of the core offerings of the Amazon Nova CPT customization. Amazon Nova CPT customization enables easy mixing of domain data with Amazon Nova's pre-training corpus. Further, the sampling ratio of the specific Amazon Nova data categories (e.g., code, math, reasoning, etc) can be changed and their proportions controlled to complement domain data. This allows reinforcement of capabilities that align with the use case while adapting the model to the specific domain.

**Finding the Optimal Mixing Ratio**  
The optimal ratio of Amazon Nova data versus domain data depends on the dataset's domain, complexity, size, quality, and the importance of maintaining general capabilities. This ratio must be discovered through experimentation. An experiment framework to decide on how much Amazon Nova data to mix is as follows.

Select a representative subset of domain data (e.g., 5B tokens) and keep this constant across all experimental runs.

Run small-scale CPT experiments varying only the amount of Amazon Nova data mixed in:
+ No mixing: 100% domain → 5B domain only (total 5B)
+ Light mixing: 90% domain → 5B domain + 0.56B Amazon Nova (total 5.56B)
+ Medium mixing: 70% domain → 5B domain + 2.14B Amazon Nova (total 7.14B)
+ Heavy mixing: 50% domain → 5B domain + 5B Amazon Nova (total 10B)

Evaluate each checkpoint on in-domain and general-domain benchmarks. Also evaluate the starting checkpoint (the Amazon Nova checkpoint before any training).
+ Does customer-domain performance stay roughly constant across runs? It usually should, since each run saw the same number of domain tokens. If domain performance improves with more mixing, Amazon Nova data provides useful regularization.
+ Do general benchmark scores improve as mixing is increased?
  + Expected behavior is that the general capabilities should improve monotonically as more Amazon Nova data is added.
  + Measure multiple general benchmarks: MMLU (general knowledge), HumanEval (coding), GSM8K (math), or specific benchmarks of interest.
+ Select the mixing ratio that maintains domain performance while delivering acceptable general capabilities for the use cases. Factor in the additional cost of training with more data mixing.

Once the optimal mixing ratio has been identified, run full-scale CPT using the complete domain dataset with the selected mixing ratio.
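The token counts in the experiment grid follow directly from the mixing percentage: for a fixed domain subset, the Amazon Nova tokens to mix in are `domain_tokens × (100 − pct) / pct`. A quick sketch of that arithmetic:

```python
def nova_tokens_for_mix(domain_tokens_b, domain_percent):
    """Amazon Nova tokens (billions) needed so domain data makes up
    domain_percent of the total mixture."""
    return domain_tokens_b * (100 - domain_percent) / domain_percent

# Reproduce the experiment grid for a fixed 5B-token domain subset
for pct in (100, 90, 70, 50):
    nova = nova_tokens_for_mix(5, pct)
    print(f"{pct}% domain: 5B domain + {nova:.2f}B Nova (total {5 + nova:.2f}B)")
```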

## Dissecting the Data Mixing Categories


Below we dissect each category available in data mixing, so you can decide which data categories make the most sense to represent in your overall data mixture.

### How to Enable Data Mixing


Add the `data_mixing` section to your recipe with the appropriate percentage distribution across dataset categories. The `nova_data` percentages must sum to 100.

#### Nova 2.0 Configuration with data mixing


```
# Note:
# This recipe can run on p5.48xlarge

# Run config
display_name: "Nova Lite Pretrain on P5 GPU"
versions: ["2.0"]
instance_types: ["ml.p5.48xlarge"]

run:
  name: "my-cpt-run"     # A descriptive name for your training job
  model_type: "amazon.nova-2-lite-v1:0:256k" # Model variant specification, do not change
  model_name_or_path: "nova-lite-2/prod" # Base model path, do not change
  replicas: 8       # Number of compute instances for training, allowed values are 4, 8, 16, 32
  data_s3_path: ""       # Customer data paths
  validation_data_s3_path: ""        # Customer validation data paths
  output_s3_path: ""   # Output artifact path, SageMaker HyperPod job-specific configuration - not compatible with standard SageMaker Training jobs

## Training specific configs
training_config:
  task_type: cpt
  max_length: 8192              # Maximum context window size (tokens)
  global_batch_size: 64        # Global batch size, allowed values are 32, 64, 128, 256.

  trainer:
    max_steps: 10               # The number of training steps to run total
    val_check_interval: 10      # The number of steps between running validation
    limit_val_batches: 2        # Batches of the validation set to use each trigger

  model:
    hidden_dropout: 0.0           # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0        # Dropout for attention weights, must be between 0.0 and 1.0

  optim:
    optimizer: adam
    lr: 1e-5                      # Learning rate
    name: distributed_fused_adam  # Optimizer algorithm, do not change
    adam_w_mode: true             # Enable AdamW mode
    eps: 1e-06                    # Epsilon for numerical stability
    weight_decay: 0.0             # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9               # Beta1 for Adam optimizer
    adam_beta2: 0.95              # Beta2 for Adam optimizer
    sched:
      warmup_steps: 10            # Learning rate warmup steps
      constant_steps: 0           # Steps at constant learning rate
      min_lr: 1e-6                # Minimum learning rate, must be lower than lr

data_mixing:
  dataset_catalog: cpt_text_lite
  sources:
    nova_data:   # percent inputs for Nova data must sum to 100%; use 0% if you want to exclude a data grouping
      agents: 20
      business-and-finance: 4
      scientific: 10
      code: 5
      factual-and-news: 5
      longform-text: 6
      health-and-medicine: 1
      humanities-and-education: 1
      legal: 1
      math: 9
      additional-languages: 15
      social-and-personal-interest: 11
      entertainment: 0.5
      reasoning: 10
      other: 0.5
      tables: 1
    customer_data: # percent input of customer data. 100 = use only customer data, 0 = use only the nova_data mix above
      percent: 25
```
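Because the `nova_data` percentages must sum to 100, it is worth validating a recipe's mixing block before launching a job. A minimal sketch using the category percentages from the recipe above:

```python
# Category percentages copied from the data_mixing block of the recipe above
nova_data = {
    "agents": 20, "business-and-finance": 4, "scientific": 10, "code": 5,
    "factual-and-news": 5, "longform-text": 6, "health-and-medicine": 1,
    "humanities-and-education": 1, "legal": 1, "math": 9,
    "additional-languages": 15, "social-and-personal-interest": 11,
    "entertainment": 0.5, "reasoning": 10, "other": 0.5, "tables": 1,
}

total = sum(nova_data.values())
assert abs(total - 100) < 1e-9, f"nova_data percentages sum to {total}, expected 100"
print(total)  # 100.0
```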

**What do these categories mean**

**Note**: Nova 2.0 includes additional reasoning-specific categories (e.g., `reasoning-code`, `reasoning-math`, `reasoning-instruction-following`) that are not available in Nova 1.0.

Summary of Categories and Info Labels:


| Category Name | Info detail | 
| --- | --- | 
| agents | Training data focused on autonomous decision-making, task completion, and goal-oriented behavior in AI systems | 
| baseline | Fundamental language data focused on general comprehension, basic communication, and core linguistic capabilities | 
| chat | Conversational exchanges demonstrating natural dialogue flow, context maintenance, and appropriate social interactions | 
| code | Programming source code, documentation, and technical discussions from various programming languages and platforms. | 
| factuality | Reference materials and verified information focused on accuracy, source validation, and truth assessment | 
| identity | Personality frameworks and behavioral patterns focused on consistent character traits, values, and interaction styles | 
| long-context | Extended texts and complex narratives focused on maintaining coherence and relevance across lengthy exchanges | 
| math | Mathematical content including textbooks, problems, solutions, and mathematical discussions. | 
| rai | Cases and scenarios emphasizing ethical AI principles, safety considerations, and responsible technology deployment | 
| instruction-following | Examples of precise task execution based on varying levels of user prompts and directives | 
| stem | Technical content covering science, technology, engineering, and mathematics, including problem-solving and theoretical concepts | 
| planning | Sequences demonstrating strategic thinking, step-by-step task breakdown, and efficient resource allocation | 
| reasoning-chat | Analytical dialogue scenarios focused on logical discussion and structured conversation flows | 
| reasoning-code | Programming challenges and algorithmic problems focused on systematic solution development | 
| reasoning-factuality | Information evaluation scenarios focused on critical assessment and verification processes | 
| reasoning-instruction-following | Complex task analysis focused on systematic interpretation and methodical execution | 
| reasoning-math | Mathematical problem-solving scenarios focused on logical progression and solution strategies | 
| reasoning-planning | Strategic decision-making scenarios focused on systematic approach to goal achievement | 
| reasoning-rag | Information retrieval and synthesis scenarios focused on contextual understanding and relevant application | 
| reasoning-rai | Ethical decision-making scenarios focused on systematic evaluation of AI safety and fairness | 
| reasoning-stem | Scientific problem-solving scenarios focused on methodical analysis and solution development | 
| rag | Examples of effectively combining retrieved external knowledge with generated responses to provide accurate, contextual information | 
| translation | Multi-language content pairs showing accurate translation while preserving context, tone, and cultural nuances | 

#### Parameter Guide

+ **dataset\_catalog:** The only value is `cpt_text_lite` for now, until multimodal training is enabled.
+ **nova\_data:** Percentage of the individual categories of Nova data when mixed in. They must add up to 100.
+ **customer\_data**: The percentage of customer data mixed into the Nova data.

The total number of tokens used in training can be calculated as `max_length` × `global_batch_size` × `max_steps`.
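Applied to the recipe above (`max_length: 8192`, `global_batch_size: 64`, `max_steps: 10`), the token-budget calculation is:

```python
max_length = 8192        # context window (tokens) from the recipe
global_batch_size = 64   # sequences per training step
max_steps = 10           # total training steps

total_tokens = max_length * global_batch_size * max_steps
print(f"{total_tokens:,} tokens")  # 5,242,880 tokens
```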

**Limitations**  
Current CPT only supports text data and does not support any customer multi-modal datasets.

# Supervised Fine-Tuning


## Introduction


Supervised fine-tuning uses a dataset of input-output pairs for the task of interest. In other words, you provide examples of prompts (questions, instructions, etc.) along with the correct or desired responses and continue training the model on these. The model's weights are adjusted to minimize a supervised loss, typically cross-entropy between its predictions and the target response tokens.

## When to use SFT?


SFT is best when you have a well-defined task with clear desired outputs. If you can explicitly say "Given X input, the correct/desired output is Y" and you can gather examples of such X-Y mappings, then supervised fine-tuning is a great choice. Some scenarios where SFT excels include:
+ **Structured or complex classification tasks**: e.g. classifying internal documents or contracts into many custom categories. With SFT, the model can learn these specific categories far better than prompting alone.
+ **Question-answering or transformation tasks with known answers**: e.g. fine-tuning a model to answer questions from a company's knowledge base, or to convert data between formats, where each input has a correct response.
+ **Formatting and style consistency**: If you need the model to always respond in a certain format or tone, you can fine-tune on examples of the correct format/tone. For instance, training on prompt-response pairs that demonstrate a particular brand voice or style can teach the model to reproduce that style in its outputs. Instruction-following behavior is often initially taught via SFT on curated examples of good assistant behavior.

SFT is the most direct way to teach an LLM a new skill or behavior when you can specify what the right behavior looks like. It leverages the model's existing language understanding and focuses it on your task. Do not use SFT when the gap is knowledge rather than behavior; it will not make the model learn new facts, jargon, or recent events. In those cases, prefer continued pre-training on large in-domain corpora or retrieval-augmented generation to bring external knowledge at inference. When you can measure quality but cannot label a single right answer, reinforcement fine-tuning with verifiable rewards or an LLM-as-judge might be preferable to SFT.

Depending on task complexity and performance of the Nova model without tuning, plan for thousands to tens of thousands of demonstrations per task, with data quality, consistency, and diversity mattering more than raw volume.

## When to use parameter efficient and when full rank SFT?


Nova customization recipes enable you to perform parameter-efficient (in particular, LoRA) or full-rank SFT. If you want a straightforward, cost-efficient model update, or have very little data, favor parameter-efficient methods, which train small adapters while leaving most of the backbone untouched (full-rank SFT updates all model parameters).

## Data Mixing for SFT


Data mixing allows you to combine your custom training datasets with Nova's proprietary training data. This feature is available for both Nova 1.0 and Nova 2.0 models.

**Nova Proprietary Data Type**: Nova supports both text and multimodal SFT data types. The data is organized into multiple data categories, each containing a blend of tasks relevant to the corresponding category.

**Nova Proprietary Data Categories**: Text datasets include several categories, such as: autonomous decision making, task completion, and goal-oriented datasets (agents); both reasoning and non-reasoning precise task execution datasets (reasoning-instruction-following, instruction-following); sequences demonstrating strategic thinking and step-by-step task breakdown (planning); responsible AI (rai); long-context; factuality; math; stem; and many more. Similarly, multimodal datasets include video, screenshots, charts, and many more.

The data mixing feature allows you to blend your own fine-tuning training samples with samples from the Nova datasets used to fine-tune Nova. This can prevent overfitting on your custom training data and "catastrophic forgetting" of Nova capabilities, or help you build capabilities when training from a new pretrained checkpoint.

To mix in Nova data, you simply need to add a `data_mixing` block to your recipe YAML file, under the `training_config` section. Text and multimodal data mixing blocks have different content; refer to the corresponding recipes.

### Supported Models

+ Nova 2.0 Lite

### Supported Modality

+ Text
+ Multimodal

## YAML Configuration Examples


### Nova 2.0 Configuration Example


```
run:
  name: my-lora-sft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://my-bucket-name/train.jsonl
  replicas: 4
  output_s3_path: s3://my-bucket-name/outputs/
  mlflow_tracking_uri: ""
  mlflow_experiment_name: "my-lora-sft-experiment"
  mlflow_run_name: "my-lora-sft-run"
  
training_config:
  max_steps: 100
  save_steps: 10
  save_top_k: 5
  max_length: 32768
  global_batch_size: 32
  reasoning_enabled: true
  lr_scheduler:
    warmup_steps: 15
    min_lr: 1e-6
  optim_config:
    lr: 1e-5
    weight_decay: 0.0
    adam_beta1: 0.9
    adam_beta2: 0.95
  peft:
    peft_scheme: "lora"
    lora_tuning:
      alpha: 64
      lora_plus_lr_ratio: 64.0
```

### Nova 2.0 Text Data Mixing


```
data_mixing:
  dataset_catalog: sft_1p5_text_chat
  sources:
    customer_data:
      percent: 50
    nova_data:
      agents: 1
      baseline: 10
      chat: 0.5
      code: 10
      factuality: 0.1
      identity: 1
      long-context: 1
      math: 2
      rai: 1
      instruction-following: 13
      stem: 0.5
      planning: 10
      reasoning-chat: 0.5
      reasoning-code: 0.5
      reasoning-factuality: 0.5
      reasoning-instruction-following: 45
      reasoning-math: 0.5
      reasoning-planning: 0.5
      reasoning-rag: 0.4
      reasoning-rai: 0.5
      reasoning-stem: 0.4
      rag: 1
      translation: 0.1
```

### Nova 2.0 Multimodal Data Mixing


```
data_mixing:
  dataset_catalog: sft_1p5_mm_chat
  sources:
    customer_data:
      percent: 50
    nova_data:
      charts: 1
      chat: 38
      code: 20
      docs: 3
      general: 2
      grounding: 1
      rag: 4
      screenshot: 4
      text: 8
      translation: 4
      video: 15
```

## Model Checkpoints


### Nova 2.0 Checkpoints

+ **PRE-TRAINED** [`nova-lite-2/pretraining-text-RD`]: Checkpoint after constant learning rate and ramp-down stages where model is trained on trillions of tokens. [Outcome of Stage 2]
+ **MID-TRAINED** [`nova-lite-2/pretraining-text-CE`]: Allows customers with intermediate volumes of unstructured data to introduce their data with a more conservative learning rate than pre-training, absorbing domain-specific knowledge while avoiding catastrophic forgetting. [Outcome of Stage 3]
+ **FINAL** [`nova-lite-2/prod`]: Fully aligned final checkpoint that has gone through all pretraining and post training steps. [Outcome of Stage 4]

**Training Stages:**
+ Stage 1: PT Ckpt, initial pre-training with constant learning rate
+ Stage 2: PT Ckpt, learning rate ramp-down
+ Stage 3: PT Ckpt, context extension training
+ Stage 4: instruction-following alignment and safety training

## Training Approaches



**Training Approach Selection Guide**  

| Data Type | Data Volume | Perform | With Checkpoint | 
| --- | --- | --- | --- | 
| Large-scale unstructured raw domain data (documents, logs, articles, code, etc.) | 1T+ Tokens | Continued Pre-Training | End of Constant Learning Rate (CLR) | 
| Large-scale unstructured raw domain data | 100B+ Tokens | Mid-Training | End of CLR | 
| Smaller volumes of unstructured raw data; Structured reasoning traces / CoT data | 1B+ Tokens | Mid-Training | Nova base model | 
| Structured demonstrations (high-quality input-output pairs, curated task instructions, multi-turn dialogues) | 1K+ Examples | Supervised Fine-Tuning (SFT) | Nova base model | 

## Prerequisites before you begin

+ We assume that you've already set up an SMHP cluster with a restricted instance group (RIG) that has active capacity. If not, refer to the [Docs Link](http://docs.aws.amazon.com/sagemaker/latest/dg/nova-forge.html) and [Workshop Link](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US) to complete your SMHP cluster and RIG setup
+ You will require **p5.48xlarge** EC2 instances to execute this recipe. The minimum number instances required to execute this recipe efficiently are as follows:
  + **Nova Lite 2.0 - 4 p5.48xlarge**
+ Install the Forge Specific SageMaker HyperPod CLI using the provided instructions [here](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US)
+ Confirm that you can connect to your cluster using `hyperpod get-clusters`
  + Note that this command will list all SMHP clusters in your account
+ Confirm that your training data, and optionally validation data, are available in an S3 bucket that is accessible to the execution role of your SMHP cluster. For data preparation, refer to the next section.
+ Complete the AWS CLI setup. If you have not done so, follow this [guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).
+ **Verification**: After completing the setup, confirm you can successfully run the following commands:

  ```
  aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region>
  
  hyperpod connect-cluster --cluster-name <cluster-name>
  ```

## A Systematic Approach to Achieving Successful SFT

+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. When using Amazon SageMaker HyperPod, this configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (number of training steps, learning rate, batch size)
  + Optional components (distributed training parameters, etc)
  + Data Mixing setting (defines proportions of customer and Nova data categories)
+ **Optimize SFT Hyperparameters**: The recommended SFT recipe parameter values are a great starting point and a robust choice. If you want to optimize them further for your use case, do multiple SFT runs with different parameter combinations and pick the best one. You can select parameter combinations following the hyperparameter optimization method of your choice. A simple approach is to vary the value of one parameter (for example, 0.5x the default, the default, and 2x the default) while keeping the default values for all other parameters, repeat this for each parameter you want to optimize, and iterate if needed. The most relevant parameters for LoRA are the learning rate, alpha (scaling parameter), number of epochs to train, and warmup steps; for full-rank it is mainly the learning rate, number of epochs, and warmup steps.
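
The one-parameter-at-a-time search described above can be sketched as follows (an illustrative helper; the function and the default values shown are our assumptions, not part of the recipes):

```python
def one_at_a_time_grid(defaults: dict, tunable: list[str], factors=(0.5, 1.0, 2.0)):
    """Yield candidate hyperparameter configs, varying one parameter at a time
    around its default (here 0.5x / 1x / 2x) while holding all other
    parameters at their defaults. The all-defaults config is deduplicated."""
    seen = set()
    for name in tunable:
        for f in factors:
            cfg = dict(defaults)
            cfg[name] = defaults[name] * f
            key = tuple(sorted(cfg.items()))
            if key not in seen:
                seen.add(key)
                yield cfg

# Example: LoRA-relevant parameters with assumed starting values
defaults = {"lr": 1e-5, "alpha": 64, "epochs": 2, "warmup_steps": 15}
candidates = list(one_at_a_time_grid(defaults, ["lr", "alpha"]))
```

Each candidate then corresponds to one short SFT run; pick the config whose run scores best on your validation set, and iterate around it if needed.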

## Experiment Sequencing and Data Mixing

+ If you have only SFT data (train/dev/test) for a set of tasks and care only about the test performance on these tasks
  + Do SFT without mixing on [FINAL] Nova checkpoint. Use the default SFT hyper-parameters and optionally optimize them for your use case. Monitor validation metrics and/or evaluate intermediate checkpoints for larger datasets.
+ If you have only SFT data (train/dev/test) for a set of tasks and care about test performance on these tasks and general benchmarks in the domain of interest
  + Start by doing SFT with Nova data mixing on a pre-training checkpoint (PRE-TRAINED or MID-TRAINED checkpoint, not FINAL). Using an intermediate checkpoint allows the model to better integrate your custom data with Nova's proprietary data while maintaining strong general capabilities.
  + Run shorter SFT training runs with varying amount of Nova data in the mix (e.g., 10%, 25%, 50%, 75%) and Nova data category selections that complement your use case (e.g., pick instruction following category if you care about general instruction following ability). Monitor validation metrics and evaluate if mixing helps performance on general benchmarks. Select the training mix and checkpoint that leads to the best combination of performance on your task and general performance. Depending on the use case, both task and general performances can be further improved using reinforcement fine tuning (RFT).

## Prepare Dataset for SFT


**Nova 2.0**: Use the Converse API format: [https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-call.html](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-call.html). Nova 2.0 data can also contain additional reasoning fields: [https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ReasoningContentBlock.html](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ReasoningContentBlock.html)

Reasoning content captures the model's intermediate thinking steps before generating a final answer. In the `assistant` turn, use the `reasoningContent` field to include reasoning traces. Use plain text for reasoning content, avoid markup tags like `<thinking>` and `</thinking>` unless specifically required by your task, and ensure reasoning content is clear and relevant to the problem-solving process.
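
As a sketch, one training line in this shape might be built as follows. The exact field layout is our assumption and should be verified against the linked Converse API and ReasoningContentBlock references before converting data at scale:

```python
import json

def make_sft_record(system: str, user: str, reasoning: str, answer: str) -> str:
    """Build one JSONL training line in a Converse-style message format,
    placing a plain-text reasoning trace in the assistant turn via
    reasoningContent (field layout assumed from the linked docs)."""
    record = {
        "system": [{"text": system}],
        "messages": [
            {"role": "user", "content": [{"text": user}]},
            {"role": "assistant", "content": [
                # Plain-text reasoning; no <thinking> markup tags
                {"reasoningContent": {"reasoningText": {"text": reasoning}}},
                {"text": answer},
            ]},
        ],
    }
    return json.dumps(record)

line = make_sft_record(
    "You are a concise math tutor.",
    "What is the next number in 1, 2, 4, 8, 16, ?",
    "Each term doubles the previous one, so the next term is 16 * 2.",
    "32",
)
```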

## Evaluation Methods


### Prerequisites

+ Checkpoint S3 URI from your training job's `manifest.json` file (for trained models)
+ Evaluation dataset uploaded to S3 in the correct format
+ Output S3 path for evaluation results

**Out of the box benchmarks**: Use the out-of-the-box benchmarks to validate performance on general tasks. For more details, see: [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-evaluate.html](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-evaluate.html)

### Bring Your Own Data


You can also supply custom data by formatting it as shown below and then using the containers listed later to get inference results, along with log probabilities for calibration if needed.

Create a JSONL file per task with the following structure:

```
{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
```
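
A minimal sketch of producing this input file from your own examples (the `write_eval_task` helper and the local file name are illustrative, not part of the Nova tooling):

```python
import json

def write_eval_task(path: str, examples: list[dict], system_prompt: str) -> None:
    """Write one evaluation task as JSONL in the bring-your-own-data input
    format shown above (metadata / system / query / response fields)."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps({
                "metadata": ex.get("metadata", ""),
                "system": system_prompt,
                "query": ex["query"],
                "response": ex["response"],
            }) + "\n")

write_eval_task(
    "eval-data.jsonl",
    [{"query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
      "response": "32", "metadata": "{key:4, category:'apple'}"}],
    "arithmetic-patterns, please answer the following with no other words: ",
)
```

Upload the resulting file to the S3 path you pass as `recipes.run.data_s3_path`.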

Outputs generated during the inference phase of the evaluation job will have the following structure:

```
{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}
```

**Field descriptions:**
+ `prompt`: Formatted input sent to the model
+ `inference`: Model's generated response
+ `gold`: Expected correct answer (the `response` field from the input dataset)
+ `metadata`: Optional metadata passed through from input
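
Given the output structure above, a simple exact-match scorer might look like this (illustrative; note that `inference` arrives as a stringified Python list, so it is parsed with `ast.literal_eval`):

```python
import ast
import json

def exact_match_accuracy(jsonl_lines: list[str]) -> float:
    """Score inference-output records (prompt / inference / gold / metadata)
    by exact match of the first generation against the gold answer."""
    correct = total = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        preds = ast.literal_eval(rec["inference"])  # e.g. "['32']" -> ['32']
        correct += int(bool(preds) and preds[0].strip() == rec["gold"].strip())
        total += 1
    return correct / total if total else 0.0

rows = ['{"prompt": "...", "inference": "[\'32\']", "gold": "32", "metadata": ""}']
```

For tasks where exact match is too strict (free-form answers), swap the comparison for a normalized or fuzzy match before aggregating.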

### Prepare Evaluation Config


Command to launch an evaluation job. Use `--override-parameters` to modify any entry in the recipe.

```
hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_micro_p5_48xl_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'
```

## Best Practices

+ **Prioritize data quality over volume**: High-quality, diverse, and representative training data is more valuable than large quantities of low-quality data.
+ **Include reasoning-instruction-following category**: When using data mixing, include the "reasoning-instruction-following" category to maintain strong generic performance across tasks.
+ **Use default learning rates**: Start with default learning rates (1e-5 for LoRA, 5e-6 for full-rank SFT) and adjust only if needed based on validation metrics.
+ **Balance Nova data mixing**: Mix at most 50% Nova data for an optimal latency-performance balance. Higher percentages may improve general capabilities but can increase training time.
+ **Monitor validation metrics**: Regularly evaluate intermediate checkpoints during training to detect overfitting or performance degradation early.
+ **Test on representative datasets**: Ensure your evaluation datasets accurately represent your production use cases for meaningful performance assessment.

## Prepare Training Job Config


### Hyper Parameters


Full set of hyper-parameters other than data mixing:

```
## Run config
run:
  name: my-lora-sft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://my-bucket-name/train.jsonl  # SageMaker HyperPod (SMHP) only and not compatible with SageMaker Training jobs. Note replace my-bucket-name with your real bucket name for SMHP job
  replicas: 4                      # Number of compute instances for training, allowed values are 4, 8, 16, 32
  output_s3_path: s3://my-bucket-name/outputs/               # Output artifact path (Hyperpod job-specific; not compatible with standard SageMaker Training jobs). Note replace my-bucket-name with your real bucket name for SMHP job
  
  ## MLFlow configs
  mlflow_tracking_uri: "" # Required for MLFlow
  mlflow_experiment_name: "my-lora-sft-experiment" # Optional for MLFlow. Note: leave this field non-empty
  mlflow_run_name: "my-lora-sft-run" # Optional for MLFlow. Note: leave this field non-empty
  
training_config:
  max_steps: 100                   # Maximum training steps (minimum: 4)
  save_steps: 10                   # Save a checkpoint every N training steps; must be <= max_steps (minimum: 4)
  save_top_k: 5                    # Keep top K best checkpoints (minimum: 1); supported only for SageMaker HyperPod jobs
  max_length: 32768                # Sequence length (options: 8192, 16384, 32768 [default], 65536)
  global_batch_size: 32            # Global batch size (options: 32, 64, 128)
  reasoning_enabled: true          # Set to true if data has reasoningContent; otherwise false

  lr_scheduler:
    warmup_steps: 15               # Learning rate warmup steps. Recommend 15% of max_steps
    min_lr: 1e-6                   # Minimum learning rate, must be between 0.0 and 1.0

  optim_config:                    # Optimizer settings
    lr: 1e-5                       # Learning rate, must be between 0.0 and 1.0
    weight_decay: 0.0              # L2 regularization strength, must be between 0.0 and 1.0
    adam_beta1: 0.9                # Exponential decay rate for first-moment estimates, must be between 0.0 and 1.0
    adam_beta2: 0.95               # Exponential decay rate for second-moment estimates, must be between 0.0 and 1.0

  peft:                            # Parameter-efficient fine-tuning (LoRA)
    peft_scheme: "lora"            # Enable LoRA for PEFT
    lora_tuning:
      alpha: 64                    # Scaling factor for LoRA weights ( options: 32, 64, 96, 128, 160, 192),
      lora_plus_lr_ratio: 64.0     # LoRA+ learning rate scaling factor (0.0–100.0)
```

The most relevant parameters for LoRA are learning rate, alpha (scaling parameter), number of epochs to train and warmup steps; for full-rank it is mainly the learning rate, number of epochs, and warmup steps. The recipes are pre-populated with the recommended defaults.
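
Before submitting, it can help to sanity-check a recipe against the constraints noted in the comments above. The validator below is an illustrative sketch of those constraints, not the service's actual validation:

```python
def validate_run_config(cfg: dict) -> list[str]:
    """Check a few recipe constraints from the hyperparameter comments above.
    Returns a list of human-readable error messages (empty if valid)."""
    errors = []
    if cfg["replicas"] not in (4, 8, 16, 32):
        errors.append("replicas must be one of 4, 8, 16, 32")
    if cfg["max_steps"] < 4:
        errors.append("max_steps minimum is 4")
    if not (4 <= cfg["save_steps"] <= cfg["max_steps"]):
        errors.append("save_steps must be between 4 and max_steps")
    if cfg["max_length"] not in (8192, 16384, 32768, 65536):
        errors.append("max_length must be 8192, 16384, 32768, or 65536")
    if cfg["global_batch_size"] not in (32, 64, 128):
        errors.append("global_batch_size must be 32, 64, or 128")
    if not (0.0 < cfg["lr"] < 1.0):
        errors.append("lr must be between 0.0 and 1.0")
    return errors

# Values matching the sample recipe above
cfg = {"replicas": 4, "max_steps": 100, "save_steps": 10,
       "max_length": 32768, "global_batch_size": 32, "lr": 1e-5}
```

Running such a check locally is cheaper than waiting for a submitted job to fail on a malformed recipe.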

## Set Up Data Mixing Block


Add the `data_mixing` section to your recipe with the appropriate percentage distribution across dataset categories.

Below we describe each available Nova data category.

### Nova 2.0 Configuration with data mixing


```
data_mixing:
  dataset_catalog: sft_1p5_text_chat       # Nova text dataset catalog
  sources:
    customer_data:
      percent: 50                 # Percent of overall mix to draw from customer data
    nova_data:                    # The remainder will be drawn from Nova data. The categories below must add to 100
      agents: 1                   # autonomous decision-making, task completion, goal-oriented behavior in AI systems
      baseline: 10                 # [New in Nova 1.5]
      chat: 0.5                    # Conversational exchanges demonstrating natural dialogue flow
      code: 10                     # Programming examples and solutions spanning multiple languages
      factuality: 0.1               # [New in Nova 1.5]
      identity: 1                 # [New in Nova 1.5]
      long-context: 1             # [New in Nova 1.5]
      math: 2                     # [New in Nova 1.5]
      rai: 1                      # ethical AI principles, safety considerations, and responsible technology deployment
      instruction-following: 13   # precise task execution based on varying levels of user prompts and directives
      stem: 0.5                     # Technical content covering science, technology, engineering, and mathematics
      planning: 10                 # Sequences demonstrating strategic thinking and step-by-step task breakdown
      reasoning-chat: 0.5
      reasoning-code: 0.5
      reasoning-factuality: 0.5
      reasoning-instruction-following: 45
      reasoning-math: 0.5
      reasoning-planning: 0.5
      reasoning-rag: 0.4
      reasoning-rai: 0.5
      reasoning-stem: 0.4
      rag: 1                      # combining retrieved external knowledge with generated responses
      translation: 0.1
```
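
Since the Nova category percentages must sum to 100, a quick pre-submit check can catch mistakes early. This helper is illustrative (the recipe itself stays YAML); the category values below mirror the sample configuration above:

```python
def check_mixing(sources: dict, tol: float = 1e-6) -> float:
    """Verify that nova_data category percentages sum to 100 and that
    the customer_data percent lies in [0, 100]. Returns the Nova total."""
    total = sum(sources["nova_data"].values())
    if abs(total - 100.0) > tol:
        raise ValueError(f"nova_data categories sum to {total}, expected 100")
    pct = sources["customer_data"]["percent"]
    if not 0 <= pct <= 100:
        raise ValueError("customer_data.percent must be in [0, 100]")
    return total

# Category percentages from the Nova 2.0 text data-mixing example above
sources = {
    "customer_data": {"percent": 50},
    "nova_data": {
        "reasoning-instruction-following": 45, "instruction-following": 13,
        "baseline": 10, "code": 10, "planning": 10, "math": 2,
        "agents": 1, "identity": 1, "long-context": 1, "rai": 1, "rag": 1,
        "chat": 0.5, "stem": 0.5, "reasoning-chat": 0.5, "reasoning-code": 0.5,
        "reasoning-factuality": 0.5, "reasoning-math": 0.5,
        "reasoning-planning": 0.5, "reasoning-rai": 0.5,
        "reasoning-rag": 0.4, "reasoning-stem": 0.4,
        "factuality": 0.1, "translation": 0.1,
    },
}
```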

What do these categories mean?


**Nova 2.0 Text Data Categories**  

| Category Name | Info detail | 
| --- | --- | 
| agents | Training data focused on autonomous decision-making, task completion, and goal-oriented behavior in AI systems | 
| baseline | Fundamental language data focused on general comprehension, basic communication, and core linguistic capabilities | 
| chat | Conversational exchanges demonstrating natural dialogue flow, context maintenance, and appropriate social interactions | 
| code | Programming source code, documentation, and technical discussions from various programming languages and platforms. | 
| factuality | Reference materials and verified information focused on accuracy, source validation, and truth assessment | 
| identity | Personality frameworks and behavioral patterns focused on consistent character traits, values, and interaction styles | 
| long-context | Extended texts and complex narratives focused on maintaining coherence and relevance across lengthy exchanges | 
| math | Mathematical content including textbooks, problems, solutions, and mathematical discussions. | 
| rai | Cases and scenarios emphasizing ethical AI principles, safety considerations, and responsible technology deployment | 
| instruction-following | Examples of precise task execution based on varying levels of user prompts and directives | 
| stem | Technical content covering science, technology, engineering, and mathematics, including problem-solving and theoretical concepts | 
| planning | Sequences demonstrating strategic thinking, step-by-step task breakdown, and efficient resource allocation | 
| reasoning-chat | Analytical dialogue scenarios focused on logical discussion and structured conversation flows | 
| reasoning-code | Programming challenges and algorithmic problems focused on systematic solution development | 
| reasoning-factuality | Information evaluation scenarios focused on critical assessment and verification processes | 
| reasoning-instruction-following | Complex task analysis focused on systematic interpretation and methodical execution | 
| reasoning-math | Mathematical problem-solving scenarios focused on logical progression and solution strategies | 
| reasoning-planning | Strategic decision-making scenarios focused on systematic approach to goal achievement | 
| reasoning-rag | Information retrieval and synthesis scenarios focused on contextual understanding and relevant application | 
| reasoning-rai | Ethical decision-making scenarios focused on systematic evaluation of AI safety and fairness | 
| reasoning-stem | Scientific problem-solving scenarios focused on methodical analysis and solution development | 
| rag | Examples of effectively combining retrieved external knowledge with generated responses to provide accurate, contextual information | 
| translation | Multi-language content pairs showing accurate translation while preserving context, tone, and cultural nuances | 

### Multimodal Data Mixing (Nova 2.0)


```
data_mixing:
  dataset_catalog: sft_1p5_mm_chat       # Nova multimodal dataset catalog
  sources:
    customer_data:
      percent: 50                 # Percent of overall mix to draw from customer data
    nova_data:                    # The remainder will be drawn from Nova data. The categories below must add to 100
      charts: 1
      chat: 38
      code: 20
      docs: 3
      general: 2
      grounding: 1
      rag: 4
      screenshot: 4
      text: 8
      translation: 4
      video: 15
```

Note: Nova 2.0 includes video data category support that is not available in Nova 1.0.

What do these categories mean?


**Nova 2.0 Multimodal Data Categories**  

| Category Name | Info detail | 
| --- | --- | 
| charts | Visual representations and descriptions of graphs, pie charts, bar charts, line plots, and other statistical visualizations to help the model understand and communicate quantitative information effectively | 
| chat | Conversational data paired with visual elements focused on contextual dialogue understanding and image-based interactions | 
| code | Programming interfaces and development environments focused on visual code interpretation, IDE screenshots, and technical diagrams | 
| docs | Document-centric data combining text, images, layouts, and formatting to train models in understanding and processing various document types and structures to help with concepts like PDF content recognition | 
| general | Diverse visual-textual content focused on broad comprehension of images, graphics, and accompanying descriptive text | 
| grounding | Visual reference materials and labeled imagery focused on connecting language concepts to real-world visual representations | 
| rag | Multimodal retrieval examples showing how to effectively combine and reference visual and textual external knowledge to generate accurate, contextual responses | 
| screenshot | Application interface captures and digital display images focused on understanding software interfaces and digital interactions | 
| text | A balanced pool of contextual text data create from the text-only SFT Nova dataset categories in order to provide generalist abilities | 
| translation | Cross-language visual content focused on multilingual interpretation of text in images and cultural visual elements | 
| video | Motion-based visual content focused on temporal understanding and sequential visual-narrative comprehension | 

## How to Launch a job


You can also refer to the README if you only need the essential details to kick off your first SFT run.

Container Information:


**Container Information and Launch Commands**  

| Model | Technique | Subcategory | Image URI | Hyperpod Launcher Command | 
| --- | --- | --- | --- | --- | 
| Nova 2.0 | Fine-tuning | SFT Text | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest | hyperpod start-job -n kubeflow --recipe fine-tuning/nova/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5_gpu_sft --override-parameters '{ "instance_type": "ml.p5.48xlarge", "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest" }' | 
| Nova 2.0 | Fine-tuning | SFT Text + Datamixing | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-DATAMIXING-latest | hyperpod start-job -n kubeflow --recipe fine-tuning/nova/forge/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5_gpu_sft_text_with_datamix --override-parameters '{ "instance_type": "ml.p5.48xlarge", "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-DATAMIXING-latest" }' | 
| Nova 2.0 | Fine-tuning | SFT MM | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest | hyperpod start-job -n kubeflow --recipe fine-tuning/nova/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5_gpu_sft --override-parameters '{ "instance_type": "ml.p5.48xlarge", "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-latest" }' | 
| Nova 2.0 | Fine-tuning | SFT MM + Datamixing | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-DATAMIXING-latest | hyperpod start-job -n kubeflow --recipe fine-tuning/nova/forge/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5_gpu_sft_mm_with_datamix --override-parameters '{ "instance_type": "ml.p5.48xlarge", "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-DATAMIXING-latest" }' | 

Once you're all set up, starting from the root of the sagemaker-hyperpod-cli repository, navigate to the default Nova SFT recipe folder:
+ `cd src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/recipes/training/nova/`
+ Here you can choose whether to run Nova 1 or Nova 2 recipes based on your choice of base model.

For Nova 2.0 SFT:
+ If you would like to run a regular SFT job:
  + `cd src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/recipes/fine-tuning/nova_2_0/nova_lite/SFT`, and you should see one recipe in this folder called `nova_lite_2_0_p5x8_gpu_sft.yaml`
+ If you would like to run a data-mixing SFT job, navigate to the SFT Forge recipes folder:
  + `cd src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/recipes/fine-tuning/nova/forge/nova_2_0/nova_lite/SFT`, and you should see one recipe in this folder called `nova_lite_2_0_p5x8_gpu_sft_with_datamix.yaml`
+ Edit the sections in the recipe required by the job, such as `name`, `data_s3_path`, `validation_s3_path`, `output_s3_path`, and `max_steps`. Since we're performing SFT, the notion of epochs doesn't apply here.

The data mixing config will look the same, but with an extra data mixing section at the bottom similar to this

```
data_mixing:
  dataset_catalog: sft_text_lite
  sources:
    nova_data:   # percent inputs for Nova data must sum to 100%; use 0% if you want to exclude a data grouping
      agents: 20
      business-and-finance: 20
      scientific: 20
      code: 20
      factual-and-news: 20
      longform-text: 0
      health-and-medicine: 0
      humanities-and-education: 0
      legal: 0
      math: 0
      additional-languages: 0
      social-and-personal-interest: 0
      entertainment: 0
      reasoning: 0
      other: 0
      tables: 0
    customer_data: # percent input of customer data. 100 = use only customer data, 0 = use only the nova_data mix above
      percent: 25
```

There are two top-level categories of data here:
+ `nova_data`: This is the actual data mix, sub-divided into further categories. It is imperative that these sum to 100%.
  + A complete breakdown of these categories, including token counts, can be found below.
+ `customer_data`: This is your training data, referenced in the `data_s3_path` key at the top of your YAML. The percentage provided here determines the resulting percentage for `nova_data`. For example, with the percent selections above, during training we'll use 25% `customer_data` and 75% `nova_data`, of which 15% will be agents, 15% business-and-finance, 15% scientific, 15% code, and 15% factual-and-news.
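
The effective-percentage arithmetic described above can be expressed as a small helper (illustrative only):

```python
def effective_mix(customer_percent: float, nova_categories: dict) -> dict:
    """Compute each source's effective share of the final training mix:
    customer data gets customer_percent of the total, and each Nova category
    gets its share of the remaining (100 - customer_percent)."""
    nova_share = 100.0 - customer_percent
    mix = {"customer_data": customer_percent}
    for name, pct in nova_categories.items():
        mix[name] = nova_share * pct / 100.0
    return mix

# With 25% customer data and five Nova categories at 20% each,
# each Nova category contributes 15% of the overall mix.
mix = effective_mix(25, {"agents": 20, "business-and-finance": 20,
                         "scientific": 20, "code": 20, "factual-and-news": 20})
```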

Tip: Run `pip install -e .` once again and you're ready to submit your job!

We'll be overriding a couple of parameters here to use data mixing:

```
hyperpod start-job \
 -n kubeflow \
 --recipe fine-tuning/nova/forge/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5x8_gpu_sft_with_datamix \
 --override-parameters '{
 "instance_type": "ml.p5.48xlarge",
 "recipes.run.name": "nova-sft-datamixing",
 "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-Datamix",
 "recipes.run.data_s3_path": "s3://sft-data/sft_train_data.jsonl",
 "recipes.run.validation_data_s3_path": "s3://sft-data/sft_val_data.jsonl",
 "recipes.run.output_s3_path": "s3://sft-data/output/"
 }'
```

For example, running the command from your terminal:

```
⚡ MY Desktop ⚡ % hyperpod start-job \
 -n kubeflow \
 --recipe training/nova/forge/nova_2_0/nova_lite/sft/nova_lite_2_0_p5x8_gpu_pretrain_with_datamix \
 --override-parameters '{
 "instance_type": "ml.p5.48xlarge",
 "recipes.run.name": "nova-sft-datamixing",
 "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-V2-Datamix",
 "recipes.run.data_s3_path": "s3://sft-data/sft_train_data.jsonl",
 "recipes.run.validation_data_s3_path": "s3://sft-data/sft_val_data.jsonl",
 "recipes.run.output_s3_path": "s3://sft-data/output/"
 }'
```

Output would be like this:

```
Final command: python3 /local/home/my/Downloads/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=training/nova/nova_micro_p5x8_gpu_pretrain cluster_type=k8s cluster=k8s base_results_dir=/local/home/niphaded/Downloads/sagemaker-hyperpod-cli/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" instance_type="p5d.48xlarge" container="900867814919.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:sft-datamix-rig-final"
Prepared output directory at /local/home/my/Downloads/sagemaker-hyperpod-cli/results/my-sft-run-wzdyn/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/my/Downloads/sagemaker-hyperpod-cli/results/my-sft-run-wzdyn/niphaded-sft-run-wzdyn_launch.sh
Running Helm script: /local/home/my/Downloads/sagemaker-hyperpod-cli/results/my-sft-run-wzdyn/niphaded-sft-run-wzdyn_launch.sh

NAME: my-sft-run-wzdyn
LAST DEPLOYED: Tue Aug 26 16:21:06 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: /local/home/my/Downloads/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/hyperpod-eks-ga-0703"
}
```

You can view the status of your job using `hyperpod list-pods -n kubeflow --job-name my-sft-run-wzdyn`:

```
hyperpod list-pods -n kubeflow --job-name my-sft-run-wzdyn 
{
 "pods": [
  {
   "PodName": "my-sft-run-wzdyn-master-0",
   "Namespace": "kubeflow",
   "Status": "Pending",
   "CreationTime": "2025-08-26 16:21:06+00:00"
  },
  {
   "PodName": "my-sft-run-wzdyn-worker-0",
   "Namespace": "kubeflow",
   "Status": "Pending",
   "CreationTime": "2025-08-26 16:21:06+00:00"
  }
 ]
}
```

or directly use the kubectl command to find them.

For example,

```
kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-sft-run)

NAME                                                         READY   STATUS      RESTARTS   AGE     IP              NODE                           NOMINATED NODE   READINESS GATES
my-sft-run-5suc8-master-0                              0/1     Completed   0          3h23m   172.31.32.132   hyperpod-i-00b3d8a1bf25714e4   <none>           <none>
my-sft-run-5suc8-worker-0                              0/1     Completed   0          3h23m   172.31.44.196   hyperpod-i-0aa7ccfc2bd26b2a0   <none>           <none>
my-sft-run-5suc8-worker-1                              0/1     Completed   0          3h23m   172.31.46.84    hyperpod-i-026df6406a7b7e55c   <none>           <none>
my-sft-run-5suc8-worker-2                              0/1     Completed   0          3h23m   172.31.28.68    hyperpod-i-0802e850f903f28f1   <none>           <none>
```

Pro tip: Always use the `-o wide` flag, since the EKS node on which the job runs helps you find your logs faster in the AWS console.

## How to Monitor Job


You can view your logs one of three ways:

### a) Using CloudWatch


Your logs are available under CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, navigate to the CloudWatch homepage in your account and search for your cluster name. For example, if your cluster were called my-hyperpod-rig, the log group would have the prefix:
+ Log group: /aws/sagemaker/Clusters/my-hyperpod-rig/[UUID]
+ Once you're in the log group, you can find your specific log using the node instance ID, such as hyperpod-i-00b3d8a1bf25714e4.
  + i-00b3d8a1bf25714e4 represents the HyperPod machine name where your training job is running. Recall that the earlier `kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-sft-run)` output included a column called NODE.
  + The "master" node in this case was running on hyperpod-i-00b3d8a1bf25714e4, so we'll use that string to select the log group to view. Select the one that says SagemakerHyperPodTrainingJob/rig-group/[NODE]


### b) Using CloudWatch Insights


If you have your job name handy and don't wish to go through all the steps above, you can simply query all logs under /aws/sagemaker/Clusters/my-hyperpod-rig/[UUID] to find the individual log.

SFT

```
fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting SFT Job/
| sort @timestamp desc
| limit 100
```

For job completion, replace `Starting SFT Job` with `SFT Job completed`.

Then you can click through the results and pick the one that says "Epoch 0" since that will be your master node.

### c) Using the AWS CLI


You may choose to tail your logs using the AWS CLI. Before doing so, check your AWS CLI version using `aws --version`, since the command syntax differs between V1 and V2. The commands below help with live log tracking in your terminal.

for V1:

```
aws logs get-log-events \
 --log-group-name /aws/sagemaker/YourLogGroupName \
 --log-stream-name YourLogStream \
 --start-from-head | jq -r '.events[].message'
```

for V2:

```
aws logs tail /aws/sagemaker/YourLogGroupName \
  --log-stream-name YourLogStream \
 --since 10m \
 --follow
```

### d) Set up MLflow


You can track metrics via MLFlow.

Create an MLflow app

Using Studio UI: If you create a training job through the Studio UI, a default MLflow app is created automatically and selected by default under Advanced Options.

Using CLI: If you use the CLI, you must create an MLflow app and pass it as an input to the training job API request.

```
mlflow_app_name="<enter your MLflow app name>"
role_arn="<enter your role ARN>"
bucket_name="<enter your bucket name>"
region="<enter your region>"

mlflow_app_arn=$(aws sagemaker create-mlflow-app \
  --name $mlflow_app_name \
  --artifact-store-uri "s3://$bucket_name" \
  --role-arn $role_arn \
  --region $region)
```

Access the MLflow app

Using CLI: Create a presigned URL to access the MLflow app UI:

```
aws sagemaker create-presigned-mlflow-app-url \
  --arn $mlflow_app_arn \
  --region $region \
  --output text
```

Once MLflow is set up, you can pass the tracking URI in your recipe or override it when starting the job. One example of how to do that can be found in the README.

## How to evaluate your model after SFT?


### Prerequisites

+ Checkpoint S3 URI from your training job's manifest.json file (for trained models)
+ Evaluation dataset uploaded to S3 in the correct format
+ Output S3 path for evaluation results

Out-of-the-box benchmarks: Use the out-of-the-box benchmarks to validate performance on general tasks. For more details, check here.

### Bring your own data


You can also supply your own data by formatting it as shown below, then using the containers listed later to obtain inference results, along with log probabilities for calibration if needed.

Create one JSONL file per task with the following structure:

```
{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
```
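As a sketch, you could generate such a task file programmatically; `write_eval_task` below is an illustrative helper (not part of any Forge tooling) that emits the schema shown above:

```python
import json

def write_eval_task(path, records):
    """Write evaluation records to a JSONL file, one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            # Only the fields expected by the evaluation job are emitted;
            # "metadata" is optional and defaults to an empty string here.
            row = {
                "metadata": rec.get("metadata", ""),
                "system": rec["system"],
                "query": rec["query"],
                "response": rec["response"],
            }
            f.write(json.dumps(row) + "\n")

records = [{
    "metadata": "{key:4, category:'apple'}",
    "system": "arithmetic-patterns, please answer the following with no other words: ",
    "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
    "response": "32",
}]
write_eval_task("eval-data.jsonl", records)
```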

Outputs generated during the inference phase of the evaluation job will have the following structure:

```
{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}
```

Field descriptions:
+ prompt: Formatted input sent to the model
+ inference: Model's generated response
+ gold: Expected correct answer, taken from the `response` field of the input dataset
+ metadata: Optional metadata passed through from input
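For illustration, a small post-processing sketch (not a Forge utility) can compare `inference` against `gold` for exact-match accuracy. Since `inference` is serialized as a Python-style list literal in the example above, `ast.literal_eval` is used to parse it; adjust this assumption to match your actual outputs:

```python
import ast
import json

def exact_match_accuracy(jsonl_lines):
    """Fraction of records whose first inference equals the gold answer."""
    correct = total = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        # "inference" is a string like "['32']"; parse it into a list.
        candidates = ast.literal_eval(rec["inference"])
        prediction = candidates[0].strip() if candidates else ""
        correct += prediction == rec["gold"].strip()
        total += 1
    return correct / total if total else 0.0

lines = [
    json.dumps({"prompt": "...", "inference": "['32']", "gold": "32", "metadata": ""}),
    json.dumps({"prompt": "...", "inference": "['31']", "gold": "32", "metadata": ""}),
]
print(exact_match_accuracy(lines))  # 0.5
```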

### Prepare Evaluation Config


Use the following command to launch an evaluation job. Use `--override-parameters` to modify any entry from the recipe.

```
hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_micro_p5_48xl_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'
```

### Launch Your Evaluation Job


The following table lists job launch commands for different recipes, with their corresponding images.


**Evaluation Job Launch Commands**  

| Model | Technique | Subcategory | Image URI | Command | 
| --- | --- | --- | --- | --- | 
| Nova 2.0 | Evaluation | Eval | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest | hyperpod start-job -n kubeflow --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_ft_eval --override-parameters '{ "instance_type": "ml.p5.48xlarge", "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest" }' | 

## Lessons Learned and Tips

+ The quality of the SFT dataset is critical. You should make every effort to filter out low-quality data. If you have a small subset of exceptionally high-quality data—in terms of both complexity and accuracy—you may consider placing it toward the end of training to help the model converge better.
+ We leverage both text and multimodal (MM) datasets for data mixing. Our experiments with text datasets show that adding Nova's proprietary "reasoning-instruction-following" category significantly improves performance across generic benchmarks. We recommend including this category in your data mix if generic benchmarks regress after SFT on your datasets.
+ For MM datasets, our experiments indicate that incorporating more than 20% video categories into the mix helps maintain generic benchmark performance.
+ Further, SFT with data mixing is quite sensitive to the learning rate, so our findings suggest fine-tuning with the default learning rates: 1e-5 for LoRA and 5e-6 for FR.
+ Finally, there is a trade-off between latency and performance when mixing in Nova proprietary datasets; our findings suggest mixing in at most 50% as a good balance.

# Reinforcement Learning
Reinforcement Learning

**Note**  
Detailed documentation is provided once you subscribe.

Nova Forge provides advanced reinforcement learning capabilities with the option to use remote reward functions in your own environment. You can integrate your own endpoint to execute validation for immediate real-world feedback, or use your own orchestrator to coordinate agentic multi-turn evaluations in your environment.

## Bring your own orchestrator for agentic multi-turn evaluations


For Forge users requiring multi-turn conversations or reward functions exceeding 15-minute timeouts, Nova Forge provides Bring Your Own Orchestration (BYOO) capabilities. This allows you to coordinate agentic multi-turn evaluations in your environment (e.g., using chemistry tools to score molecular designs, or robotics simulations that reward efficient task completion and penalize collisions).

**Topics**
+ [

### Architecture overview
](#nova-hp-rft-forge-architecture)
+ [

### Setup and execution
](#nova-hp-rft-forge-setup)

### Architecture overview


The BYOO architecture provides full control over the rollout and generation process through customer-managed infrastructure.

**Training VPC:**
+ **Rollout**: Coordinates training by delegating rollout generation to customer infrastructure
+ **Trainer**: Performs model weight updates based on received rollouts

**Customer VPC (such as ECS on EC2):**
+ **Proxy Lambda**: Receives rollout requests and coordinates with customer infrastructure
+ **Rollout Response SQS**: Queue for returning completed rollouts to training infrastructure
+ **Generate Request SQS**: Queue for model generation requests
+ **Generate Response SQS**: Queue for model generation responses
+ **Customer Container**: Implements custom orchestration logic (can use provided starter kit)
+ **DynamoDB**: Stores and retrieves state across the orchestration process

**Workflow:**

1. Rollout delegates rollout generation to Proxy Lambda

1. Proxy Lambda pushes rollout API request to Generate Request SQS

1. Customer container processes requests, manages multi-turn interactions, and calls reward functions

1. Container stores and retrieves state from DynamoDB as needed

1. Container pushes rollout responses to Rollout Response SQS

1. Rollout sends completed rollouts to Trainer for weight updates
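The six steps above can be sketched as a single-process simulation, with in-memory queues standing in for the SQS queues and a dict standing in for DynamoDB. The message shapes and reward function below are hypothetical, not the actual Forge request format:

```python
import queue

# Stand-ins for the AWS resources in the architecture above.
generate_request_q = queue.Queue()   # Generate Request SQS
rollout_response_q = queue.Queue()   # Rollout Response SQS
state_store = {}                     # DynamoDB

def reward_fn(completion):
    """Hypothetical reward: favor short answers (your environment's logic goes here)."""
    return 1.0 if len(completion) < 20 else 0.0

def proxy_lambda(rollout_request):
    """Step 2: the Proxy Lambda pushes the rollout request to the generate queue."""
    generate_request_q.put(rollout_request)

def customer_container():
    """Steps 3-5: process a request, track state, score it, and return a rollout."""
    request = generate_request_q.get()
    rollout_id = request["rollout_id"]
    # Step 4: persist per-rollout state across turns.
    state_store[rollout_id] = {"turns": request["turns"]}
    completion = " ".join(request["turns"])  # stand-in for model generation
    rollout_response_q.put({
        "rollout_id": rollout_id,
        "completion": completion,
        "reward": reward_fn(completion),
    })

# Step 1: the Rollout component delegates a rollout to the Proxy Lambda.
proxy_lambda({"rollout_id": "r-1", "turns": ["What is 2+2?", "4"]})
customer_container()
result = rollout_response_q.get()    # Step 6: returned to the Trainer
print(result["reward"])              # 1.0
```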

### Setup and execution


For detailed setup instructions, recipe configurations, request and response formats, and environment examples, refer to the confidential documentation provided to Nova Forge subscribers. To download the Nova Forge documents, run the following:

```
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
```

Once the assets are downloaded, you can find all the documentation under the `docs` folder.

# Responsible AI Toolkit and content moderation
Responsible AI Toolkit

## Responsible AI toolkit


Nova Forge provides a Responsible AI Toolkit that includes training and evaluation data to align models to Amazon Nova's responsible AI guidelines during the training process, and runtime controls to moderate model responses during inference.

**Training data** – Cases and scenarios emphasizing responsible AI principles, safety considerations, and responsible technology deployment are available for data mixing to align models responsibly during continued pre-training.

**Evaluations** – Evaluations testing the model's ability to detect and reject inappropriate, harmful, or incorrect content are available as a benchmark task to determine the delta between base model performance and custom model performance.

**Runtime controls** – By default, Amazon Nova's runtime controls moderate model responses during inference. To modify these runtime controls, request Amazon Nova's Customizable Content Moderation Settings by contacting an AWS account manager.

Safety is a shared responsibility between AWS and its users. Changing the base model or using continued pre-training to improve performance on a specific use case can impact safety, fairness, and other properties of the new model. A robust adaptation method minimizes changes to the safety, fairness, and other protections built into base models while minimizing impact on model performance for tasks the model was not customized for. End-to-end testing of applications on datasets representative of use cases is required to determine if test results meet specific expectations of safety, fairness, and other properties, as well as overall effectiveness. For more information, see Amazon Web Services Responsible Use of AI Guide, Amazon Web Services Responsible AI Policy, Amazon Web Services Acceptable Use Policy, and Amazon Web Services Service Terms.