
Continued Pre-Training and Mid-Training

Note

Detailed documentation is provided once you have subscribed.

Nova Forge CPT offers advanced capabilities beyond standard CPT, including access to intermediate checkpoints and data mixing with Nova's pre-training corpus. These features enable more efficient domain adaptation and better preservation of the model's general capabilities.

What are intermediate checkpoints and why are they needed?

Intermediate checkpoints are snapshots of the Amazon Nova model saved at different stages of pre-training, before the model reaches its final production-ready state. During model development, Amazon Nova undergoes multiple training phases: initial pre-training with a constant learning rate, learning rate ramp-down, context extension training, and finally instruction-following alignment and safety training.

For CPT, intermediate checkpoints are often preferable to the final Prod checkpoint because they are more plastic and receptive to domain adaptation. The Prod checkpoint has undergone extensive instruction-following alignment and safety training, which optimizes the model for general conversational use but can make it resistant to learning new domain-specific patterns during CPT. In contrast, the partially and fully pre-trained text-only checkpoints retain the model's pre-training characteristics; they have not been heavily steered toward specific behaviors, making them more efficient starting points for domain adaptation.

When performing large-scale CPT (>10B tokens), starting from intermediate checkpoints typically results in faster convergence, better training stability, and more effective domain knowledge acquisition. For small-scale CPT (<10B tokens), or when instruction-following capabilities need to be preserved, the Prod checkpoint may be more appropriate, as it allows domain adaptation while maintaining the model's conversational abilities.

Multiple intermediate checkpoints are necessary for CPT because they offer different levels of model plasticity, which affects how efficiently the model can absorb new domain knowledge. The final Prod checkpoint has been hardened through post-training: extensive instruction-following alignment and safety training optimize it for general conversational use but make it resistant to learning new domain-specific patterns. Earlier checkpoints, in contrast, retain the model's pre-training characteristics and have not been heavily steered toward specific behaviors, making them more plastic and receptive to domain adaptation.

To achieve the best training efficiency, multiple intermediate checkpoints are provided.

What checkpoints are available?

Nova 1.0

The Amazon Nova 1.0 family has three models (Micro, Lite, Pro). Micro offers three checkpoints; Lite and Pro each offer four, including a multi-modal mid-trained checkpoint.

  • PRE-TRAINED - [nova-<micro/lite/pro>/pretraining-text-partial]: This is the checkpoint after the constant learning rate stage of Amazon Nova pre-training where the model is trained on trillions of text tokens.

  • MID-TRAINED - [nova-<micro/lite/pro>/pretraining-text-full]: This is the text-only checkpoint produced after all stages of Amazon Nova pre-training and mid-training on trillions of text tokens have finished. Use these checkpoints if the model specifically should not have seen any multi-modal data.

  • MID-TRAINED - [nova-<lite/pro>/pretraining-mm-full]: This is the checkpoint produced after all stages of Amazon Nova pre-training and mid-training, including multi-modal data, have finished, with trillions of tokens processed.

  • POST-TRAINED - [nova-<micro/lite/pro>/prod]: This is the fully aligned final checkpoint of the model that has gone through all the pre-training and post-training steps.

Nova 2.0

There are three Amazon Nova Lite 2.0 checkpoints.

  • PRE-TRAINED - [nova-lite-2/pretraining-text-RD]: This is the checkpoint after the constant learning rate and ramp-down stages of Amazon Nova pre-training where the model is trained on trillions of tokens.

  • MID-TRAINED - [nova-lite-2/pretraining-text-CE]: This checkpoint allows intermediate volumes of unstructured data to be introduced with a more conservative learning rate than pre-training, absorbing domain-specific knowledge while avoiding catastrophic forgetting.

  • POST-TRAINED - [nova-lite-2/prod]: This is the fully aligned final checkpoint of the model that has gone through all the pre-training and post-training steps.

The following table maps data types to the recommended training stage and starting checkpoint.

Data Type | Perform | With Checkpoint
Large-scale unstructured raw domain data (documents, logs, articles, code, etc.) | Continued Pre-Training | Pre-Trained
Large-scale unstructured raw domain data (documents, logs, articles, code, etc.) | Mid-Training | Pre-Trained
Smaller volumes of unstructured raw data; structured reasoning traces / CoT data | Mid-Training | Mid-Trained
Structured demonstrations (high-quality input-output pairs, curated task instructions, multi-turn dialogues) | Full Fine-Tuning | Mid-Trained
Structured demonstrations (high-quality input-output pairs, curated task instructions, multi-turn dialogues) | Parameter-Efficient Fine-Tuning | Post-Trained

Which checkpoint to use?

Partially pre-trained and fully pre-trained text-only checkpoints typically converge faster and require fewer training steps for domain adaptation. However, they have no instruction tuning and would need to undergo post-training steps before they can perform useful tasks and follow instructions. The GA (Prod) checkpoint may require more steps to adapt, but it provides a safer starting point for small-scale experiments and will maintain some of its post-training capabilities even after CPT.

In general, with large training datasets (>10B tokens), start from the partially or fully pre-trained text-only checkpoints for more efficient and stable training, as the model's knowledge base will be substantially modified. With small datasets (<10B tokens), use the GA (Prod) checkpoint to preserve instruction-following capabilities while adapting to the domain.
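In the recipe, this choice is expressed through the model_name_or_path field. The fragment below is a minimal, illustrative sketch for Nova Lite 1.0 (the job name is a placeholder; the surrounding recipe structure is shown in the configuration examples later on this page):

run:
  name: "cpt-large-scale-run" # placeholder job name
  model_type: "amazon.nova-lite-v1:0:300k" # Model variant specification, do not change
  # Large-scale CPT (>10B tokens): start from an intermediate text-only checkpoint
  model_name_or_path: "nova-lite/pretraining-text-full"
  # Small-scale CPT (<10B tokens): start from the fully aligned checkpoint instead
  # model_name_or_path: "nova-lite/prod"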

How to use data mixing for 1.0 or 2.0 models?

When performing CPT with new domain data, it is highly beneficial to mix the new data with some of the data used previously in the pre-training stage of the model. Mixing old data with new domain data solves two problems:

  • Forgetting control: Prevents catastrophic forgetting by preserving the model's existing skills and knowledge. Without data mixing, training exclusively on narrow domain data causes the model to overwrite general capabilities. For example, a model trained only on legal documents might lose its ability to code or do math. Mixing in general-domain datasets preserves these general skills while the new domain is acquired.

  • Optimization stability: Maintains training stability by anchoring the model's internal representations. During CPT, the model's learned features are modified and data mixing provides gradients from diverse sources that guide this adaptation smoothly. Without it, training on narrow distributions can cause gradient instability, where the model's representations shift too drastically, leading to training divergence, loss spikes, or collapse of existing capabilities. This is the stability-plasticity tradeoff: the model should be plastic enough to learn new domain knowledge, but stable enough not to break what it already knows.

Nova CPT Data Mixing Capabilities

Access to Amazon Nova pre-training data and checkpoints is one of the core offerings of the Amazon Nova CPT customization. Amazon Nova CPT customization enables easy mixing of domain data with Amazon Nova's pre-training corpus. Further, the sampling ratio of specific Amazon Nova data categories (e.g., code, math, reasoning) can be changed and their proportions controlled to complement domain data. This allows reinforcement of capabilities that align with the use case while adapting the model to the specific domain.
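For example, a use case that depends heavily on coding and math could up-weight those categories when mixing. The fragment below is an illustrative sketch only (the percentages are hypothetical; across all nova_data categories they must still sum to 100):

nova_data:
  code: 20% # up-weighted to reinforce coding capabilities
  math: 15% # up-weighted to reinforce math capabilities
  # ...remaining categories reduced so that the nova_data total stays at 100%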

Finding the Optimal Mixing Ratio

The optimal ratio of Amazon Nova data versus domain data depends on the dataset's domain, complexity, size, quality, and on how important it is to maintain general capabilities. This ratio must be discovered through experimentation. An experimental framework for deciding how much Amazon Nova data to mix in is as follows.

Select a representative subset of domain data (e.g., 5B tokens) and keep this constant across all experimental runs.

Run small-scale CPT experiments varying only the amount of Amazon Nova data mixed in (the token arithmetic behind these totals is spelled out after the list):

  • No mixing: 100% domain → 5B domain only (total 5B)

  • Light mixing: 90% domain → 5B domain + ~0.56B Amazon Nova (total ~5.56B)

  • Medium mixing: 70% domain → 5B domain + ~2.14B Amazon Nova (total ~7.14B)

  • Heavy mixing: 50% domain → 5B domain + 5B Amazon Nova (total 10B)
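The Amazon Nova token amounts above follow from keeping the domain subset fixed: if domain data should make up a fraction f of the total mix, the amount of Amazon Nova data to add is domain_tokens × (1/f − 1). For example, at 70% domain, 5B × (1/0.7 − 1) ≈ 2.14B Amazon Nova tokens, for a total of ≈7.14B.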

Evaluate each checkpoint on in-domain and general-domain benchmarks. Also evaluate the starting checkpoint (the Amazon Nova checkpoint before any training).

  • Does customer-domain performance stay roughly constant across runs? It usually should, since each run saw the same number of domain tokens. If domain performance improves with more mixing, Amazon Nova data provides useful regularization.

  • Do general benchmark scores improve as mixing is increased?

    • Expected behavior is that the general capabilities should improve monotonically as more Amazon Nova data is added.

    • Measure multiple general benchmarks: MMLU (general knowledge), HumanEval (coding), GSM8K (math), or specific benchmarks of interest.

  • Select the mixing ratio that maintains domain performance while delivering acceptable general capabilities for the use cases. Factor in the additional cost of training with more data mixing.

Once the optimal mixing ratio has been identified, run full-scale CPT using the complete domain dataset with the selected mixing ratio.
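For example, if the medium-mixing run (70% domain) performed best, the full-scale recipe carries that ratio through the customer_data percentage. A minimal sketch of the relevant fragment (the nova_data category breakdown is elided here; see the full configuration examples below):

data_mixing:
  dataset_catalog: cpt_text_lite
  sources:
    nova_data:
      # category percentages as in the full examples below; must sum to 100
    customer_data:
      percent: 70 # 70% of the mix is domain data, 30% is Amazon Nova data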

Dissecting the Data Mixing Categories

Below, each available data mixing category is described to help you decide which categories make the most sense to represent in your overall data mixture.

How to Enable Data Mixing

Add the data_mixing section to your recipe with the appropriate percentage distribution across dataset categories. The nova_data percentages must sum to 100.

Nova 1.0 Configuration with Data Mixing

run:
  name: "cpt-job-name" # A descriptive name for your training job
  model_type: "amazon.nova-lite-v1:0:300k" # Model variant specification, do not change
  model_name_or_path: "nova-lite/prod"
  replicas: 4
  data_s3_path: "s3://path/to/data/xyz.jsonl"
  output_s3_path: "s3://path/to/output/checkpoint"
  skip_recipe_validation: true
training_config:
  max_length: 32768
  global_batch_size: 64
  trainer:
    max_steps: 5000
  model:
    hidden_dropout: 0.1
    attention_dropout: 0.1
    ffn_dropout: 0.1
    optim:
      lr: 1.5e-05
      name: distributed_fused_adam
      adam_w_mode: true
      eps: 1.0e-06
      weight_decay: 0.05
      betas:
        - 0.9
        - 0.999
      sched:
        warmup_steps: 500
        constant_steps: 0
        min_lr: 1.5e-06
  data_mixing:
    dataset_catalog: cpt_text_lite
    sources:
      nova_data:
        en-entertainment: 0.11%
        en-factual: 4.83%
        en-legal: 0.48%
        en-long-form-text: 6.26%
        en-mined: 16.79%
        en-other: 1.79%
        en-scientific: 10.53%
        en-social: 12.43%
        en-techqa: 13.95%
        code: 7.50%
        high-util-lang: 8.05%
        low-util-lang: 6.51%
        math: 8.76%
        en-finance: 1%
        tables: 1%
      customer_data:
        percent: 90

What do these categories mean?

Category Name Info detail
en-entertainment Media and entertainment content including video transcripts, game dialogue, and entertainment-focused discussions.
en-factual Reference material, encyclopedic content, educational resources, and factual documentation focused on conveying accurate information.
en-finance Financial texts including market reports, economic analyses, investment strategies, financial news articles, earnings reports, and other finance-related content that helps the model understand economic concepts and financial terminology.
en-legal Legal documents, court proceedings, contracts, laws, regulations, and legal analysis texts.
en-long-form-text Extended writings including books, academic papers, lengthy articles, and other substantial text documents.
en-mined Text data extracted from various web sources, including forums, comments, discussions, and general web content, rewritten to ensure high training performance.
en-other Miscellaneous English language content that doesn't fit clearly into other categories.
en-scientific Scientific papers, research documents, technical reports, and scientific discussions across various fields.
en-social Social media posts, conversations, discussions, and other forms of social communication.
en-techqa Technical documentation, user guides, FAQ pages, technical forums, and Q&A content related to technology.
code Programming source code, documentation, and technical discussions from various programming languages and platforms.
high-util-lang Text content in languages with large amounts of available training data, incl. German (DE), Italian (IT), Spanish (ES), French (FR), Hindi (HI), Japanese (JP), Arabic (AR) and Portuguese (PT)
low-util-lang Text content in additional spoken languages with smaller amounts of available training data.
math Mathematical content including textbooks, problems, solutions, and mathematical discussions.
tables Structured data in tabular format including spreadsheets, databases, CSV files, statistical tables, financial reports, and other row-column organized information that helps the model understand and work with structured data relationships and patterns.

Nova 2.0 Configuration with Data Mixing

# Note:
# This recipe can run on p5.48xlarge
# Run config
display_name: "Nova Lite Pretrain on P5 GPU"
versions: ["2.0"]
instance_types: ["ml.p5.48xlarge"]
run:
  name: "my-cpt-run" # A descriptive name for your training job
  model_type: "amazon.nova-2-lite-v1:0:256k" # Model variant specification, do not change
  model_name_or_path: "nova-lite-2/prod" # Base model path, do not change
  replicas: 8 # Number of compute instances for training, allowed values are 4, 8, 16, 32
  data_s3_path: "" # Customer data paths
  validation_data_s3_path: "" # Customer validation data paths
  output_s3_path: "" # Output artifact path, SageMaker HyperPod job-specific configuration - not compatible with standard SageMaker Training jobs
## Training specific configs
training_config:
  task_type: cpt
  max_length: 8192 # Maximum context window size (tokens)
  global_batch_size: 64 # Global batch size, allowed values are 32, 64, 128, 256.
  trainer:
    max_steps: 10 # The number of training steps to run total
    val_check_interval: 10 # The number of steps between running validation
    limit_val_batches: 2 # Batches of the validation set to use each trigger
  model:
    hidden_dropout: 0.0 # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0 # Dropout for attention weights, must be between 0.0 and 1.0
    optim:
      optimizer: adam
      lr: 1e-5 # Learning rate
      name: distributed_fused_adam # Optimizer algorithm, do not change
      adam_w_mode: true # Enable AdamW mode
      eps: 1e-06 # Epsilon for numerical stability
      weight_decay: 0.0 # L2 regularization strength, must be between 0.0 and 1.0
      adam_beta1: 0.9 # Beta1 for Adam optimizer
      adam_beta2: 0.95 # Beta2 for Adam optimizer
      sched:
        warmup_steps: 10 # Learning rate warmup steps
        constant_steps: 0 # Steps at constant learning rate
        min_lr: 1e-6 # Minimum learning rate, must be lower than lr
  data_mixing:
    dataset_catalog: cpt_text_lite
    sources:
      nova_data: # percent inputs for Nova data must sum to 100%; use 0% if you want to exclude a data grouping
        agents: 20
        business-and-finance: 4
        scientific: 10
        code: 5
        factual-and-news: 5
        longform-text: 6
        health-and-medicine: 1
        humanities-and-education: 1
        legal: 1
        math: 9
        additional-languages: 15
        social-and-personal-interest: 11
        entertainment: 0.5
        reasoning: 10
        other: 0.5
        tables: 1
      customer_data: # percent input of customer data. 100 = use only customer data, 0 = use only the nova_data mix above
        percent: 25

What do these categories mean?

Note: Nova 2.0 includes additional reasoning-specific categories (e.g., reasoning-code, reasoning-math, reasoning-instruction-following) that are not available in Nova 1.0.

Summary of Categories and Info Labels:

Category Name Info detail
agents Training data focused on autonomous decision-making, task completion, and goal-oriented behavior in AI systems
baseline Fundamental language data focused on general comprehension, basic communication, and core linguistic capabilities
chat Conversational exchanges demonstrating natural dialogue flow, context maintenance, and appropriate social interactions
code Programming source code, documentation, and technical discussions from various programming languages and platforms.
factuality Reference materials and verified information focused on accuracy, source validation, and truth assessment
identity Personality frameworks and behavioral patterns focused on consistent character traits, values, and interaction styles
long-context Extended texts and complex narratives focused on maintaining coherence and relevance across lengthy exchanges
math Mathematical content including textbooks, problems, solutions, and mathematical discussions.
rai Cases and scenarios emphasizing ethical AI principles, safety considerations, and responsible technology deployment
instruction-following Examples of precise task execution based on varying levels of user prompts and directives
stem Technical content covering science, technology, engineering, and mathematics, including problem-solving and theoretical concepts
planning Sequences demonstrating strategic thinking, step-by-step task breakdown, and efficient resource allocation
reasoning-chat Analytical dialogue scenarios focused on logical discussion and structured conversation flows
reasoning-code Programming challenges and algorithmic problems focused on systematic solution development
reasoning-factuality Information evaluation scenarios focused on critical assessment and verification processes
reasoning-instruction-following Complex task analysis focused on systematic interpretation and methodical execution
reasoning-math Mathematical problem-solving scenarios focused on logical progression and solution strategies
reasoning-planning Strategic decision-making scenarios focused on systematic approach to goal achievement
reasoning-rag Information retrieval and synthesis scenarios focused on contextual understanding and relevant application
reasoning-rai Ethical decision-making scenarios focused on systematic evaluation of AI safety and fairness
reasoning-stem Scientific problem-solving scenarios focused on methodical analysis and solution development
rag Examples of effectively combining retrieved external knowledge with generated responses to provide accurate, contextual information
translation Multi-language content pairs showing accurate translation while preserving context, tone, and cultural nuances

Parameter Guide

  • dataset_catalog: The only supported value is cpt_text_lite for now, until multimodal training is enabled.

  • nova_data: Percentage of the individual categories of Nova data when mixed in. They must add up to 100.

  • customer_data: The percentage of customer data mixed with the Nova data.

The total number of tokens used in training can be calculated as max_length * global_batch_size * max_steps.
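For example, the Nova 1.0 recipe above (max_length 32768, global_batch_size 64, max_steps 5000) trains on roughly 32,768 × 64 × 5,000 ≈ 10.5B tokens.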

Limitations

CPT currently supports only text data and does not support customer multi-modal datasets.