

# Fine-tuning Amazon Nova models on SageMaker HyperPod
<a name="nova-hp-fine-tune"></a>

The following techniques show you how to fine-tune Amazon Nova models on SageMaker HyperPod.

**Topics**
+ [Supervised fine-tuning (SFT)](nova-fine-tune.md)
+ [Direct preference optimization (DPO)](nova-dpo.md)
+ [Proximal policy optimization (PPO)](nova-ppo.md)

# Supervised fine-tuning (SFT)
<a name="nova-fine-tune"></a>

The SFT training process consists of two main stages:
+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. When using SageMaker HyperPod, this configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (epochs, learning rate, batch size)
  + Optional components (such as distributed training parameters)

## Nova Model Comparison and Selection
<a name="nova-model-comparison"></a>

Amazon Nova 2.0 models are trained on a larger and more diverse dataset than Amazon Nova 1.0. Key improvements include:
+ **Enhanced reasoning abilities** with explicit reasoning mode support
+ **Broader multilingual performance** across additional languages
+ **Improved performance on complex tasks** including coding and tool use
+ **Extended context handling** with better accuracy and stability at longer context lengths

## When to Use Nova 1.0 vs. Nova 2.0
<a name="nova-model-selection"></a>

Choose Amazon Nova 1.0 when:
+ The use case requires standard language understanding without advanced reasoning
+ Performance has already been validated on Amazon Nova 1.0 and additional capabilities are not needed

Choose Amazon Nova 2.0 when your use case benefits from the improvements listed above, such as explicit reasoning, broader multilingual coverage, or longer context handling.

# SFT on Nova 1.0
<a name="nova-sft-1"></a>

Supervised fine-tuning (SFT) is the process of providing a collection of labeled prompt-response pairs to a pre-trained foundation model to improve its performance on a specific task. The labeled examples are formatted as prompt-response pairs and phrased as instructions. This fine-tuning process modifies the weights of the model.

You should use SFT when you have domain-specific data that requires providing specific prompt-response pairs for optimal results.

Note that your training and validation input datasets must reside in customer-owned S3 buckets, not in escrow or service-managed S3 buckets.

## Data requirements
<a name="nova-sft-1-data-requirements"></a>

For full-rank SFT and low-rank adapter (LoRA) SFT, the data should follow the [Amazon Bedrock Converse operation format](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html). For examples and constraints of this format, see [Preparing data for fine-tuning Understanding models](https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html).

To validate your dataset format before submission, we recommend using [the validation script from the Amazon Bedrock samples repository](https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/bedrock-finetuning/understanding/dataset_validation/nova_ft_dataset_validator.py). This validation tool helps ensure that your JSONL files adhere to the required format specifications and identifies any potential issues before you submit your fine-tuning job.
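As a rough illustration of what such a validator checks, the following sketch verifies that each JSONL record parses, that turns alternate between user and assistant roles, and that the record ends with an assistant response. It is a simplification for illustration only; the official script enforces many more constraints.

```python
import json

def check_record(line):
    """Minimal structural check for a Converse-format JSONL record.

    Illustrative simplification; the official validator also enforces
    token limits, image counts, and other constraints.
    """
    record = json.loads(line)  # raises ValueError on malformed JSON
    messages = record.get("messages", [])
    if not messages:
        return False, "record has no 'messages' list"
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            return False, f"turn {i} has role {msg.get('role')!r}, expected {expected!r}"
        if not msg.get("content"):
            return False, f"turn {i} has empty 'content'"
    if messages[-1]["role"] != "assistant":
        return False, "last turn must be an assistant response"
    return True, "ok"

sample = json.dumps({
    "system": [{"text": "You are a helpful assistant."}],
    "messages": [
        {"role": "user", "content": [{"text": "What is SFT?"}]},
        {"role": "assistant", "content": [{"text": "Supervised fine-tuning."}]},
    ],
})
ok, reason = check_record(sample)
```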

The Amazon Nova parameters that are available for tuning with SFT are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Amazon Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model to use for your training. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `data_s3_path`: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations within the provided S3 path must be in the same account and Region.
  + `validation_data_s3_path`: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations within the provided S3 path must be in the same account and Region.
  + `output_s3_path`: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations within the provided S3 path must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 65,536 tokens for SFT.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value by approximately 0.0-0.2 to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. Valid values are between 1e-6 and 1e-3, inclusive. We recommend values between 1e-6 and 1e-4 for good performance.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than the learning rate.
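To see how `warmup_steps` and `min_lr` interact with the `CosineAnnealing` schedule used in the recipes below, here is a sketch of the schedule shape, assuming linear warmup followed by cosine decay (the service's exact implementation may differ):

```python
import math

def cosine_annealing_lr(step, lr, min_lr, warmup_steps, max_steps):
    """Linear warmup to `lr`, then cosine decay to `min_lr` by `max_steps`."""
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps  # linear warmup phase
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * progress))

# With the recipe defaults below: lr=1e-5, min_lr=1e-6, warmup_steps=10
schedule = [cosine_annealing_lr(s, 1e-5, 1e-6, 10, 100) for s in range(100)]
```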

## Quick start with a full-rank SFT recipe
<a name="nova-sft-1-quick-start"></a>

The following is a recipe for full-rank SFT that's intended for you to quickly start an SFT job on a SageMaker HyperPod cluster. This recipe also assumes that you have connected to your SageMaker HyperPod cluster using the correct AWS credentials.

```
run:
  name: "my-sft-micro-job" # gets appended with a unique ID for HP jobs
  model_type: "amazon.nova-micro-v1:0:128k"
  model_name_or_path: "nova-micro/prod"
  replicas: 2
  data_s3_path: s3://<Replace with your S3 bucket name>/input.jsonl
  validation_data_s3_path: s3://<your S3 bucket name>/input.jsonl # optional
  output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## training specific configs
training_config:
  max_length: 32768
  save_steps: 100000
  replicas: ${recipes.run.replicas}
  micro_batch_size: 1
  task_type: sft
  global_batch_size: 64
  weights_only: True
  allow_percentage_invalid_samples: 10

  exp_manager:
    exp_dir: null
    create_wandb_logger: False
    create_tensorboard_logger: True
    wandb_logger_kwargs:
      project: null
      name: null
    checkpoint_callback_params:
      monitor: step
      save_top_k: 10
      mode: max
      every_n_train_steps: ${recipes.training_config.save_steps}
      save_last: True
    create_early_stopping_callback: True
    early_stopping_callback_params:
      min_delta: 0.001
      mode: min
      monitor: "val_loss"
      patience: 2

  trainer:
    log_every_n_steps: 1
    max_epochs: -1
    max_steps: 16
    limit_test_batches: 0
    gradient_clip_val: 1.0
    num_nodes: ${recipes.training_config.replicas}

  model:
    hidden_dropout: 0.0 # Dropout probability for hidden state transformer.
    attention_dropout: 0.0 # Dropout probability in the attention layer.
    ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
    sequence_parallel: True
    optim:
      lr: 1e-5
      name: distributed_fused_adam
      bucket_cap_mb: 10
      contiguous_grad_buffer: False
      overlap_param_sync: False
      contiguous_param_buffer: False
      overlap_grad_sync: False
      adam_w_mode: true
      eps: 1e-06
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999
      sched:
        name: CosineAnnealing
        warmup_steps: 10
        constant_steps: 0
        min_lr: 1e-6

    mm_cfg:
      llm:
        freeze: false
      image_projector:
        freeze: true
        require_newline: true
      video_projector:
        freeze: true
        require_newline: false

    peft:
      peft_scheme: null

    training_validation:
      loader:
        args:
          data_loader_workers: 1
          prefetch_factor: 2
      collator:
        args:
          force_image_at_turn_beginning: false
```
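Before submitting a job, you can sanity-check a recipe's run settings against the constraints described above. The following snippet is an illustrative check built only from the rules stated on this page, not an official validator:

```python
# Valid replica counts per model variant, as documented above.
VALID_REPLICAS = {
    "amazon.nova-micro-v1:0:128k": {2, 4, 8},
    "amazon.nova-lite-v1:0:300k": {4, 8, 16, 32},
    "amazon.nova-pro-v1:0:300k": {6, 12, 24},
}
MAX_LENGTH_SFT = 65536  # maximum supported sequence length for SFT

def check_run_config(model_type, replicas, max_length):
    """Return a list of constraint violations (empty if the config is valid)."""
    errors = []
    if model_type not in VALID_REPLICAS:
        errors.append(f"unknown model_type: {model_type}")
    elif replicas not in VALID_REPLICAS[model_type]:
        errors.append(
            f"{model_type} supports replicas {sorted(VALID_REPLICAS[model_type])}, got {replicas}"
        )
    if not 1 <= max_length <= MAX_LENGTH_SFT:
        errors.append(f"max_length must be between 1 and {MAX_LENGTH_SFT}, got {max_length}")
    return errors

# The quick-start recipe above passes these checks:
assert check_run_config("amazon.nova-micro-v1:0:128k", 2, 32768) == []
```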

## Sample full-rank recipe
<a name="nova-sft-1-sample-recipe"></a>

The following is a sample full-rank recipe for SFT with all components properly configured.

```
## Run config
run:
    name: "my-sft-run"              # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification
    model_name_or_path: "nova-lite/prod"      # Base model path
    replicas: 4                     # Number of compute instances for training
    data_s3_path: s3://<Replace with your S3 bucket name>/input.jsonl
    validation_data_s3_path: s3://<your S3 bucket name>/input.jsonl # optional
    output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
    max_length: 32768               # Maximum context window size (tokens)

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states
        attention_dropout: 0.0       # Dropout for attention weights
        ffn_dropout: 0.0             # Dropout for feed-forward networks

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-06               # Epsilon for numerical stability
            weight_decay: 0.0        # L2 regularization strength
            betas:                   # Adam optimizer betas
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        peft:
            peft_scheme: null        # Set to null for full-parameter fine-tuning
```

## Limitations
<a name="nova-sft-1-limitations"></a>

Publishing metrics to Weights & Biases is not supported.

To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

## Parameter-efficient fine-tuning (PEFT)
<a name="nova-fine-tune-peft"></a>

Parameter-efficient fine-tuning (PEFT) involves retraining a small number of additional weights to adapt a foundation model to new tasks or domains. Specifically, low-rank adapter (LoRA) PEFT efficiently fine-tunes foundation models by introducing low-rank trainable weight matrices into specific model layers, reducing the number of trainable parameters while maintaining model quality.

A LoRA PEFT adapter augments the base foundation model by incorporating lightweight adapter layers that modify the model’s weights during inference while keeping the original model parameters intact. This approach is also considered one of the most cost-effective fine-tuning techniques. For more information, see [Fine-tune models with adapter inference components](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-adapt.html).
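To make the parameter savings concrete, LoRA learns two low-rank factors per adapted weight matrix instead of updating the full matrix. The following back-of-the-envelope count uses hypothetical layer dimensions, not Amazon Nova's actual architecture:

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters for full fine-tuning vs. a LoRA adapter
    on a single d_in x d_out weight matrix."""
    full = d_in * d_out            # full fine-tuning trains every weight
    lora = rank * (d_in + d_out)   # LoRA trains A (d_in x rank) and B (rank x d_out)
    return full, lora

# Hypothetical 4096 x 4096 projection with adapter rank 32:
full, lora = lora_param_counts(4096, 4096, 32)
reduction = full / lora  # 64x fewer trainable parameters for this layer
```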

You should use LoRA PEFT in the following scenarios:
+ You want to start with a fast training procedure.
+ The base model's performance is already satisfactory. In this case, the goal of LoRA PEFT is to enhance its capabilities across multiple related tasks, such as text summarization or language translation. LoRA PEFT's regularization properties help prevent overfitting and mitigate the risks of the model "forgetting" the source domain. This ensures the model remains versatile and adaptable to various applications.
+ You want to perform instruction fine-tuning scenarios with relatively small datasets. LoRA PEFT performs better with smaller, task-specific datasets than broader, larger datasets.
+ You have large, labeled datasets that exceed the Amazon Bedrock customization data limits. In this case, you can use LoRA PEFT on SageMaker AI to generate better results.
+ You have already achieved promising results through Amazon Bedrock fine-tuning. In this case, LoRA PEFT in SageMaker AI can help further optimize the model hyperparameters.

The Amazon Nova parameters that are available for tuning with LoRA PEFT include:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model to use for your training. Select the model to use. The available options are `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model you use. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `output_s3_path`: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations within the provided S3 path must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 65,536 tokens for LoRA PEFT.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset. You can set either `max_steps` or `max_epochs`, but we do not recommend setting both. The maximum value is 5.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value by approximately 0.0-0.2 to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than the learning rate.
+ **LoRA configuration parameters**
  + `peft_scheme`: Set to `lora` to enable low-rank adaptation. 
  + `alpha`: The scaling factor for LoRA weights. This is typically set to the same value as `adapter_dim`.
  + `adapter_dropout`: The regularization parameter for LoRA.

### PEFT recipe
<a name="nova-sft-1-peft-recipe"></a>

The following is a recipe for LoRA PEFT.

```
## Run config
run:
    name: "my-lora-run"             # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification
    model_name_or_path: "nova-lite/prod"      # Base model path
    replicas: 4                     # Number of compute instances for training
    output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
    max_length: 32768               # Maximum context window size (tokens)

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states
        attention_dropout: 0.0       # Dropout for attention weights
        ffn_dropout: 0.0             # Dropout for feed-forward networks

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-06               # Epsilon for numerical stability
            weight_decay: 0.0        # L2 regularization strength
            betas:                   # Adam optimizer betas
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        peft:
            peft_scheme: "lora"      # Enable LoRA for parameter-efficient fine-tuning
            lora_tuning:
                loraplus_lr_ratio: 8.0  # LoRA+ learning rate scaling factor
                alpha: 32            # Scaling factor for LoRA weights
                adapter_dropout: 0.01  # Regularization for LoRA parameters
```
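The `loraplus_lr_ratio` setting follows the LoRA+ scheme, in which the adapter's B matrices are given a larger learning rate than the A matrices. Assuming the A matrices use the base learning rate (an assumption; the internal parameter grouping may differ), the effective rates work out as:

```python
def loraplus_lrs(base_lr, ratio):
    """LoRA+ learning rates: the B matrices train `ratio` times faster
    than the A matrices (assumed here to use the base learning rate)."""
    return {"lora_A": base_lr, "lora_B": base_lr * ratio}

# With lr: 1e-5 and loraplus_lr_ratio: 8.0 from the recipe above:
lrs = loraplus_lrs(1e-5, 8.0)
```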

### Troubleshooting
<a name="nova-sft-1-troubleshooting"></a>

Use the following information to help resolve issues that you might encounter:
+ The input datasets for both training and validation must reside in customer-owned S3 buckets, not in escrow or service-managed S3 buckets.
+ If you receive a Region not found error in the AWS CLI, resubmit the job with the Region prepended to the `start-job` command. For example: `AWS_REGION=us-east-1 hyperpod start-job ...Job Parameters`.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

# Direct preference optimization (DPO)
<a name="nova-dpo"></a>

Direct preference optimization (DPO) is an efficient fine-tuning method for foundation models that uses paired comparison data to align model outputs with human preferences. This approach enables direct optimization of model behavior based on human feedback about which responses are more desirable.

Both full-rank DPO and low-rank adapter (LoRA) DPO are available.

**Data format requirements**  
For both full-rank and LoRA DPO, the training data format requirements are similar to SFT. However, for DPO, the final turn needs to have preference pairs. Here is an example of the DPO data format:

```
// N-1 turns same as SFT format
{
    "role": "assistant",
    "candidates": [
        {
            "content": [
                {
                    "text": "..."
                } // content list can contain multiple 'text' objects
            ],
            "preferenceLabel": "preferred"
        },
        {
            "content": [
                {
                    "text": "..."
                } // content list can contain multiple 'text' objects
            ],
            "preferenceLabel": "non-preferred"
        }
    ]
}
```

Here is another complete DPO text sample:

```
{
    "system": [
        {
            "text": "..."
        }
    ],
    "messages":[
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "assistant",
            "candidates": [
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "preferred"
                },
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "non-preferred"
                }
            ]
        }
    ]
}
```

Here is a complete DPO image sample:

```
{
    "system": [
        {
            "text": "..."
        }
    ],
    "messages":[
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                },
                {
                    "text": "..."
                },
                {
                    "image": {
                        "format": "jpeg",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.jpg",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                } // "content" can have multiple "text" and "image" objects.
                 // max image count is 10
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                },
                {
                    "text": "..."
                },
                {
                    "image": {
                        "format": "jpeg",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.jpg",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                } // "content" can have multiple "text" and "image" objects.
                 // max image count is 10
            ]
        },
        {
            "role": "assistant",
            "candidates": [
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "preferred"
                },
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "non-preferred"
                }
            ]
        }
    ]
}
```

Other constraints on the input datasets apply. For more information, see [Dataset constraints](https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html#custom-fine-tune-constraints). We recommend that you include a minimum of 1,000 preference pairs for effective training. High-quality preference data leads to more effective results.
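To confirm that a dataset meets the DPO format and the recommended minimum, you can scan the JSONL file and count the records whose final turn carries a valid preference pair. This is an illustrative check, not an official validation tool:

```python
import json

def count_valid_preference_records(lines):
    """Count JSONL records whose final turn is an assistant message with a
    preferred / non-preferred candidate pair, per the DPO format above."""
    valid = 0
    for line in lines:
        record = json.loads(line)
        last = record.get("messages", [])[-1:]
        if not last or last[0].get("role") != "assistant":
            continue
        labels = {c.get("preferenceLabel") for c in last[0].get("candidates", [])}
        if labels == {"preferred", "non-preferred"}:
            valid += 1
    return valid

sample = json.dumps({
    "messages": [
        {"role": "user", "content": [{"text": "..."}]},
        {"role": "assistant", "candidates": [
            {"content": [{"text": "..."}], "preferenceLabel": "preferred"},
            {"content": [{"text": "..."}], "preferenceLabel": "non-preferred"},
        ]},
    ]
})
n = count_valid_preference_records([sample])
# For effective training, aim for at least 1,000 such preference pairs.
```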

We recommend using DPO in the following scenarios:
+ Optimizing for subjective outputs that require alignment with specific human preferences.
+ Adjusting the model’s tone, style, or content characteristics to match desired response patterns.
+ Making targeted improvements to an existing model based on user feedback and error analysis.
+ Maintaining consistent output quality across different use cases.
+ Implementing safety guardrails through preferred response patterns.
+ Training with reward-free reinforcement learning.
+ Using only preference data instead of graded or labeled data.
+ Improving the model in nuanced alignment tasks, such as helpfulness, harmlessness, or honesty.

## Full-rank DPO
<a name="customize-fine-tune-hyperpod-dpo-fr"></a>

The Amazon Nova parameters that are available for full-rank DPO are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `data_s3_path`: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
  + `validation_data_s3_path`: The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32,768 tokens for DPO.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value by approximately 0.0-0.2 to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than the learning rate.
+ **DPO configuration**
  + `beta`: Determines how closely the model should fit the training data or the original model. Valid values are between 0.001-0.5, inclusive.

    Specify larger values (for example, 0.5) to preserve more of the reference model behavior while more slowly learning new preferences. Specify smaller values (for example, 0.01-0.05) to more quickly learn new preferences at the risk of diverging from the reference model behavior.
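The effect of `beta` is easiest to see in the standard published DPO objective, where it scales the policy's log-probability margin over the reference model before the sigmoid. This is a sketch of the published loss, not the service's internal implementation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss: -log(sigmoid(beta * margin)), where margin is the
    policy's log-ratio advantage over the reference model."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return math.log1p(math.exp(-beta * margin))

# The same preference margin evaluated at a small and a large beta:
weak = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.05)
strong = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.5)
```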

**Full-rank DPO recipe**  
The following is a full-rank recipe for DPO.

```
## Run config
run:
  name: "my-dpo-micro-job"             # A descriptive name for your training job
  model_type: "amazon.nova-micro-v1:0:128k"  # Model variant specification, do not change
  model_name_or_path: "nova-micro/prod"      # Base model path, do not change
  replicas: 2                     # Number of compute instances for training, allowed values are 2, 4, 8
  data_s3_path: s3://<Replace with your S3 bucket name>/input.jsonl
  validation_data_s3_path: s3://<your S3 bucket name>/input.jsonl # optional
  output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
  max_length: 32768               # Maximum context window size (tokens).
  global_batch_size: 64           # Global batch size, allowed values are 16, 32, 64.

  trainer:
    max_epochs: 2                # Number of training epochs

  model:
    hidden_dropout: 0.0          # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0       # Dropout for attention weights, must be between 0.0 and 1.0
    ffn_dropout: 0.0             # Dropout for feed-forward networks, must be between 0.0 and 1.0

    optim:
      lr: 1e-5                 # Learning rate
      name: distributed_fused_adam  # Optimizer algorithm, do not change
      adam_w_mode: true        # Enable AdamW mode
      eps: 1e-06               # Epsilon for numerical stability
      weight_decay: 0.0        # L2 regularization strength, must be between 0.0 and 1.0
      betas:                   # Adam optimizer betas, must be between 0.0 and 1.0
        - 0.9
        - 0.999
      sched:
        warmup_steps: 10     # Learning rate warmup steps
        constant_steps: 0    # Steps at constant learning rate
        min_lr: 1e-6         # Minimum learning rate, must be lower than lr

    dpo_cfg:
      beta: 0.1              # Strength of preference enforcement. Limits: [0.001, 0.5]

    peft:
      peft_scheme: null      # Disable LoRA to trigger full-rank fine-tuning
```

## Low-rank adapter DPO
<a name="customize-fine-tune-hyperpod-dpo-lora"></a>

The Amazon Nova parameters that are available for low-rank adapter DPO are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32,768 tokens for DPO.

    Longer sequences will improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value (up to approximately 0.2) to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than learning rate.
+ **DPO configuration**
  + `beta`: Determines the tradeoff between fitting the preference training data and staying close to the reference model. Valid values are between 0.001-0.5, inclusive.

    Specify larger values (for example, 0.5) to preserve more of the reference model behavior while more slowly learning new preferences. Specify smaller values (for example, 0.01-0.05) to more quickly learn new preferences at the risk of diverging from the reference model behavior.
+ **LoRA configuration parameters**
  + `peft_scheme`: Set to `lora` to enable Low-Rank Adaptation, which generates a more efficient, smaller output model. These LoRA-specific properties are also available:
    + `alpha`: The scaling factor for LoRA weights. This is typically set to the same value as `adapter_dim`.
    + `adapter_dropout`: The regularization parameter for the LoRA parameters.
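
To build intuition for `beta`, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is illustrative only, not the service's internal implementation; the log-probability inputs are assumed to be summed over response tokens.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    beta scales the log-ratio margin between the policy and the
    reference model: larger beta penalizes divergence from the
    reference more strongly, smaller beta learns preferences faster.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - \
             (policy_rejected_logp - ref_rejected_logp)
    # Loss is -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

For the same preference margin, a larger `beta` saturates the loss sooner, which keeps updates closer to the reference model's behavior.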

**LoRA DPO recipe**  
The following is a recipe for LoRA DPO.

```
## Run config
run:
    name: "my-lora-run"             # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification, do not change
    model_name_or_path: "nova-lite/prod"      # Base model path, do not change
    replicas: 4                     # Number of compute instances for training. All supported values: {4, 8, 16, 32}
    data_s3_path: s3:Replace with your S3 bucket name/input.jsonl
    validation_data_s3_path: [OPTIONAL] s3:your S3 bucket name/input.jsonl
    output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
    max_length: 16384               # Maximum context window size (tokens). Should be between [1024, 32768] and multiple of 1024.
                                    # Note: Image dataset for DPO has a limit on 20k samples and 16384 max_length
    global_batch_size: 64           # Total samples per step. Limits: {16, 32, 64, 128, 256}

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states. Limits: [0.0, 1.0]
        attention_dropout: 0.0       # Dropout for attention weights. Limits: [0.0, 1.0]
        ffn_dropout: 0.0             # Dropout for feed-forward networks. Limits: [0.0, 1.0]

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm, do not change
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-08               # Epsilon for numerical stability
            weight_decay: 0.01       # L2 regularization strength
            betas:                   # Adam optimizer betas. Limits: [0.0, 1.0]
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        dpo_cfg:
            beta: 0.01               # Strength of preference enforcement. Limits: [0.001, 0.5]

        peft:
            peft_scheme: "lora"      # Enable LoRA for parameter-efficient fine-tuning
            lora_tuning:
                loraplus_lr_ratio: 20.0  # LoRA+ learning rate scaling factor. Limits: [0.0, 100.0]
                alpha: 64            # Scaling factor for LoRA weights. [32, 64, 96, 128, 160, 192]
                adapter_dropout: 0.01  # Regularization for LoRA parameters. Limits: [0.0, 1.0]
```

**Limitations**  
DPO has the following limitations:
+ Intermediate checkpoints are not saved for evaluation and you can't resume from an intermediate checkpoint. Only the last checkpoint is saved.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

# Proximal policy optimization (PPO)
<a name="nova-ppo"></a>

Proximal policy optimization (PPO) is the process of using several machine learning models to train and score a model. The following models are part of the PPO process:
+ **Actor train or policy model**: A supervised fine-tuning (SFT) model that gets fine-tuned and updated every epoch. The updates are made by sampling prompts, generating completions, and updating weights using a clipped-surrogate objective. This limits the per-token log-probability change so that each policy step is *proximal* to the previous one, preserving training stability.
+ **Actor generation model**: A model that generates prompt completions or responses to be judged by the reward model and critic model. The weights of this model are updated from the actor train or policy model each epoch.
+ **Reward model**: A model with frozen weights that's used to score the actor generation model.
+ **Critic model**: A model with unfrozen weights that's used to score the actor generation model. This score is often viewed as an estimate of the total reward the actor receives when generating the remaining tokens.
+ **Anchor model**: An SFT model with frozen weights that is used to calculate the KL divergence between the actor train model and the base model. The anchor model ensures that the updates to the actor model are not too drastic compared to the base model. Drastic changes can lead to instability or performance degradation.

The training data must be in JSONL format where each line contains a single JSON object that represents a training example. Here is an example:

```
{
    "turns": ["string", "string", ...], // Required
    "turns_to_mask": [integer, integer, ...], // Required
    "reward_category": "string", // Required
    "meta_data": {} // Optional
}
```
+ `turns` is an array of strings that represents the dialogue sequence. Each string is a system prompt, user message, or bot response. User messages typically end with "Bot: " to indicate where the model output begins. For example, `["System prompt", "User: Question Bot:", "Bot response"]`.
+ `turns_to_mask` is an array of 0-based indices that identify which turns should not receive gradient updates. The masked turns are typically system prompts and user turns. For example, `[0, 1, 3]` masks the system prompt (index 0) and the user messages (indices 1 and 3).
+ `reward_category` is a string that identifies what aspects of model performance to evaluate. It's used to select the appropriate reward model category during training. The available reward categories are `default`, `math`, `coding`, `if`, `rag`, and `rai`.
+ `meta_data` is an optional object that contains additional contextual or ground-truth information. This can include identifiers, source information, or conversation context. The structure is flexible based on your dataset needs.

Here is an example record:

```
{
    "turns": ["You are a helpful AI assistant.",
        "User: What is ML? Bot:",
        "Machine learning is...", "User: Examples? Bot:",
        "Email spam filtering is..."
    ],
    "turns_to_mask": [0, 1, 3],
    "reward_category": "default",
    "meta_data": {
        "messages": [{
                "role": "system",
                "content": "You are a helpful AI assistant."
            },
            {
                "role": "user",
                "content": "What is ML?"
            },
            {
                "role": "assistant",
                "content": "Machine learning is..."
            },
            {
                "role": "user",
                "content": "Examples?"
            },
            {
                "role": "assistant",
                "content": "Email spam filtering is..."
            }
        ]
    }
}
```
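
Because each line must satisfy the schema above, a quick pre-flight check can catch malformed records before a job is submitted. The following is an illustrative sketch (the field names and allowed categories come from this page; the validator itself is not part of the service):

```python
import json

VALID_CATEGORIES = {"default", "math", "coding", "if", "rag", "rai"}

def validate_ppo_record(line):
    """Return a list of problems found in one JSONL training line."""
    problems = []
    record = json.loads(line)
    turns = record.get("turns")
    if not isinstance(turns, list) or not all(isinstance(t, str) for t in turns):
        problems.append("turns must be an array of strings")
    mask = record.get("turns_to_mask")
    if not isinstance(mask, list) or not all(isinstance(i, int) for i in mask):
        problems.append("turns_to_mask must be an array of integers")
    elif isinstance(turns, list) and any(i < 0 or i >= len(turns) for i in mask):
        problems.append("turns_to_mask indices must be valid 0-based turn indices")
    if record.get("reward_category") not in VALID_CATEGORIES:
        problems.append("reward_category must be one of "
                        + ", ".join(sorted(VALID_CATEGORIES)))
    return problems

# A record matching the example above passes with no problems reported.
line = ('{"turns": ["You are helpful.", "User: Hi Bot:", "Hello!"], '
        '"turns_to_mask": [0, 1], "reward_category": "default"}')
print(validate_ppo_record(line))
```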

The reward modeling framework implements multi-dimensional optimization across distinct categorical objectives to facilitate robust model convergence. The reward category should be selected based on the task that the model must be optimized for. 

We recommend the following guidelines for selecting the right reward category for your tasks:
+ `default`: A general purpose optimizer for standard conversational tasks and basic interactions. Used for general conversations and discussions, basic writing tasks, simple question answering, and non-specialized knowledge queries. 

  Here is an example:

  ```
  {
      "turns": ["Write a summary of climate change"],
      "turns_to_mask": [0],
      "reward_category": "default"
  }
  ```
+ `math`: A specialized optimizer for mathematical computations and numerical reasoning tasks. Used for mathematical problem-solving, arithmetic calculations, algebraic equations, geometric problems, and statistical analysis.

  Here is an example:

  ```
  {
      "turns": ["Calculate the derivative of x²"],
      "turns_to_mask": [0],
      "reward_category": "math"
  }
  ```
+ `coding`: A dedicated category for programming and software development-related queries. Used for code implementation, debugging assistance, algorithm design, technical documentation, and system architecture questions.

  Here is an example:

  ```
  {
      "turns": ["Write a function to check if a string is palindrome"],
      "turns_to_mask": [0],
      "reward_category": "coding"
  }
  ```
+ `if`: A category for tasks that require precise procedural execution and step-by-step guidance. Used for multi-step procedures, sequential instructions, complex task decomposition, and process documentation.

  Here is an example:

  ```
  {
      "turns": ["Provide steps to deploy a web application"],
      "turns_to_mask": [0],
      "reward_category": "if"
  }
  ```
+ `rag`: A reward category for tasks that require answering queries based specifically on retrieved contextual information. Used when responses should be derived directly from provided reference materials, synthesizing factual content without going beyond the scope of retrieved information, ensuring answers are grounded in the supplied context rather than general knowledge.

  Here is an example:

  ```
  {
      "turns": ["The Synthesis Report integrates findings from all six IPCC assessment cycles, revealing that global surface temperature has increased 1.1°C from 1850-1900 to 2011-2020, with human activities unequivocally identified as the cause of this warming. Alarmingly, current policies put the world on track for 3.2°C warming by 2100. The document identifies 5 key climate system \"tipping points\" approaching and emphasizes that greenhouse gas emissions must decline 43% by 2030 (compared to 2019 levels) to limit warming to 1.5°C. Climate-related risks will escalate with every increment of warming, with loss and damage disproportionately affecting vulnerable populations. Despite some progress, climate adaptation remains uneven with significant gaps, and financial flows continue to fall below levels needed for mitigation goals.",
          "What were the key findings of the latest IPCC climate report?"],
      "turns_to_mask": [0, 1],
      "reward_category": "rag"
  }
  ```
+ `rai`: A reward category for tasks that require applying responsible AI principles such as fairness, transparency, and ethics. Used for evaluating potential biases in AI systems, ensuring privacy considerations, addressing ethical dilemmas, and promoting inclusive design principles.

  Here is an example:

  ```
  {
      "turns": ["Identify potential bias concerns when developing a loan approval algorithm and suggest mitigation strategies"],
      "turns_to_mask": [0],
      "reward_category": "rai"
  }
  ```

**Masking turns**  
In training datasets, the `turns_to_mask` parameter is crucial for controlling which conversation turns receive gradient updates during training. This array of indices determines which parts of the dialogue the model should learn to generate versus which parts should be treated as context only. Proper masking ensures the model learns appropriate response patterns while avoiding training on system prompts or user inputs that could degrade performance.

We recommend the following guidance for masking:
+ **Always mask index 0** - System prompts should never receive gradient updates.
+ **Always mask user turns** - Prevent the model from learning to generate user inputs.
+ **Pattern consistency** - Use identical masking patterns for similar conversation structures, such as `[0, 1, 3, 5]` for multi-turn dialogues.
+ **Selective training** - Mask early bot responses to focus training on improved final responses.
+ **Chain-of-thought preservation** - Only mask system and user turns when training on reasoning sequences.
+ **Quality filtering** - Mask low-quality assistant responses to prevent performance degradation.
+ **Context optimization** - Ensure masked turns don't remove essential context needed for subsequent responses.

The key to effective masking is monitoring training metrics and validation performance to identify whether your masking strategy preserves necessary context while focusing gradient updates on the desired model outputs.
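
The guidance above can be applied mechanically when your source data is stored as role-tagged messages. The following sketch assumes a simple message format with `role` keys (as in the `meta_data.messages` example earlier), which is an illustrative convention rather than a service requirement:

```python
def build_turns_to_mask(messages):
    """Return 0-based indices of turns that should not receive gradients.

    Masks every system and user turn so that only assistant responses
    are trained on, per the masking guidance above.
    """
    return [i for i, msg in enumerate(messages)
            if msg["role"] in ("system", "user")]

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Examples?"},
    {"role": "assistant", "content": "Email spam filtering is..."},
]
print(build_turns_to_mask(messages))  # matches the example record: [0, 1, 3]
```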

**Enable KL-divergence loss**  
To enable KL-divergence loss, the anchor server must be enabled to compute the divergence of the current policy from the original distribution. The KL loss type must be specified, and the coefficients must be set to a value other than zero. Higher coefficient values keep the model close to the original policy, which results in smaller changes to general performance. Lower coefficient values allow larger deviations from the previous policy, which can improve target-metric performance at the cost of general performance.

```
ppo_anchor:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.am_replicas}
  model:
    global_batch_size: 32

ppo_actor_train:
  model:
    ######## Use KL in actor loss ########
    kl_loss_type: low_var_kl
    kl_loss_coeff: 0.1

    ######## Use KL in reward model ######
    kl_reward_penalty_coeff: 0.1
```
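
For reference, the `low_var_kl` loss type corresponds to a low-variance per-token KL estimator that is common in RLHF implementations; whether Amazon Nova uses exactly this form internally is an assumption. A sketch of the estimate:

```python
import math

def low_var_kl_estimate(policy_logp, ref_logp):
    """Low-variance per-token KL estimate (assumed form):
    exp(log_ratio) - 1 - log_ratio, with log_ratio = ref_logp - policy_logp.

    The estimate is always non-negative and is zero only when the
    policy assigns the same log-probability as the reference.
    """
    log_ratio = ref_logp - policy_logp
    return math.exp(log_ratio) - 1.0 - log_ratio
```

A positive `kl_loss_coeff` multiplies this quantity in the actor loss, penalizing tokens where the policy drifts from the anchor model in either direction.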

**Learning rate**  
The learning rate for the critic and policy models can be adjusted, with 3e-6 being the default balanced choice. Higher learning rates typically lead to training instabilities, which can be identified through KL divergence spikes and erratic policy behavior. Lower learning rates may cause convergence issues and slow learning, indicated by stagnant rewards and minimal policy updates. Regular monitoring of KL divergence, reward score, and value loss helps in determining whether to adjust the learning rate during training.

```
ppo_critic:
  model:
    optim:
      lr: 3e-6

ppo_actor_train:
  model:
    optim:
      lr: 3e-06
```
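
A practical way to apply that monitoring guidance is a simple spike check over the logged KL-divergence series. The window size and threshold below are illustrative choices, not service defaults:

```python
def kl_spike_detected(kl_history, window=5, spike_factor=3.0):
    """Flag a KL-divergence spike: the latest value exceeds
    spike_factor times the mean of the preceding window.

    A sustained pattern of spikes suggests the learning rate
    may be too high.
    """
    if len(kl_history) <= window:
        return False  # not enough history to form a baseline
    recent = kl_history[-(window + 1):-1]
    baseline = sum(recent) / window
    return baseline > 0 and kl_history[-1] > spike_factor * baseline
```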

**Global batch size**  
Global batch size significantly impacts PPO performance in Amazon Nova, with larger batches generally improving training stability and gradient estimation while enabling more efficient parallel processing. However, very large batch sizes can lead to diminishing returns and may be constrained by available memory, requiring careful balance with learning rate and other hyperparameters.

```
ppo_actor_train:
  model:
    global_batch_size: 160
```

The Amazon Nova parameters that are available for tuning with PPO include:
+ **Run configuration**
  + `actor_train_replicas`: The number of compute instances to be used for the actor train model. Available values vary based on the model chosen. Amazon Nova Micro supports 1 or 2 replicas. Amazon Nova Lite supports 1, 2, or 4 replicas. Amazon Nova Pro supports 3, 6, or 12 replicas.
  + `rm_replicas`: The number of compute instances used for the reward model. We recommend that you use one replica for any model size.
  + `cm_replicas`: The number of compute instances used for the critic model. We recommend that you use one replica for any model size.
  + `actor_generation_replicas`: The number of compute instances used for the actor generation. Available values vary based on the model chosen. Amazon Nova Micro supports 1 replica. Amazon Nova Lite supports 1 or 2 replicas. Amazon Nova Pro supports 1 or 2 replicas.
  + `am_replicas`: The number of compute instances used for the anchor model. We recommend that you use one replica for any model size.
+ **Actor train configuration (policy config)**
  + `max_steps`: The maximum number of steps to fine-tune or train the actor train model. Here, one step is defined as a rollout, followed by training the actor train model with `global_batch_size` number of samples. One epoch is defined as `global_batch_size * trajectory_buffer_scale` samples.

    The value chosen here will vary based on your use case and dataset complexity. We recommend starting with 65 epochs, or 520 steps (the number of epochs multiplied by the `trajectory_buffer_scale` value). However, some tasks require a longer PPO training time to achieve the same performance.

    For PPO, training metrics such as a saturating reward model score and average action length, available from the [MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server.html) console, can help identify the optimal points for evaluation.
  + `actor_model_max_length`: The maximum length of the input data that is sent to the actor generation component to generate completions.
  + `reward_model_max_length`: The maximum length of the input data that is sent to the reward server to score completions.
  + `trajectory_buffer_scale`: This buffer represents the number of rollouts generated using the old actor train (policy) model before updating the weights and generating the new rollouts. The supported values are 1, 2, 4, 8, and 16.

    If `trajectory_buffer_scale` is 1, then the training is on policy. That means the rollouts are generated with the most updated model weights, but throughput suffers. If it's 16, then the model is slightly off-policy but throughput is higher. We recommend starting with 8 for each model.
  + `kl_reward_penalty_coeff`: This is the KL divergence term that ensures updates are not too drastic and the policy does not drift from the base or SFT model.
  + `kl_loss_coeff`: This value controls how much the KL divergence penalty influences the overall training objective in PPO.
  + `kl_loss_type`: This value specifies how to compute the divergence between current and reference policy distributions. The `kl_loss_types` available are `kl` (Standard KL divergence), `mse` (Mean squared error), `abs` (Absolute difference between log probabilities), and `low_var_kl` (low-variance KL approximation).
  + `model.clip_ratio`: The actor clip ratio (ε) in PPO is a hyperparameter that limits how much the policy can change during each update.
  + `model.optim.lr`: The learning rate used for surrogate loss training in the actor model. 
  + `model.lam`: Part of the advantage estimation process. Higher λ gives more weight to longer-term rewards but with higher variance, while a lower λ focuses more on immediate rewards with lower variance but more bias.
  + `model.ent_coeff`: Entropy loss in PPO encourages exploration by penalizing the policy when it becomes too deterministic (that is, always picking the same actions with high confidence).
+ **Reward model configuration**
  + `global_batch_size`: The batch size for scoring the completions using the reward model. If `ppo_actor_train.model.global_batch_size` is greater than `ppo_reward.model.global_batch_size`, the completions are scored in multiple batches. Note that `ppo_actor_train.model.global_batch_size % ppo_reward.model.global_batch_size` must equal 0.
  + `max_length`: The maximum context length of the reward model. This should be the same as `ppo_actor_train.model.max_length`.
+ **Critic model configuration**
  + `global_batch_size`: The batch size of the critic model value. The critic model will provide value estimates for each token in the responses provided by the actor model. The batch size is used for both inference and training.

    Note that `ppo_actor_train.model.global_batch_size % ppo_critic.model.global_batch_size` must equal 0 and `ppo_actor_train.model.global_batch_size * ppo_actor_train.model.trajectory_buffer_size % ppo_critic.model.global_batch_size == 0`.
  + `max_length`: The maximum context length of the critic model. This should be the same as `ppo_actor_train.model.max_length`.
  + `model.optim.lr`: The learning rate used for value loss training in the critic model.
+ **Anchor model configuration**
  + `global_batch_size`: The batch size for generating the logp of the frozen SFT or anchor model. Note that `ppo_actor_train.model.global_batch_size % ppo_anchor.model.global_batch_size` must equal 0.
  + `max_length`: The maximum context length of the anchor model. This should be the same as `ppo_actor_train.model.max_length`.
+ **Actor generation model configuration**
  + `actor_model_max_length`: The maximum context length of the actor model generation component. This should be the same as `ppo_actor_train.model.max_length`.
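
The divisibility constraints listed above are easy to violate when editing a recipe by hand. The following is an illustrative sketch that checks them for a set of component batch sizes; the constraint expressions come from the parameter descriptions above, while the checker itself is not part of the service:

```python
def check_ppo_batch_sizes(actor_gbs, reward_gbs, critic_gbs, anchor_gbs,
                          trajectory_buffer_scale):
    """Verify the global-batch-size divisibility constraints for PPO."""
    errors = []
    if actor_gbs % reward_gbs != 0:
        errors.append("actor_train batch size must be divisible by reward batch size")
    if actor_gbs % critic_gbs != 0:
        errors.append("actor_train batch size must be divisible by critic batch size")
    if (actor_gbs * trajectory_buffer_scale) % critic_gbs != 0:
        errors.append("actor batch size * trajectory buffer scale "
                      "must be divisible by critic batch size")
    if actor_gbs % anchor_gbs != 0:
        errors.append("actor_train batch size must be divisible by anchor batch size")
    return errors

# Values from the PPO recipe below: 160 is divisible by 16 for every component.
print(check_ppo_batch_sizes(160, 16, 16, 16, trajectory_buffer_scale=8))
```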

**PPO recipe**  
The following is a recipe for PPO.

```
## Run config
run:
  name: ndry-ppo-pro
  model_type: amazon.nova-pro-v1:0:300k
  model_name_or_path: nova-pro/prod
  data_s3_path: s3://testing/train.jsonl # Your training data S3 path

  actor_train_replicas: 6 # Actor train model replicas
  rm_replicas: 1 # Reward model replicas
  cm_replicas: 1 # Critic model replicas
  actor_generation_replicas: 2 # Actor generation model replicas
  am_replicas: 1 # Anchor model replicas

## Training config for each PPO component
ppo_reward:
  max_length: 8192 # model architecture max length
  trainer:
    num_nodes: ${recipes.run.rm_replicas}
  model:
    global_batch_size: 16

ppo_critic:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.cm_replicas}
  model:
    global_batch_size: 16
    optim:
      lr: 3e-6
      name: distributed_fused_adam
      adam_w_mode: true
      eps: 1e-06
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999

ppo_anchor:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.am_replicas}
  model:
    global_batch_size: 16

ppo_actor_generation:
  actor_model_max_length: 8192
  trainer:
    num_nodes: ${recipes.run.actor_generation_replicas}

ppo_actor_train:
  max_length: 8192
  max_steps: 520 # Stopping criteria Desired epoch num * trajectory_buffer_scale
  actor_model_max_length: 8192 # truncate input data to max length
  reward_model_max_length: 8192 # truncate input data to max length
  trajectory_buffer_scale: 8
  trainer:
    num_nodes: ${recipes.run.actor_train_replicas}
  model:
    global_batch_size: 160
    ent_coeff: 0
    clip_ratio: 0.2
    lam: 1
    kl_loss_coeff: 0.0
    kl_loss_type: low_var_kl
    kl_reward_penalty_coeff: 0.0
    hidden_dropout: 0.0 # Dropout probability for hidden state transformer.
    attention_dropout: 0.0 # Dropout probability in the attention layer.
    ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
    optim:
      lr: 3e-06
      name: distributed_fused_adam # only this optimizer is available for PPO.
      adam_w_mode: true
      eps: 1e-08
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999
```

**Limitations**  
PPO has the following limitations:
+ Intermediate checkpoints are not saved for evaluation and you can't resume from an intermediate checkpoint. Only the last checkpoint is saved.
+ Multimodal datasets aren't supported.
+ Training jobs aren't automatically stopped. You have to stop the job using the SageMaker HyperPod CLI.
+ Critic training metrics are not supported on TensorBoard.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).