

# Direct Preference Optimization (DPO)
<a name="nova-dpo-smtj"></a>

## Overview
<a name="nova-dpo-smtj-overview"></a>

Direct Preference Optimization (DPO) is an alignment technique that fine-tunes foundation models on paired comparison data to align model outputs with human preferences. Unlike reinforcement learning methods such as RLHF, DPO optimizes the model directly on human feedback about which of two responses is more desirable, without training a separate reward model, which makes it more stable and easier to scale.

**Why use DPO**

Foundation models may generate outputs that are factually correct but fail to align with specific user needs, organizational values, or safety requirements. DPO addresses this by enabling you to:
+ Fine-tune models toward desired behavior patterns
+ Reduce unwanted or harmful outputs
+ Align model responses with brand voice and communication guidelines
+ Improve response quality based on domain expert feedback
+ Implement safety guardrails through preferred response patterns

**How DPO works**

DPO uses paired examples in which human evaluators indicate which of two candidate responses is preferred. The model learns to increase the likelihood of generating the preferred responses while decreasing the likelihood of the non-preferred ones.
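
Concretely, the objective can be sketched in a few lines. The following is a minimal PyTorch sketch of the standard DPO loss, not the service's internal implementation; it assumes the summed per-response log-probabilities under the model being trained (the policy) and under a frozen reference copy have already been computed elsewhere:

```
import torch.nn.functional as F

def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    # How much more the policy favors each response than the reference does.
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the preferred margin above the rejected one; beta controls how
    # far the policy may drift from the reference model.
    return -F.logsigmoid(beta * (preferred_margin - rejected_margin)).mean()
```

The `beta` here is the same parameter exposed in the `dpo_cfg` recipe section later on this page.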

**When to use DPO**

Use DPO in the following scenarios:
+ Optimizing for subjective outputs that require alignment with specific human preferences
+ Adjusting the model's tone, style, or content characteristics
+ Making targeted improvements based on user feedback and error analysis
+ Maintaining consistent output quality across different use cases
+ Aligning a model from preference data alone, without the separate reward model that reinforcement-learning-based alignment requires

## Supported models and techniques
<a name="nova-dpo-smtj-models"></a>

DPO supports both full-parameter fine-tuning and LoRA (Low-Rank Adaptation):


| Model | Supported inputs | Instance type | Recommended instance count | Allowed instance count | 
| --- | --- | --- | --- | --- | 
| Amazon Nova Micro | Text | ml.p5.48xlarge | 2 | 2, 4, 8 | 
| Amazon Nova Lite | Text, image | ml.p5.48xlarge | 4 | 2, 4, 8, 16 | 
| Amazon Nova Pro | Text, image | ml.p5.48xlarge | 6 | 6, 12, 24 | 

**Training approaches**
+ **Full-rank DPO**: Updates all model parameters. Potentially delivers better alignment quality but requires more compute resources and produces larger models.
+ **LoRA DPO**: Uses lightweight adapters for parameter-efficient fine-tuning. Offers more efficient training and deployment with smaller output models while maintaining good alignment quality.

For most use cases, the LoRA approach provides sufficient adaptation capability with significantly improved efficiency.
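
In recipe terms, the choice is a single switch. As a sketch, using the `recipe_overrides` dictionary style from the training-job example later on this page, either variant below selects the corresponding approach:

```
# LoRA DPO: parameter-efficient adapters (sketch).
lora_overrides = {
    "training_config": {"model": {"peft": {"peft_scheme": "lora"}}}
}

# Full-rank DPO: update all parameters by disabling PEFT (sketch).
full_rank_overrides = {
    "training_config": {"model": {"peft": {"peft_scheme": None}}}
}
```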

## Data format
<a name="nova-dpo-smtj-data"></a>

DPO training data follows the same format as SFT, except the last assistant turn must contain preference pairs with `preferred` and `non-preferred` labels.

### Basic structure
<a name="nova-dpo-smtj-data-structure"></a>

The final assistant turn uses a `candidates` array instead of `content`:

```
{
  "role": "assistant",
  "candidates": [
    {
      "content": [
        {
          "text": "This is the preferred response."
        }
      ],
      "preferenceLabel": "preferred"
    },
    {
      "content": [
        {
          "text": "This is the non-preferred response."
        }
      ],
      "preferenceLabel": "non-preferred"
    }
  ]
}
```

### Complete text example
<a name="nova-dpo-smtj-data-text-example"></a>

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a helpful assistant."
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the capital of France?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "The capital of France is Paris."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "text": "Tell me more about it."
        }
      ]
    },
    {
      "role": "assistant",
      "candidates": [
        {
          "content": [
            {
              "text": "Paris is the capital and largest city of France, known for the Eiffel Tower, world-class museums like the Louvre, and its rich cultural heritage."
            }
          ],
          "preferenceLabel": "preferred"
        },
        {
          "content": [
            {
              "text": "Paris is a city in France."
            }
          ],
          "preferenceLabel": "non-preferred"
        }
      ]
    }
  ]
}
```

### Example with images
<a name="nova-dpo-smtj-data-image-example"></a>

```
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a helpful assistant."
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "Describe this image."
        },
        {
          "image": {
            "format": "jpeg",
            "source": {
              "s3Location": {
                "uri": "s3://your-bucket/your-path/image.jpg",
                "bucketOwner": "your-aws-account-id"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "candidates": [
        {
          "content": [
            {
              "text": "The image shows a detailed description with relevant context and observations."
            }
          ],
          "preferenceLabel": "preferred"
        },
        {
          "content": [
            {
              "text": "This is a picture."
            }
          ],
          "preferenceLabel": "non-preferred"
        }
      ]
    }
  ]
}
```
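
In the `s3Location` block, `uri` points to the image object and `bucketOwner` is the 12-digit AWS account ID of the account that owns the bucket.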

### Dataset requirements
<a name="nova-dpo-smtj-data-requirements"></a>
+ **Format**: Single JSONL file for training and, optionally, a single JSONL file for validation (a quick format check is sketched after this list)
+ **Minimum size**: 1,000 preference pairs recommended for effective training
+ **Quality**: High-quality preference data produces more effective results
+ **Other constraints**: Same as SFT. For more information, see Dataset constraints.
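
Before launching a job, it can be worth sanity-checking the DPO-specific structure locally. The sketch below (the file name is a placeholder) verifies only that each record's final turn is an assistant turn with exactly two candidates labeled `preferred` and `non-preferred`:

```
import json

# Illustrative sanity check for the DPO-specific structure; "train.jsonl"
# is a placeholder path.
with open("train.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        last_turn = record["messages"][-1]
        assert last_turn["role"] == "assistant", f"line {line_number}: last turn must be assistant"
        labels = sorted(c["preferenceLabel"] for c in last_turn.get("candidates", []))
        assert labels == ["non-preferred", "preferred"], f"line {line_number}: expected one preferred and one non-preferred candidate"
```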

**Uploading data**

```
aws s3 cp /path/to/training-data/ s3://your-bucket/train/ --recursive
aws s3 cp /path/to/validation-data/ s3://your-bucket/val/ --recursive
```

## Recipe configuration
<a name="nova-dpo-smtj-recipe"></a>

### General run configuration
<a name="nova-dpo-smtj-recipe-run"></a>

```
run:
  name: "my-dpo-run"
  model_type: "amazon.nova-lite-v1:0:300k"
  model_name_or_path: "nova-lite/prod"
  replicas: 4
```


| Parameter | Description | 
| --- | --- | 
| name | Descriptive name for your training job | 
| model_type | Nova model variant (do not modify) | 
| model_name_or_path | Base model path (do not modify) | 
| replicas | Number of compute instances for distributed training | 
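
Note that `replicas` must be one of the allowed instance counts for your model from the table in Supported models and techniques; the example's `replicas: 4` is an allowed count for Amazon Nova Lite.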

### Training configuration
<a name="nova-dpo-smtj-recipe-training"></a>

```
training_config:
  max_length: 16384
  global_batch_size: 32

  trainer:
    max_epochs: 3

  model:
    hidden_dropout: 0.0
    attention_dropout: 0.0
    ffn_dropout: 0.0
```


| Parameter | Description | Range | 
| --- | --- | --- | 
| max_length | Maximum sequence length in tokens | 1024–32768 | 
| global_batch_size | Samples per optimizer step | 16, 32, 64, or 128 (all models); 256 (Micro and Lite only) | 
| max_epochs | Training passes through the dataset | Min: 1 | 
| hidden_dropout | Dropout for hidden states | 0.0–1.0 | 
| attention_dropout | Dropout for attention weights | 0.0–1.0 | 
| ffn_dropout | Dropout for feed-forward layers | 0.0–1.0 | 
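
As a worked example of these settings together: with 32,000 preference pairs and `global_batch_size: 32`, one epoch is 1,000 optimizer steps, so `max_epochs: 3` runs 3,000 steps in total.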

### Optimizer configuration
<a name="nova-dpo-smtj-recipe-optimizer"></a>

```
model:
  optim:
    lr: 1e-5
    name: distributed_fused_adam
    adam_w_mode: true
    eps: 1e-08
    weight_decay: 0.0
    betas:
      - 0.9
      - 0.999
    sched:
      warmup_steps: 10
      constant_steps: 0
      min_lr: 1e-6
```


| Parameter | Description | Range | 
| --- | --- | --- | 
| lr | Learning rate | 0–1 (typically 1e-6 to 1e-4) | 
| weight_decay | L2 regularization strength | 0.0–1.0 | 
| warmup_steps | Steps to gradually increase the learning rate | 0–20 | 
| min_lr | Minimum learning rate at the end of decay | 0–1 (must be less than lr) | 
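
To see how these values interact, the helper below sketches the learning rate per step. It assumes linear warmup followed by cosine decay down to `min_lr`; the recipe's actual decay shape is not documented here, so treat the curve as illustrative.

```
import math

def lr_at_step(step, max_steps, lr=1e-5, min_lr=1e-6, warmup_steps=10):
    """Assumed warmup-then-cosine shape, not the recipe's exact scheduler."""
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps  # linear ramp up to lr
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * progress))
```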

### DPO-specific configuration
<a name="nova-dpo-smtj-recipe-dpo"></a>

```
model:
  dpo_cfg:
    beta: 0.1
```


| Parameter | Description | Range | 
| --- | --- | --- | 
| beta | Balance between fitting the training data and staying close to the original model | 0.001–0.5 | 

+ **Higher beta (for example, 0.1)**: Preserves more of the reference model's behavior but may learn preferences more slowly
+ **Lower beta (0.01–0.05)**: More aggressive preference learning, at the risk of diverging from the reference model

**Recommendation**: Start with `beta: 0.1` and adjust downward if preference learning seems insufficient.

### LoRA PEFT configuration
<a name="nova-dpo-smtj-recipe-lora"></a>

```
model:
  peft:
    peft_scheme: "lora"
    lora_tuning:
      loraplus_lr_ratio: 64.0
      alpha: 32
      adapter_dropout: 0.01
```


| Parameter | Description | Allowed values | 
| --- | --- | --- | 
| peft_scheme | Fine-tuning method | "lora" or null (full-rank) | 
| alpha | Scaling factor for LoRA weights | 32, 64, 96, 128, 160, 192 | 
| loraplus_lr_ratio | LoRA+ learning rate scaling factor | 0.0–100.0 | 
| adapter_dropout | Regularization for LoRA parameters | 0.0–1.0 | 
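
`loraplus_lr_ratio` follows the LoRA+ scheme, which trains the adapter's B matrices at a higher learning rate than the A matrices. Assuming the ratio simply multiplies the base learning rate, `lr: 1e-5` with `loraplus_lr_ratio: 64.0` would put the B matrices at an effective 6.4e-4.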

## Starting a training job
<a name="nova-dpo-smtj-start"></a>

**Container image**

```
708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest
```

**Example code**

```
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

# Placeholder values: substitute your own role ARN, bucket, and data paths.
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::your-aws-account-id:role/your-sagemaker-execution-role"
bucket_name = "your-bucket"
job_name = "my-dpo-job"
train_dataset_s3_path = f"s3://{bucket_name}/train/"
val_dataset_s3_path = f"s3://{bucket_name}/val/"

instance_type = "ml.p5.48xlarge"
instance_count = 4

image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest"

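# Override selected recipe fields; anything not listed keeps the recipe default.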
recipe_overrides = {
    "training_config": {
        "trainer": {"max_epochs": 2},
        "model": {
            "dpo_cfg": {"beta": 0.1},
            "peft": {
                "peft_scheme": "lora",
                "lora_tuning": {
                    "loraplus_lr_ratio": 64.0,
                    "alpha": 32,
                    "adapter_dropout": 0.01,
                },
            },
        },
    },
}

estimator = PyTorch(
    output_path=f"s3://{bucket_name}/{job_name}",
    base_job_name=job_name,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe="fine-tuning/nova/nova_lite_p5_gpu_lora_dpo",
    recipe_overrides=recipe_overrides,
    max_run=18000,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

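# "Converse" tells the job that the channel contains Converse-format JSONL
# (the bedrock-conversation-2024 schema shown in Data format above).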
train_input = TrainingInput(
    s3_data=train_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

val_input = TrainingInput(
    s3_data=val_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

estimator.fit(inputs={"train": train_input, "validation": val_input}, wait=True)
```
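
With `wait=True`, `fit` blocks until the job finishes and streams training logs to your terminal; pass `wait=False` to return immediately and monitor the job from the SageMaker console instead.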

## Deploying the model
<a name="nova-dpo-smtj-deploy"></a>

After training completes, deploy the customized model to Amazon Bedrock using the Custom Model Import functionality. Customized models support provisioned throughput inference, and LoRA-trained models additionally support on-demand inference.

For deployment instructions, see [Deploying customized models](deploy-custom-model.md).

## Limitations
<a name="nova-dpo-smtj-limitations"></a>
+ **Input modalities**: DPO accepts text and images only. Video input is not supported.
+ **Output modality**: Text only
+ **Preference pairs**: The final assistant turn must contain exactly two candidates with `preferred` and `non-preferred` labels
+ **Image limit**: Maximum 10 images per content block
+ **Mixed modalities**: Cannot combine text, image, and video in the same training job