# Direct Preference Optimization (DPO)

## Overview

Direct Preference Optimization (DPO) is an alignment technique that fine-tunes foundation models on paired comparison data so that model outputs match human preferences. Unlike reinforcement-learning-based methods, DPO optimizes the model directly on human feedback about which of two responses is more desirable, offering a more stable and scalable approach.
## Why use DPO

Foundation models may generate outputs that are factually correct but fail to align with specific user needs, organizational values, or safety requirements. DPO addresses this by enabling you to:

- Fine-tune models toward desired behavior patterns
- Reduce unwanted or harmful outputs
- Align model responses with brand voice and communication guidelines
- Improve response quality based on domain expert feedback
- Implement safety guardrails through preferred response patterns
## How DPO works

DPO uses paired examples in which human evaluators indicate which of two candidate responses is preferred. During training, the model learns to increase the likelihood of generating preferred responses while decreasing the likelihood of non-preferred ones, relative to a frozen reference copy of the base model.
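The underlying objective can be sketched in a few lines. The following is an illustrative, plain-Python implementation of the standard per-pair DPO loss; the function name and arguments are ours, and this is not the training code the service runs.

```python
import math

def dpo_loss(policy_logp_preferred: float, policy_logp_rejected: float,
             ref_logp_preferred: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the total log-probability a model assigns to a
    response; the ref_* values come from the frozen reference model.
    """
    # Log-ratios measure how much the policy has moved from the reference.
    preferred_ratio = policy_logp_preferred - ref_logp_preferred
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # beta scales the implicit reward margin; the loss is -log(sigmoid(margin)).
    margin = beta * (preferred_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy favors the preferred response more strongly than the reference model does, and grows when it favors the non-preferred one.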
## When to use DPO

Use DPO in the following scenarios:

- Optimizing for subjective outputs that require alignment with specific human preferences
- Adjusting the model's tone, style, or content characteristics
- Making targeted improvements based on user feedback and error analysis
- Maintaining consistent output quality across different use cases
- Training with reward-free reinforcement learning using only preference data
## Supported models and techniques

DPO supports both full-parameter fine-tuning and LoRA (Low-Rank Adaptation):
| Model | Supported inputs | Instance type | Recommended instance count | Allowed instance count |
|---|---|---|---|---|
| Amazon Nova Micro | Text | ml.p5.48xlarge | 2 | 2, 4, 8 |
| Amazon Nova Lite | Text, image | ml.p5.48xlarge | 4 | 2, 4, 8, 16 |
| Amazon Nova Pro | Text, image | ml.p5.48xlarge | 6 | 6, 12, 24 |
### Training approaches

**Full-rank DPO**: Updates all model parameters. Potentially delivers better alignment quality, but requires more compute resources and produces larger model artifacts.

**LoRA DPO**: Trains lightweight adapters for parameter-efficient fine-tuning. Offers faster training and deployment with smaller output artifacts while maintaining good alignment quality.
For most use cases, the LoRA approach provides sufficient adaptation capability with significantly improved efficiency.
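To make the efficiency trade-off concrete, here is a back-of-the-envelope comparison of trainable parameter counts for a single weight matrix. The 4096×4096 projection size and rank 16 are hypothetical values chosen for illustration, not Nova's actual dimensions.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a LoRA adapter adds to one weight matrix:
    a d_in x rank down-projection plus a rank x d_out up-projection."""
    return d_in * rank + rank * d_out

# Hypothetical 4096 x 4096 projection matrix:
full = 4096 * 4096                         # params updated by full-rank tuning
lora = lora_param_count(4096, 4096, 16)    # params updated at rank 16
print(f"LoRA trains {lora:,} of {full:,} params ({lora / full:.2%})")
```

At these (hypothetical) dimensions LoRA updates well under 1% of the matrix's parameters, which is where the training and artifact-size savings come from.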
## Data format

DPO training data follows the same format as SFT, except that the last assistant turn must contain a preference pair with `preferred` and `non-preferred` labels.
### Basic structure

The final assistant turn uses a `candidates` array instead of `content`:

```json
{
  "role": "assistant",
  "candidates": [
    {
      "content": [{ "text": "This is the preferred response." }],
      "preferenceLabel": "preferred"
    },
    {
      "content": [{ "text": "This is the non-preferred response." }],
      "preferenceLabel": "non-preferred"
    }
  ]
}
```
### Complete text example

```json
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [{ "text": "You are a helpful assistant." }],
  "messages": [
    {
      "role": "user",
      "content": [{ "text": "What is the capital of France?" }]
    },
    {
      "role": "assistant",
      "content": [{ "text": "The capital of France is Paris." }]
    },
    {
      "role": "user",
      "content": [{ "text": "Tell me more about it." }]
    },
    {
      "role": "assistant",
      "candidates": [
        {
          "content": [
            { "text": "Paris is the capital and largest city of France, known for the Eiffel Tower, world-class museums like the Louvre, and its rich cultural heritage." }
          ],
          "preferenceLabel": "preferred"
        },
        {
          "content": [{ "text": "Paris is a city in France." }],
          "preferenceLabel": "non-preferred"
        }
      ]
    }
  ]
}
```
### Example with images

```json
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [{ "text": "You are a helpful assistant." }],
  "messages": [
    {
      "role": "user",
      "content": [
        { "text": "Describe this image." },
        {
          "image": {
            "format": "jpeg",
            "source": {
              "s3Location": {
                "uri": "s3://your-bucket/your-path/image.jpg",
                "bucketOwner": "your-aws-account-id"
              }
            }
          }
        }
      ]
    },
    {
      "role": "assistant",
      "candidates": [
        {
          "content": [
            { "text": "The image shows a detailed description with relevant context and observations." }
          ],
          "preferenceLabel": "preferred"
        },
        {
          "content": [{ "text": "This is a picture." }],
          "preferenceLabel": "non-preferred"
        }
      ]
    }
  ]
}
```
## Dataset requirements

- **Format**: Single JSONL file for training and, optionally, a single JSONL file for validation
- **Minimum size**: 1,000 preference pairs recommended for effective training
- **Quality**: High-quality preference data produces more effective results
- **Other constraints**: Same as SFT. For more information, see Dataset constraints.
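Before uploading, it can help to sanity-check each JSONL record against the preference-pair rules above. This is a minimal validation sketch (our own helper, not an official tool); it checks only the final-turn structure, not the full schema.

```python
import json

def validate_record(line: str) -> list:
    """Check one JSONL line against the DPO preference-pair rules.

    Returns a list of problems; an empty list means the record looks valid.
    """
    errors = []
    record = json.loads(line)
    messages = record.get("messages", [])
    if not messages or messages[-1].get("role") != "assistant":
        errors.append("last message must be an assistant turn")
        return errors
    candidates = messages[-1].get("candidates")
    if not isinstance(candidates, list) or len(candidates) != 2:
        errors.append("last assistant turn needs exactly two candidates")
        return errors
    labels = [c.get("preferenceLabel") for c in candidates]
    if sorted(labels, key=str) != ["non-preferred", "preferred"]:
        errors.append(f"labels must be preferred/non-preferred, got {labels}")
    return errors
```

Running every line of the training file through a check like this before upload is cheaper than discovering a malformed record after the job starts.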
## Uploading data

```shell
aws s3 cp /path/to/training-data/ s3://your-bucket/train/ --recursive
aws s3 cp /path/to/validation-data/ s3://your-bucket/val/ --recursive
```
## Recipe configuration

### General run configuration

```yaml
run:
  name: "my-dpo-run"
  model_type: "amazon.nova-lite-v1:0:300k"
  model_name_or_path: "nova-lite/prod"
  replicas: 4
```
| Parameter | Description |
|---|---|
| `name` | Descriptive name for your training job |
| `model_type` | Nova model variant (do not modify) |
| `model_name_or_path` | Base model path (do not modify) |
| `replicas` | Number of compute instances for distributed training |
### Training configuration

```yaml
training_config:
  max_length: 16384
  global_batch_size: 32
  trainer:
    max_epochs: 3
  model:
    hidden_dropout: 0.0
    attention_dropout: 0.0
    ffn_dropout: 0.0
```
| Parameter | Description | Range |
|---|---|---|
| `max_length` | Maximum sequence length in tokens | 1024–32768 |
| `global_batch_size` | Samples per optimizer step | 16, 32, 64, 128 (all models); 256 (Micro and Lite only) |
| `max_epochs` | Training passes through the dataset | Min: 1 |
| `hidden_dropout` | Dropout for hidden states | 0.0–1.0 |
| `attention_dropout` | Dropout for attention weights | 0.0–1.0 |
| `ffn_dropout` | Dropout for feed-forward layers | 0.0–1.0 |
### Optimizer configuration

```yaml
model:
  optim:
    lr: 1e-5
    name: distributed_fused_adam
    adam_w_mode: true
    eps: 1e-08
    weight_decay: 0.0
    betas:
      - 0.9
      - 0.999
    sched:
      warmup_steps: 10
      constant_steps: 0
      min_lr: 1e-6
```
| Parameter | Description | Range |
|---|---|---|
| `lr` | Learning rate | 0–1 (typically 1e-6 to 1e-4) |
| `weight_decay` | L2 regularization strength | 0.0–1.0 |
| `warmup_steps` | Steps over which the learning rate ramps up | 0–20 |
| `min_lr` | Minimum learning rate at the end of decay | 0–1 (must be less than `lr`) |
### DPO-specific configuration

```yaml
model:
  dpo_cfg:
    beta: 0.1
```
| Parameter | Description | Range |
|---|---|---|
| `beta` | Balance between fitting the preference data and staying close to the reference model | 0.001–0.5 |
- **Higher beta (0.1)**: Preserves more reference-model behavior but may learn preferences more slowly.
- **Lower beta (0.01–0.05)**: More aggressive preference learning, but risks divergence from the reference model.

Recommendation: Start with `beta: 0.1` and adjust downward if preference learning seems insufficient.
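To build intuition for this knob, the snippet below evaluates the per-pair DPO loss, `-log(sigmoid(beta * margin))`, at several beta values for one fixed log-ratio margin. It illustrates the math only, not service behavior; the margin of 4 nats is an arbitrary example.

```python
import math

def pair_loss(margin: float, beta: float) -> float:
    """Per-pair DPO loss -log(sigmoid(beta * margin)), where margin is how
    much more the policy favors the preferred response than the reference
    model does (difference of log-probability ratios)."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The same policy shift (margin = 4 nats) satisfies a high beta far more
# than a low one, so low-beta training keeps pushing the policy further
# away from the reference model before the loss flattens out.
for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta}: loss = {pair_loss(4.0, beta):.3f}")
```

This is why a lower beta learns preferences more aggressively while a higher beta stays closer to the reference model's behavior.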
### LoRA PEFT configuration

```yaml
model:
  peft:
    peft_scheme: "lora"
    lora_tuning:
      loraplus_lr_ratio: 64.0
      alpha: 32
      adapter_dropout: 0.01
```
| Parameter | Description | Allowed values |
|---|---|---|
| `peft_scheme` | Fine-tuning method | `"lora"` or `null` (full-rank) |
| `alpha` | Scaling factor for LoRA weights | 32, 64, 96, 128, 160, 192 |
| `loraplus_lr_ratio` | LoRA+ learning rate scaling factor | 0.0–100.0 |
| `adapter_dropout` | Regularization for LoRA parameters | 0.0–1.0 |
## Starting a training job

### Container image

```
708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest
```
### Example code

```python
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

instance_type = "ml.p5.48xlarge"
instance_count = 4
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest"

recipe_overrides = {
    "training_config": {
        "trainer": {"max_epochs": 2},
        "model": {
            "dpo_cfg": {"beta": 0.1},
            "peft": {
                "peft_scheme": "lora",
                "lora_tuning": {
                    "loraplus_lr_ratio": 64.0,
                    "alpha": 32,
                    "adapter_dropout": 0.01,
                },
            },
        },
    },
}

estimator = PyTorch(
    output_path=f"s3://{bucket_name}/{job_name}",
    base_job_name=job_name,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe="fine-tuning/nova/nova_lite_p5_gpu_lora_dpo",
    recipe_overrides=recipe_overrides,
    max_run=18000,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

train_input = TrainingInput(
    s3_data=train_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)
val_input = TrainingInput(
    s3_data=val_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

estimator.fit(inputs={"train": train_input, "validation": val_input}, wait=True)
```
## Deploying the model

After training completes, deploy the customized model to Amazon Bedrock using the Custom Model Import functionality. Deployed models support provisioned throughput inference; LoRA-trained models also support on-demand inference.
For deployment instructions, see Deploying customized models.
## Limitations

- **Input modalities**: DPO accepts text and images only. Video input is not supported.
- **Output modality**: Text only
- **Preference pairs**: The final assistant turn must contain exactly two candidates with `preferred` and `non-preferred` labels
- **Image limit**: Maximum of 10 images per content block
- **Mixed modalities**: Text, image, and video cannot be combined in the same training job