Direct Preference Optimization (DPO) - Amazon SageMaker AI

Direct Preference Optimization (DPO)

DPO is an advanced technique that fine-tunes models based on human preferences rather than fixed labels. It uses paired examples where humans have indicated which of two responses is better for a given prompt. The model learns to generate outputs that align with these preferences, helping to improve response quality, reduce harmful outputs, and better align with human values. DPO is particularly valuable for refining model behavior after initial supervised fine-tuning (SFT).
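To make the idea concrete, the following is a minimal sketch of the standard DPO loss computed from paired preference data. It is not part of the SageMaker API; the function name and the example log-probabilities are hypothetical, and it assumes you have already obtained per-sequence log-probabilities from the policy being trained and from a frozen reference model (typically the SFT checkpoint).

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The loss falls when the policy raises the likelihood of the
    preferred (chosen) response, relative to the reference model,
    more than it raises the likelihood of the rejected response.
    beta controls how strongly the policy is penalized for drifting
    from the reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities: a policy that favors the chosen
# response more than the reference does incurs a lower loss.
better = dpo_loss(-4.0, -9.0, -5.0, -8.0)  # positive margin
worse = dpo_loss(-6.0, -7.0, -5.0, -8.0)   # negative margin
```

In practice the log-probabilities are sums of token log-probabilities over each response, and the loss is averaged over a batch of preference pairs; the sketch above only shows the per-pair objective.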

For detailed instructions on using DPO with Amazon Nova model customization, see the Direct Preference Optimization (DPO) section in the Amazon Nova User Guide.