Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a reinforcement learning technique that uses several machine learning models working together to train and improve a language model. The PPO process involves five key components:

  • The Actor Train Model (or policy model) is a supervised fine-tuned model that undergoes continuous updates during each training epoch. These updates are carefully controlled using a clipped-surrogate objective that limits how much the model can change at each step, ensuring training stability by keeping policy updates "proximal" to previous versions (see the sketch after this list).

  • The Actor Generation Model produces responses to prompts that are then evaluated by other models in the system. This model's weights are synchronized with the Actor Train Model at the beginning of each epoch.

  • The Reward Model has fixed (frozen) weights and assigns scores to the outputs created by the Actor Generation Model, providing feedback on response quality.

  • The Critic Model has trainable weights and evaluates the Actor Generation Model's outputs, estimating the total reward the actor might receive for generating the remaining tokens in a sequence.

  • The Anchor Model is a frozen supervised fine-tuned model that helps calculate the Kullback-Leibler (KL) divergence between the Actor Train Model and the original base model. This component prevents the Actor Train Model from deviating too drastically from the base model's behavior, which could cause instability or performance issues. One common way this penalty enters the objective is shown in the sketch after this list.
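
The clipped-surrogate objective and the KL penalty can be summarized with the standard formulation from the PPO literature. This is a general sketch for orientation, not the exact objective used internally by SageMaker AI; the clip range ε and the KL coefficient β are generic hyperparameters.

```latex
% Probability ratio between the updated policy and the policy that generated the data
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% Clipped-surrogate objective: the update is limited to the interval [1-\epsilon, 1+\epsilon]
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\left( r_t(\theta)\,\hat{A}_t,\;
  \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t \right) \right]

% One common way to apply the anchor: subtract a KL penalty from the reward-model score
R = r_{\phi}(x, y) - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\Vert\, \pi_{\mathrm{anchor}} \right)
```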

Together, these components create a sophisticated reinforcement learning system that can optimize language model outputs based on defined reward criteria while maintaining stable training dynamics.
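
To make the data flow concrete, the following is a minimal, self-contained PyTorch sketch of a single PPO update step. Every model definition, shape, and hyperparameter is a toy placeholder invented for illustration; none of it corresponds to SageMaker AI or Amazon Nova APIs, and real implementations add details such as GAE advantages, minibatching, and distributed weight synchronization.

```python
# Toy sketch of one PPO update step showing how the five components interact.
# All names and shapes are placeholders for illustration, not SageMaker APIs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 32, 6
eps, beta = 0.2, 0.05  # clip range and KL coefficient (illustrative values)

# Stand-ins for the five components (tiny linear "models" over one-hot tokens).
actor_train = torch.nn.Linear(vocab, vocab)                    # trainable policy
actor_gen = torch.nn.Linear(vocab, vocab)                      # rollout copy of the policy
anchor = torch.nn.Linear(vocab, vocab).requires_grad_(False)   # frozen SFT reference
critic = torch.nn.Linear(vocab, 1)                             # trainable value head
reward_model = lambda resp: resp.float().mean()                # frozen scorer (placeholder)

# 1. Sync the generation model with the train model at the start of the epoch.
actor_gen.load_state_dict(actor_train.state_dict())

# 2. Generate a response with the (synced) generation model.
states = F.one_hot(torch.randint(vocab, (seq_len,)), vocab).float()
with torch.no_grad():
    old_logits = actor_gen(states)
    actions = torch.distributions.Categorical(logits=old_logits).sample()
    old_logp = F.log_softmax(old_logits, -1).gather(1, actions[:, None]).squeeze(1)

# 3. Reward model scores the response; critic estimates per-token values.
reward = reward_model(actions)           # scalar response-level score
values = critic(states).squeeze(-1)      # per-token value estimates
advantages = reward - values.detach()    # simplified: real PPO uses GAE

# 4. Clipped-surrogate policy loss, value loss, and KL penalty vs. the anchor.
logits = actor_train(states)
logp = F.log_softmax(logits, -1).gather(1, actions[:, None]).squeeze(1)
ratio = (logp - old_logp).exp()
policy_loss = -torch.min(ratio * advantages,
                         ratio.clamp(1 - eps, 1 + eps) * advantages).mean()
kl = F.kl_div(F.log_softmax(anchor(states), -1),       # reference distribution
              F.log_softmax(logits, -1),               # current policy
              log_target=True, reduction="batchmean")  # KL(policy || anchor)
value_loss = F.mse_loss(values, torch.full_like(values, reward.item()))

loss = policy_loss + beta * kl + 0.5 * value_loss
loss.backward()  # gradients reach only actor_train and critic; anchor stays frozen
print(f"policy={policy_loss.item():.3f}  kl={kl.item():.3f}  value={value_loss.item():.3f}")
```

Freezing the anchor and reward models in the sketch mirrors the component descriptions above: only the Actor Train Model and the Critic Model receive gradient updates, while the Actor Generation Model is refreshed by copying weights at the start of each epoch.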

For detailed instructions about using PPO with Amazon Nova model customization, see the Proximal Policy Optimization (PPO) section in the Amazon Nova User Guide.