Using the HyperPod training operator
The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly.
The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly detect and recover from faulty training states. These monitoring and recovery capabilities work together to maintain training performance while minimizing operational overhead.
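As a rough illustration, a job submitted to the training operator is a Kubernetes custom resource whose spec carries the retry/recovery policy. The sketch below is illustrative only: the image name and several field names are assumptions for the example, not the operator's confirmed schema, so check your installed CRD (`kubectl explain hyperpodpytorchjob.spec`) before relying on any field.

```yaml
# Hypothetical sketch of a training job managed by the HyperPod training
# operator. Field names and values here are illustrative assumptions.
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: llm-pretrain
spec:
  nprocPerNode: "8"          # processes (GPUs) per node
  runPolicy:
    jobMaxRetryCount: 3      # recover failed processes up to 3 times
  replicaSpecs:
    - name: pods
      replicas: 16           # number of worker nodes
      template:
        spec:
          containers:
            - name: trainer
              image: <your-training-image>   # placeholder
              command: ["torchrun", "train.py"]
```

Because the recovery policy lives in the spec rather than in training code, tuning it (for example, raising the retry count) is a YAML change applied with `kubectl apply`, with no rebuild of the training container.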
While Kueue is not required for this training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. For more information, see the official Kueue documentation.
Note
To use the training operator, you must use the latest HyperPod AMI release. To upgrade, use the UpdateClusterSoftware API operation. If you use HyperPod task governance, it must also be the latest version.
Supported versions
The HyperPod training operator works only with specific versions of Kubernetes, Kueue, and HyperPod. The list below covers the compatible versions.
- Supported Kubernetes versions – 1.28, 1.29, 1.30, 1.31, or 1.32
- The latest HyperPod AMI release. To upgrade to the latest AMI release, use the UpdateClusterSoftware API operation.
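The UpdateClusterSoftware operation can also be invoked from the AWS CLI. In this sketch, `my-cluster` is a placeholder for your cluster's name or ARN; the commands require configured AWS credentials with SageMaker permissions.

```shell
# Trigger a software (AMI) update for an existing HyperPod cluster.
# "my-cluster" is a placeholder; substitute your cluster's name or ARN.
aws sagemaker update-cluster-software --cluster-name my-cluster

# Poll the cluster status while the update proceeds.
aws sagemaker describe-cluster --cluster-name my-cluster \
  --query 'ClusterStatus'
```

The update is applied by the service; monitor the cluster status until it returns to `InService` before submitting new training jobs.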