Installing the training operator
See the following sections to learn how to install the training operator.
Prerequisites
Before you use the HyperPod training operator, you must have completed the following prerequisites:
- Installed the latest AMI on your HyperPod cluster. For more information, see SageMaker HyperPod AMI releases for Amazon EKS.
- Set up the EKS Pod Identity Agent using the console. If you want to use the AWS CLI, use the following command (you can verify the add-on with the example that follows this list):

  aws eks create-addon \
    --cluster-name my-eks-cluster \
    --addon-name eks-pod-identity-agent \
    --region AWS Region
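To confirm that the Pod Identity Agent add-on is active before you continue, you can describe it with the AWS CLI. This is a minimal, optional check that reuses the cluster name and Region placeholders from the command above.

  # Confirm that the EKS Pod Identity Agent add-on is installed and active
  aws eks describe-addon \
    --cluster-name my-eks-cluster \
    --addon-name eks-pod-identity-agent \
    --region AWS Region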
Installing the training operator
You can now install the HyperPod training operator through the SageMaker AI console, the Amazon EKS console, or with the AWS CLI. The console methods offer simplified experiences that help you install the operator. The AWS CLI offers a programmatic approach that lets you customize more of your installation.
Between the two console experiences, SageMaker AI provides a one-click installation that creates the IAM execution role, creates the pod identity association, and installs the operator. The Amazon EKS console installation is similar, but it doesn't automatically create the IAM execution role; during the process, you can choose to create a new IAM execution role with information that the console pre-populates. By default, these created roles only have access to the cluster that you're installing the operator in. If you remove and reinstall the operator, you must create a new role unless you edit the role's permissions to include other clusters.
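If you use the AWS CLI, one common path is to install the operator as an Amazon EKS add-on. The following is a minimal sketch under that assumption, not a definitive command: the add-on name shown here is an assumption, so confirm the exact name available for your cluster (for example, with aws eks describe-addon-versions) before you run it.

  # Install the HyperPod training operator as an EKS add-on
  # (the add-on name below is an assumption; verify it for your cluster)
  aws eks create-addon \
    --cluster-name my-eks-cluster \
    --addon-name amazon-sagemaker-hyperpod-training-operator \
    --region AWS Region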
The training operator comes with a number of options whose default values might fit your use case. We recommend that you try the training operator with the default values before changing them. The following table describes each parameter, what it controls, and its default value. An example of overriding these values follows the table.
Parameter | Description | Default |
---|---|---|
hpTrainingControllerManager.manager.resources.requests.cpu | How many processors to allocate for the controller | 1 |
hpTrainingControllerManager.manager.resources.requests.memory | How much memory to allocate to the controller | 2Gi |
hpTrainingControllerManager.manager.resources.limits.cpu | The CPU limit for the controller | 2 |
hpTrainingControllerManager.manager.resources.limits.memory | The memory limit for the controller | 4Gi |
hpTrainingControllerManager.nodeSelector | Node selector for the controller pods | Default behavior is to select nodes with the label sagemaker.amazonaws.com/compute-type: "HyperPod" |
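If you install the operator from its Helm chart, you can override these defaults at install time with --set flags. The following is a minimal sketch under that assumption; the release name and chart path are placeholders, and only the parameter names come from the table above.

  # Override controller resources at install time
  # (release name and chart path are placeholders)
  helm install hyperpod-training-operator ./helm_chart \
    --set hpTrainingControllerManager.manager.resources.requests.cpu=2 \
    --set hpTrainingControllerManager.manager.resources.requests.memory=4Gi \
    --set hpTrainingControllerManager.manager.resources.limits.cpu=4 \
    --set hpTrainingControllerManager.manager.resources.limits.memory=8Gi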
HyperPod elastic agent
The HyperPod elastic agent is an extension of PyTorch's ElasticAgent. Install it in your training image and use hyperpodrun to create the job launcher, as shown in the following example.
  RUN pip install hyperpod-elastic-agent
  ENTRYPOINT ["entrypoint.sh"]

  # entrypoint.sh
  ...
  hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
    --rdzv-backend hyperpod \          # Optional
    ...                                # Other torchrun args
    # pre-train arg_group
    --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
    # post-train arg_group
    --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
    training.py --script-args
You can now submit jobs with kubectl.
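For example, a minimal submit-and-monitor flow looks like the following sketch. The manifest file name is a placeholder, and the hyperpodpytorchjobs resource name assumes the HyperPodPytorchJob custom resource shown later in this topic.

  # Submit a HyperPodPytorchJob manifest and watch its resources
  kubectl apply -f my-training-job.yaml
  kubectl get hyperpodpytorchjobs
  kubectl get pods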
HyperPod elastic agent arguments
The HyperPod elastic agent supports all of the original torchrun arguments and adds the additional arguments listed in the following table. For more information about PyTorch's Elastic Agent, see the official PyTorch documentation. An example that uses the shutdown arguments follows the table.
Argument | Description | Default Value |
---|---|---|
--shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" |
--shutdown-timeout | Timeout in seconds between SIGTERM and SIGKILL signals | 30 |
--server-host | Agent server address | "0.0.0.0" |
--server-port | Agent server port | 8080 |
--server-log-level | Agent server log level | "info" |
--server-shutdown-timeout | Server shutdown timeout in seconds | 300 |
--pre-train-script | Path to pre-training script | None |
--pre-train-args | Arguments for pre-training script | None |
--post-train-script | Path to post-training script | None |
--post-train-args | Arguments for post-training script | None |
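For example, the following sketch extends the earlier hyperpodrun launch with a gentler shutdown sequence. The node_count, proc_count, and training.py values are placeholders carried over from the previous example.

  # Ask workers to stop with SIGTERM, escalating to SIGKILL after 60 seconds
  hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
    --rdzv-backend hyperpod \
    --shutdown-signal SIGTERM \
    --shutdown-timeout 60 \
    training.py --script-args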
Task governance (optional)
The training operator is integrated with HyperPod task governance, a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. To set up HyperPod task governance, see Setup for SageMaker HyperPod task governance.
Note
When installing the HyperPod task governance add-on, you must use version v1.3.0-eksbuild.1 or higher.
When submitting a job, make sure that you include your queue name and priority class labels of hyperpod-ns-team-name-localqueue and priority-class-name-priority. For example, if you're using Kueue, your labels become the following:

- kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
- kueue.x-k8s.io/priority-class: priority-class-name-priority
The following is an example of what your configuration file might look like:
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPytorchJob
metadata:
  name: hp-task-governance-sample
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: priority-class-priority
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs:
  - name: pods
    replicas: 4
    spares: 2
    template:
      spec:
        containers:
        - name: ptjob
          image: XXXX
          imagePullPolicy: Always
          ports:
          - containerPort: 8080
          resources:
            requests:
              cpu: "2"
Then use the following kubectl command to apply the YAML file.
kubectl apply -f task-governance-job.yaml
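After you apply the job, task governance admits it through Kueue, so you can confirm that the workload was queued and admitted. The following is a minimal sketch that uses the namespace and job name from the example above; team-name remains a placeholder.

  # Confirm that the job's workload was admitted by its local queue
  kubectl get workloads -n hyperpod-ns-team-name
  kubectl get hyperpodpytorchjob hp-task-governance-sample -n hyperpod-ns-team-name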
Kueue (optional)
While you can run jobs directly, your organization can also integrate the training operator with Kueue to allocate resources and schedule jobs. Follow the steps below to install Kueue into your HyperPod cluster.
- Follow the installation guide in the official Kueue documentation. When you reach the step of configuring controller_manager_config.yaml, add the following configuration:

  externalFrameworks:
  - "HyperPodPytorchJob.v1.sagemaker.amazonaws.com"
- Follow the rest of the steps in the official installation guide. After you finish installing Kueue, you can create some sample queues with the kubectl apply -f sample-queues.yaml command. Use the following YAML file (you can verify the created resources with the example that follows these steps).

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources:
    - cpu
    - nvidia.com/gpu
    - pods
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 16
      - name: nvidia.com/gpu
        nominalQuota: 16
      - name: pods
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
description: High priority
kind: WorkloadPriorityClass
metadata:
  name: high-priority-class
value: 1000
---
apiVersion: kueue.x-k8s.io/v1beta1
description: Low Priority
kind: WorkloadPriorityClass
metadata:
  name: low-priority-class
value: 500
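After you apply the sample file, you can confirm that the queues and priority classes were created. This is a minimal sketch that uses the resource names from the YAML above.

  # Verify the sample Kueue resources
  kubectl get clusterqueue cluster-queue
  kubectl get localqueue user-queue -n default
  kubectl get workloadpriorityclass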