Getting started with SageMaker HyperPod using the AWS CLI
Create your first SageMaker HyperPod cluster using the AWS CLI commands for HyperPod.
Create your first SageMaker HyperPod cluster with Slurm
The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the AWS CLI commands for SageMaker HyperPod. Following the tutorial, you'll create a HyperPod cluster with three instance groups (one Slurm node each): my-controller-group, my-login-group, and worker-group-1.
With the API-driven configuration approach, you define Slurm node types and
partition assignments directly in the CreateCluster API request using
SlurmConfig. This eliminates the need for a separate
provisioning_parameters.json file and provides built-in validation,
drift detection, and per-instance-group FSx configuration.
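Because SlurmConfig and Orchestrator.Slurm are newer API fields, it can be worth confirming that your installed AWS CLI already models them before you build the request file. A minimal, optional sanity check (this assumes AWS CLI v2; the grep pattern is only illustrative):

```
# Show the installed AWS CLI version
aws --version

# Search the create-cluster help text for the Slurm-related fields;
# if nothing is printed, update the AWS CLI before continuing
AWS_PAGER="" aws sagemaker create-cluster help | grep -i slurm
```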
1. Prepare and upload lifecycle scripts to an Amazon S3 bucket. During cluster creation, HyperPod runs them in each instance group. Upload the lifecycle scripts to Amazon S3 using the following command.
```
aws s3 sync \
  ~/local-dir-to-lifecycle-scripts/* \
  s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
```

Note

The S3 bucket path should start with the prefix sagemaker-, because the IAM role for SageMaker HyperPod with AmazonSageMakerClusterInstanceRolePolicy only allows access to Amazon S3 buckets that start with that prefix.

If you are starting from scratch, use the sample lifecycle scripts provided in the Awsome Distributed Training GitHub repository. The following sub-steps show how to download and upload the sample lifecycle scripts to an Amazon S3 bucket.
a. Download a copy of the lifecycle script samples to a directory on your local computer.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
```
b. Go into the directory 1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config, where you can find a set of lifecycle scripts.

```
cd awsome-distributed-training/1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config
```

To learn more about the lifecycle script samples, see Customizing SageMaker HyperPod clusters using lifecycle scripts.
c. Upload the scripts to s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src. You can do so by using the Amazon S3 console, or by running the following AWS CLI Amazon S3 command.

```
aws s3 sync \
  ~/local-dir-to-lifecycle-scripts/* \
  s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
```
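Optionally, you can confirm that the scripts landed in the expected prefix before moving on. This is just a sanity check that uses the same placeholder bucket path as above:

```
# List the uploaded lifecycle scripts
aws s3 ls s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src/ --recursive
```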
Note

With API-driven configuration, you do not need to create or upload a provisioning_parameters.json file. The Slurm configuration is defined directly in the CreateCluster API request in the next step.
2. Prepare a CreateCluster request file in JSON format and save it as create_cluster.json.

With API-driven configuration, you specify the Slurm node type and partition assignment for each instance group using the SlurmConfig field. You also configure the cluster-level Slurm settings using Orchestrator.Slurm.

For ExecutionRole, provide the ARN of the IAM role you created with the managed AmazonSageMakerClusterInstanceRolePolicy in Prerequisites for using SageMaker HyperPod.

```
{
  "ClusterName": "my-hyperpod-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "my-controller-group",
      "InstanceType": "ml.c5.xlarge",
      "InstanceCount": 1,
      "SlurmConfig": {
        "NodeType": "Controller"
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole",
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
          }
        }
      ]
    },
    {
      "InstanceGroupName": "my-login-group",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceCount": 1,
      "SlurmConfig": {
        "NodeType": "Login"
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.trn1.32xlarge",
      "InstanceCount": 1,
      "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["partition-1"]
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
    }
  ],
  "Orchestrator": {
    "Slurm": {
      "SlurmConfigStrategy": "Managed"
    }
  }
}
```

SlurmConfig fields:
| Field | Description |
| --- | --- |
| NodeType | The Slurm role for the instance group. Valid values: Controller, Login, Compute |
| PartitionNames | The Slurm partition(s) to assign compute nodes to. Only valid for the Compute node type. |

Orchestrator.Slurm fields:

| Field | Description |
| --- | --- |
| SlurmConfigStrategy | Controls how HyperPod manages slurm.conf. Valid values: Managed (default), Overwrite, Merge |

SlurmConfigStrategy options:

- Managed (recommended): HyperPod fully manages slurm.conf and detects unauthorized changes (drift detection). Updates fail if drift is detected.
- Overwrite: HyperPod overwrites slurm.conf on updates, ignoring any manual changes.
- Merge: HyperPod preserves manual changes and merges them with the API configuration.
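If you decide on a non-default strategy, you only need to change the Orchestrator.Slurm.SlurmConfigStrategy value in create_cluster.json before calling create-cluster. As a convenience, a small sketch using jq (this assumes jq is installed; editing the file by hand works just as well):

```
# Switch the strategy to Merge in a copy of the request file
jq '.Orchestrator.Slurm.SlurmConfigStrategy = "Merge"' create_cluster.json > create_cluster_merge.json
```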
Adding FSx for Lustre (optional):
To mount an FSx for Lustre file system to your compute nodes, add FsxLustreConfig to the InstanceStorageConfigs for the instance group. This requires a Custom VPC configuration.

```
{
  "InstanceGroupName": "worker-group-1",
  "InstanceType": "ml.trn1.32xlarge",
  "InstanceCount": 1,
  "SlurmConfig": {
    "NodeType": "Compute",
    "PartitionNames": ["partition-1"]
  },
  "InstanceStorageConfigs": [
    {
      "FsxLustreConfig": {
        "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
        "MountPath": "/fsx",
        "MountName": "abcdefgh"
      }
    }
  ],
  "LifeCycleConfig": {
    "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
    "OnCreate": "on_create.sh"
  },
  "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
}
```
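The DnsName and MountName values come from your existing FSx for Lustre file system. If you need to look them up, something like the following works (the file system ID is a placeholder):

```
# Look up the DNS name and mount name of an FSx for Lustre file system
aws fsx describe-file-systems \
  --file-system-ids fs-0abc123def456789 \
  --query "FileSystems[0].{DnsName: DNSName, MountName: LustreConfiguration.MountName}"
```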
Adding FSx for OpenZFS (optional):

You can also mount FSx for OpenZFS file systems:

```
"InstanceStorageConfigs": [
  {
    "FsxOpenZfsConfig": {
      "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
      "MountPath": "/shared"
    }
  }
]
```

Note
Each instance group can have at most one FSx for Lustre and one FSx for OpenZFS configuration. Different instance groups can mount different filesystems.
Adding VPC configuration (required for FSx):
If using FSx, you must specify a Custom VPC configuration:
{ "ClusterName": "my-hyperpod-cluster", "InstanceGroups": [ { "InstanceGroupName": "my-controller-group", "InstanceType": "ml.c5.xlarge", "InstanceCount": 1, "SlurmConfig": { "NodeType": "Controller" }, "LifeCycleConfig": { "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src", "OnCreate": "on_create.sh" }, "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole" }, ], "Orchestrator": { "Slurm": { "SlurmConfigStrategy": "Managed" } }, "VpcConfig": { "SecurityGroupIds": ["sg-0abc123def456789a"], "Subnets": ["subnet-0abc123def456789a"] } } -
3. Run the following command to create the cluster.

```
aws sagemaker create-cluster \
  --cli-input-json file://complete/path/to/create_cluster.json
```

This should return the ARN of the created cluster.

```
{
  "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"
}
```

If you receive an error due to resource limits, change the instance type to one with sufficient quotas in your account, or request additional quotas by following SageMaker HyperPod quotas.
Common validation errors:
Error Resolution "Cluster must have exactly one InstanceGroup with Controller node type" Ensure exactly one instance group has SlurmConfig.NodeType:"Controller""Partitions can only be assigned to Compute node types" Remove PartitionNamesfromControllerorLogininstance groups"FSx configurations are only supported for Custom VPC" Add VpcConfigto your request when using FSx -
4. Run describe-cluster to check the status of the cluster.

```
aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
```

Example response:

```
{
  "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster",
  "ClusterName": "my-hyperpod-cluster",
  "ClusterStatus": "Creating",
  "InstanceGroups": [
    {
      "InstanceGroupName": "my-controller-group",
      "InstanceType": "ml.c5.xlarge",
      "InstanceCount": 1,
      "CurrentCount": 0,
      "TargetCount": 1,
      "SlurmConfig": {
        "NodeType": "Controller"
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<bucket>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
    },
    {
      "InstanceGroupName": "my-login-group",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceCount": 1,
      "CurrentCount": 0,
      "TargetCount": 1,
      "SlurmConfig": {
        "NodeType": "Login"
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<bucket>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.trn1.32xlarge",
      "InstanceCount": 1,
      "CurrentCount": 0,
      "TargetCount": 1,
      "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["partition-1"]
      },
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-<bucket>/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
    }
  ],
  "Orchestrator": {
    "Slurm": {
      "SlurmConfigStrategy": "Managed"
    }
  },
  "CreationTime": "2024-01-15T10:30:00Z"
}
```

After the status of the cluster turns to InService, proceed to the next step. Cluster creation typically takes 10-15 minutes.
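While you wait, you can poll just the status field instead of reading the full response each time; a minimal sketch using the CLI's --query option:

```
# Print only the cluster status every 30 seconds until you stop it (Ctrl+C)
while true; do
  aws sagemaker describe-cluster \
    --cluster-name my-hyperpod-cluster \
    --query ClusterStatus \
    --output text
  sleep 30
done
```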
5. Run list-cluster-nodes to check the details of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster
```

Example response:

```
{
  "ClusterNodeSummaries": [
    {
      "InstanceGroupName": "my-controller-group",
      "InstanceId": "i-0abc123def456789a",
      "InstanceType": "ml.c5.xlarge",
      "InstanceStatus": {
        "Status": "Running",
        "Message": ""
      },
      "LaunchTime": "2024-01-15T10:35:00Z"
    },
    {
      "InstanceGroupName": "my-login-group",
      "InstanceId": "i-0abc123def456789b",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceStatus": {
        "Status": "Running",
        "Message": ""
      },
      "LaunchTime": "2024-01-15T10:35:00Z"
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceId": "i-0abc123def456789c",
      "InstanceType": "ml.trn1.32xlarge",
      "InstanceStatus": {
        "Status": "Running",
        "Message": ""
      },
      "LaunchTime": "2024-01-15T10:36:00Z"
    }
  ]
}
```

The InstanceId is what your cluster users need to log in to the nodes (through aws ssm). For more information about logging in to the cluster nodes and running ML workloads, see Jobs on SageMaker HyperPod clusters.
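To pull out just the group names and instance IDs for the SSM targets, a --query sketch like the following can help:

```
# Tabulate instance group names and instance IDs
aws sagemaker list-cluster-nodes \
  --cluster-name my-hyperpod-cluster \
  --query "ClusterNodeSummaries[].{Group: InstanceGroupName, Id: InstanceId}" \
  --output table
```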
6. Connect to your cluster using AWS Systems Manager Session Manager.

```
aws ssm start-session \
  --target sagemaker-cluster:my-hyperpod-cluster_my-login-group-i-0abc123def456789b \
  --region us-west-2
```

Once connected, verify that Slurm is configured correctly:

```
# Check Slurm nodes
sinfo

# Check Slurm partitions
sinfo -p partition-1

# Submit a test job
srun -p partition-1 --nodes=1 hostname
```
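If you want to go one step beyond srun, you can submit a small batch job and watch it in the queue. This is a generic Slurm sketch; the script and output file names are placeholders:

```
# Write a minimal batch script
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=partition-1
#SBATCH --nodes=1
#SBATCH --output=hello_%j.out
srun hostname
EOF

# Submit it and check the queue
sbatch hello.sbatch
squeue
```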
Delete the cluster and clean resources
After you have successfully tested creating a SageMaker HyperPod cluster, it continues
running in the InService state until you delete the cluster. We
recommend that you delete any clusters created using on-demand SageMaker AI capacity when
not in use to avoid incurring continued service charges based on on-demand pricing.
In this tutorial, you have created a cluster that consists of three instance groups.
Make sure you delete the cluster by running the following command.
```
aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster
```
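To confirm the cluster is gone, you can list clusters by name; once deletion finishes, the result should be empty:

```
# Check whether the cluster still appears
aws sagemaker list-clusters --name-contains my-hyperpod-cluster
```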
To clean up the lifecycle scripts from the Amazon S3 bucket used for this tutorial, go to the Amazon S3 bucket you used during cluster creation and remove the files entirely.
```
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src --recursive
```
If you have tested running any model training workloads on the cluster, also check if you have uploaded any data or if your job has saved any artifacts to different Amazon S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.
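If you're not sure whether any file systems were left behind, you can list them from the CLI and delete what you no longer need:

```
# List FSx file systems (covers FSx for Lustre and FSx for OpenZFS)
aws fsx describe-file-systems \
  --query "FileSystems[].{Id: FileSystemId, Type: FileSystemType}" \
  --output table

# List Amazon EFS file systems
aws efs describe-file-systems \
  --query "FileSystems[].{Id: FileSystemId, Name: Name}" \
  --output table
```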