Creating a SageMaker HyperPod cluster
Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.
Before creating a SageMaker HyperPod cluster:

- Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see Create an Amazon EKS cluster in the Amazon EKS User Guide.
- Install the Helm chart as instructed in Installing packages on the Amazon EKS cluster using Helm. If you create a HyperPod EKS cluster with a restricted instance group (RIG), you need a separate Helm chart. Example commands for these prerequisites follow this list.
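The following is a minimal sketch of these two prerequisites using the AWS CLI, kubectl, and Helm. The cluster name, release name, and chart location are assumptions; at the time of writing the standard HyperPod Helm chart is distributed in the sagemaker-hyperpod-cli GitHub repository, but follow the exact repository and chart path given in Installing packages on the Amazon EKS cluster using Helm.

# Confirm the EKS cluster exists and is ACTIVE (my-eks-cluster is a placeholder).
aws eks describe-cluster --name my-eks-cluster --query "cluster.status"

# Point kubectl at the cluster so Helm can install into it.
aws eks update-kubeconfig --name my-eks-cluster

# Install the standard HyperPod dependencies chart (repository and chart path are assumptions;
# confirm them against the Helm installation instructions linked above).
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart
helm dependency update HyperPodHelmChart
helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system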
1. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/. For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository and upload it to the S3 bucket, as shown in the example below. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
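For example, after downloading on_create.sh locally, an upload with the AWS CLI might look like the following. The bucket name and prefix are placeholders; use the S3 URI that you will later reference in LifeCycleConfig.SourceS3Uri.

# Upload the lifecycle script to the prefix referenced by SourceS3Uri.
aws s3 cp on_create.sh s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/on_create.sh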
Important
If you create an IAM role for SageMaker HyperPod attaching only the managed AmazonSageMakerClusterInstanceRolePolicy, your cluster has access to Amazon S3 buckets with the specific prefix sagemaker-.
If you create a restricted instance group, you don't need to download and run the lifecycle script. Instead, you need to run install_rig_dependencies.sh. The prerequisites to run the install_rig_dependencies.sh script include the following (see the verification commands after this step):

- AWS Node (CNI) and CoreDNS should both be enabled. These are standard EKS add-ons that are not managed by the standard SageMaker HyperPod Helm chart, but they can be easily enabled in the EKS console under Add-ons.
- The standard SageMaker HyperPod Helm chart should be installed before running this script.
The install_rig_dependencies.sh script performs the following actions:

- aws-node (CNI): A new rig-aws-node DaemonSet is created; the existing aws-node DaemonSet is patched to avoid RIG nodes.
- coredns: Converted to a DaemonSet for RIGs to support multi-RIG use and prevent overloading.
- training-operators: Updated with RIG worker taint tolerations and nodeAffinity favoring non-RIG instances.
- Elastic Fabric Adapter (EFA): Updated to tolerate the RIG worker taint and use the correct container images for each Region.
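Before running install_rig_dependencies.sh, you can confirm the prerequisites above with the AWS CLI and kubectl; the cluster name is a placeholder.

# Check which EKS add-ons are installed (look for vpc-cni and coredns).
aws eks list-addons --cluster-name my-eks-cluster

# Confirm the corresponding workloads exist in the kube-system namespace.
kubectl get daemonset aws-node -n kube-system
kubectl get deployment coredns -n kube-system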
2. Prepare a CreateCluster API request file in JSON format. For ExecutionRole, provide the ARN of the IAM role you created with the managed AmazonSageMakerClusterInstanceRolePolicy from the section IAM role for SageMaker HyperPod.

Note
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.
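To reuse the networking of the EKS cluster, you can read its VPC, subnet, and security group IDs from the EKS control plane and copy them into the VpcConfig section of the request file shown below; the cluster name is a placeholder.

# Retrieve the VPC ID, subnet IDs, and security group IDs of the EKS cluster.
# Use the private subnets from the output for VpcConfig.Subnets.
aws eks describe-cluster --name my-eks-cluster --query "cluster.resourcesVpcConfig"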
// create_cluster.json
{
    "ClusterName": "string",
    "InstanceGroups": [{
        "InstanceGroupName": "string",
        "InstanceType": "string",
        "InstanceCount": number,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://<amzn-s3-demo-bucket-sagemaker>/<lifecycle-script-directory>/src/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "string",
        "ThreadsPerCore": number,
        "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }],
    "RestrictedInstanceGroups": [{
        "EnvironmentConfig": {
            "FSxLustreConfig": {
                "PerUnitStorageThroughput": number,
                "SizeInGiB": number
            }
        },
        "ExecutionRole": "string",
        "InstanceCount": number,
        "InstanceGroupName": "string",
        "InstanceStorageConfigs": [{ ... }],
        "InstanceType": "string",
        "OnStartDeepHealthChecks": ["string"],
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["string"],
            "Subnets": ["string"]
        },
        "ScheduledUpdateConfig": {
            "DeploymentConfig": {
                "AutoRollbackConfiguration": [{
                    "AlarmName": "string"
                }],
                "RollingUpdatePolicy": {
                    "MaximumBatchSize": {
                        "Type": "string",
                        "Value": number
                    },
                    "RollbackMaximumBatchSize": {
                        "Type": "string",
                        "Value": number
                    }
                },
                "WaitIntervalInSeconds": number
            },
            "ScheduleExpression": "string"
        },
        "ThreadsPerCore": number,
        "TrainingPlanArn": "string"
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["string"],
        "Subnets": ["string"]
    },
    "Tags": [{
        "Key": "string",
        "Value": "string"
    }],
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "string"
        }
    },
    "NodeRecovery": "Automatic"
}

Note the following when configuring a new SageMaker HyperPod cluster to associate with an EKS cluster. An illustrative request file with placeholder values follows this list.
- You can configure up to 20 instance groups under the InstanceGroups parameter.
- For Orchestrator.Eks.ClusterArn, specify the ARN of the EKS cluster you want to use as the orchestrator.
- For OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks.
- For NodeRecovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
- For the Tags parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see the Tagging AWS Resources User Guide.
- For the VpcConfig parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
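As an illustration only, a filled-in request file for a single (non-restricted) instance group might look like the following. All names, ARNs, IDs, and the instance type are placeholder values; replace them with your own.

// create_cluster.json (illustrative placeholder values)
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
        }
    },
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.g5.8xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-cluster-role",
        "ThreadsPerCore": 1,
        "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"]
    },
    "Tags": [{
        "Key": "team",
        "Value": "ml-research"
    }],
    "NodeRecovery": "Automatic"
}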
3. Run the create-cluster command as follows.

Important
When running the create-cluster command with the --cli-input-json parameter, you must include the file:// prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the file:// prefix results in a parameter parsing error.

aws sagemaker create-cluster \
    --cli-input-json file://complete/path/to/create_cluster.json

This should return the ARN of the new cluster.
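After the request is accepted, you can track provisioning with the describe-cluster and list-cluster-nodes commands; the cluster name is a placeholder.

# The cluster status transitions to InService when provisioning completes.
aws sagemaker describe-cluster --cluster-name ml-cluster --query "ClusterStatus"

# List the instances (nodes) provisioned in the cluster.
aws sagemaker list-cluster-nodes --cluster-name ml-cluster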
Important
You can use the update-cluster operation to remove a restricted instance group (RIG). When a RIG is scaled down to 0, the FSx for Lustre file system won't be deleted. To completely remove the FSx for Lustre file system, you must remove the RIG entirely.
Removing a RIG will not delete any artifacts stored in the service-managed Amazon S3 bucket. However, you should ensure all artifacts in the FSx for Lustre file system are fully synchronized to Amazon S3 before removal. We recommend waiting at least 30 minutes after job completion to ensure complete synchronization of all artifacts from the FSx for Lustre file system to the service-managed Amazon S3 bucket.