Getting started with SageMaker HyperPod using the SageMaker AI console
The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set
it up with Slurm through the SageMaker AI console UI. Following the tutorial, you'll create a
HyperPod cluster with three Slurm nodes, one in each of the instance groups
my-controller-group, my-login-group, and worker-group-1.
Note
HyperPod now supports creating Slurm clusters without lifecycle scripts. You can create a fully functional cluster using AMI-based configuration, extend it with an extension script, or continue using custom lifecycle scripts for full control.
Create cluster
To navigate to the SageMaker HyperPod Clusters page and choose Slurm orchestration, follow these steps.
-
Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.
-
Choose HyperPod Clusters in the left navigation pane and then Cluster Management.
-
On the SageMaker HyperPod Clusters page, choose Create HyperPod cluster.
-
From the Create HyperPod cluster drop-down, choose Orchestrated by Slurm.
-
On the Slurm cluster creation page, you will see two options. Choose the option that best fits your needs.
-
Quick setup - To get started immediately with default settings, choose Quick setup. With this option, SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.
-
Custom setup - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose Custom setup. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.
-
On the Quick setup section, follow these steps to create your HyperPod cluster with Slurm orchestration.
General settings
Specify a name for the new cluster. You can’t change the name after the cluster is created.
Instance groups
To add an instance group, choose Add group. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group for each of the Controller and Compute group types.
Important
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.
Follow these steps to add an instance group.
-
For Instance group type, choose a type for your instance group. For this tutorial, choose Controller (head) for my-controller-group, Login for my-login-group, and Compute (worker) for worker-group-1.
-
For Name, specify a name for the instance group. For this tutorial, create three instance groups named my-controller-group, my-login-group, and worker-group-1.
-
For Instance capacity, choose either on-demand capacity or a training plan to reserve your compute resources.
-
For Instance type, choose the instance type for the instance group. For this tutorial, select ml.c5.xlarge for my-controller-group, ml.m5.4xlarge for my-login-group, and ml.trn1.32xlarge for worker-group-1.
Important
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see SageMaker HyperPod quotas.
-
For Instance quantity, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter 1 for all three groups.
-
For Target Availability Zone, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.
-
For Additional storage volume per instance (GB) - optional, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is /opt/sagemaker. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify that the EBS volume is mounted correctly by running the df -h command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the Amazon EBS volumes section in the Amazon Elastic Block Store User Guide.
-
Choose Add instance group.
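The three instance groups used in this tutorial can also be written down as data and checked before you enter them in the console. The following is a minimal sketch, not a SageMaker API: the class, field, and function names are hypothetical, and the rule that the instance quantity must be at least 1 is an assumption. The storage range and the requirement for a Controller and a Compute group come from the steps above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits quoted in the steps above. All class, field, and function names here
# are illustrative and are not part of any SageMaker API.
MIN_EXTRA_STORAGE_GB = 1
MAX_EXTRA_STORAGE_GB = 16384
GROUP_TYPES = ("Controller", "Login", "Compute")


@dataclass(frozen=True)
class InstanceGroup:
    name: str
    group_type: str
    instance_type: str
    quantity: int
    extra_storage_gb: Optional[int] = None  # optional additional EBS volume

    def problems(self) -> List[str]:
        """Return a list of human-readable problems with this group."""
        found = []
        if self.group_type not in GROUP_TYPES:
            found.append(f"{self.name}: unknown group type {self.group_type!r}")
        if self.quantity < 1:  # assumption: a group needs at least one instance
            found.append(f"{self.name}: instance quantity must be at least 1")
        if self.extra_storage_gb is not None and not (
            MIN_EXTRA_STORAGE_GB <= self.extra_storage_gb <= MAX_EXTRA_STORAGE_GB
        ):
            found.append(f"{self.name}: additional storage must be 1-16384 GB")
        return found


def check_cluster(groups: List[InstanceGroup]) -> List[str]:
    """Collect per-group problems and check the required group types."""
    found = [p for g in groups for p in g.problems()]
    present = {g.group_type for g in groups}
    for required in ("Controller", "Compute"):
        if required not in present:
            found.append(f"cluster needs at least one {required} instance group")
    return found


# The groups created in this tutorial, one instance each.
TUTORIAL_GROUPS = [
    InstanceGroup("my-controller-group", "Controller", "ml.c5.xlarge", 1),
    InstanceGroup("my-login-group", "Login", "ml.m5.4xlarge", 1),
    InstanceGroup("worker-group-1", "Compute", "ml.trn1.32xlarge", 1),
]
```

Calling check_cluster(TUTORIAL_GROUPS) returns an empty list. Quota and IP address availability are not checked here because they depend on your account.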
Quick setup defaults
This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.
Note
Quick setup uses default lifecycle scripts automatically. The new AMI-based configuration option (no lifecycle scripts) is available only through Custom setup. If you want to create a cluster without lifecycle scripts, choose Custom setup and choose None under Lifecycle scripts.
On the Custom setup section, follow these steps to create your HyperPod cluster with Slurm orchestration.
General settings
Specify a name for the new cluster. You can’t change the name after the cluster is created.
For Instance recovery, choose Automatic - recommended or None.
Networking
Configure your network settings for the cluster creation. These settings can't be changed after the cluster is created.
-
For VPC, choose your own VPC if you already have one that SageMaker AI can access. To create a new VPC, follow the instructions at Create a VPC in the Amazon Virtual Private Cloud User Guide. You can leave it as None to use the default SageMaker AI VPC.
-
For VPC IPv4 CIDR block, enter the starting IP of your VPC.
-
For Availability Zones, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.
-
For Security groups, create a security group or choose up to five security groups configured with rules to allow inter-resource communication within the VPC.
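When you plan the CIDR block and subnets, it can help to estimate how many IP addresses a subnet offers, because each instance needs unassigned addresses. The following is a minimal sketch that uses only the Python standard library. The CIDR values in the usage note are placeholders, and the function names are hypothetical. It relies on the fact that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(subnet_cidr: str) -> int:
    """Return how many addresses in the subnet are available for resources."""
    network = ipaddress.ip_network(subnet_cidr, strict=True)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def subnet_within_vpc(subnet_cidr: str, vpc_cidr: str) -> bool:
    """Return True if the subnet range lies inside the VPC range."""
    subnet = ipaddress.ip_network(subnet_cidr, strict=True)
    vpc = ipaddress.ip_network(vpc_cidr, strict=True)
    return subnet.subnet_of(vpc)
```

For example, usable_addresses("10.0.1.0/24") returns 251, and subnet_within_vpc("10.0.1.0/24", "10.0.0.0/16") returns True. Addresses already used by other resources in the subnet are not accounted for.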
Instance groups
To add an instance group, choose Add group. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group.
Important
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.
Follow these steps to add an instance group.
-
For Instance group type, choose a type for your instance group. For this tutorial, choose Controller (head) for my-controller-group, Login for my-login-group, and Compute (worker) for worker-group-1.
-
For Name, specify a name for the instance group. For this tutorial, create three instance groups named my-controller-group, my-login-group, and worker-group-1.
-
For Instance capacity, choose either on-demand capacity or a training plan to reserve your compute resources.
-
For Instance type, choose the instance type for the instance group. For this tutorial, select ml.c5.xlarge for my-controller-group, ml.m5.4xlarge for my-login-group, and ml.trn1.32xlarge for worker-group-1.
Important
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see SageMaker HyperPod quotas.
-
For Instance quantity, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter 1 for all three groups.
-
For Target Availability Zone, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.
-
For Additional storage volume per instance (GB) - optional, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is /opt/sagemaker. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify that the EBS volume is mounted correctly by running the df -h command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the Amazon EBS volumes section in the Amazon Elastic Block Store User Guide.
-
For Slurm partition name (Compute groups only), enter the Slurm partition name for this compute instance group. Partitions act as logical queues that organize how jobs are scheduled across different sets of nodes.
-
Choose Add instance group.
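The df -h check mentioned in the storage step can also be scripted once you are logged in to a node. The following is a minimal sketch that looks for the /opt/sagemaker mount point in text formatted like the Linux /proc/mounts file. The function name is hypothetical, and the device names in the usage note are made up for illustration.

```python
from typing import Optional

# Default mount path for the additional EBS volume, as stated in the steps above.
MOUNT_PATH = "/opt/sagemaker"


def find_mount(mounts_text: str, mount_point: str = MOUNT_PATH) -> Optional[str]:
    """Return the device mounted at mount_point, or None if nothing is mounted.

    mounts_text is expected in /proc/mounts layout, where each line starts with
    the device followed by the mount point, separated by whitespace.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            return fields[0]
    return None
```

On a node, you could pass the contents of /proc/mounts to find_mount. With the sample text "/dev/nvme1n1 /opt/sagemaker xfs rw 0 0", the function returns "/dev/nvme1n1". If it returns None, the additional volume is not mounted at the default path.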
Lifecycle configuration - optional
Configure how nodes in your cluster are provisioned. Your choice affects Amazon S3 bucket requirements, internet access needs, and provisioning complexity. HyperPod supports three node lifecycle configuration options, each offering a different level of control over the provisioning process.
-
For Lifecycle scripts, choose one of the following options to control how nodes are provisioned in your cluster:
-
None — HyperPod configures nodes automatically using AMI-based configuration. Slurm daemons, Docker, Enroot, Pyxis, Slurm accounting with MariaDB, SSH key generation and propagation, log rotation, and home directory setup are all configured without any scripts or Amazon S3 bucket. All software is pre-packaged in the AMI, so no internet access is required during provisioning. This is the simplest path for new clusters.
-
Use default lifecycle scripts — Default lifecycle scripts are uploaded to the chosen Amazon S3 bucket and used to provision nodes. This option uses the scripts from the Awsome Distributed Training repository (ADTR).
-
Use custom lifecycle scripts — Choose lifecycle scripts from an Amazon S3 bucket. This corresponds to the OnCreate path in the API, where your scripts own the entire provisioning sequence, including when Slurm starts. HyperPod does not run AMI-based configuration when this option is selected.
The following table summarizes the three options:
Option | What HyperPod does | Amazon S3 bucket needed? | Internet access needed?
None (AMI-based configuration) | Configures nodes automatically with Slurm and essential packages | No | No
Use default lifecycle scripts | Uploads and runs ADTR scripts from Amazon S3 | Yes | Yes
Use custom lifecycle scripts | Runs your scripts from Amazon S3; you own the full provisioning sequence | Yes | Depends on your scripts
-
For Extension script file in S3 - optional (appears when you choose None under Lifecycle scripts), enter the Amazon S3 URI of your extension script. The extension script allows you to provision additional optional capabilities, such as observability, System Security Services Daemon (SSSD), and Amazon S3 bucket mounting, on top of default configurations without managing the entire set of lifecycle scripts.
Enter the full Amazon S3 URI to the entry point script, for example:
s3://DOC-EXAMPLE-BUCKET/extensions/run_extensions.sh
HyperPod downloads the entire folder where the entry point script resides. Structure your Amazon S3 folder so that all supporting files are in the same directory as the entry point script.
Note
In the API, this corresponds to specifying OnInitComplete in LifeCycleConfig with SourceS3Uri. The console combines these into a single Amazon S3 URI field pointing directly to the entry point script.
Tip
For ready-to-use extension scripts, see the Extensions folder in the Awsome Distributed Training repository. The run_extensions.sh script orchestrates multiple capabilities with simple boolean toggles to enable or disable each one.
-
For S3 bucket for lifecycle scripts (appears when you choose Use default lifecycle scripts or Use custom lifecycle scripts), choose to create a new bucket or use an existing bucket to store the lifecycle scripts.
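Because HyperPod downloads the whole folder that contains the extension entry point script, it can be useful to see which folder a given URI implies. The following is a minimal sketch that splits an Amazon S3 URI into bucket, folder, and script name using the Python standard library. The type and function names are hypothetical, and the sketch does not contact Amazon S3 or confirm that the object exists.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class EntryPoint(NamedTuple):
    bucket: str
    folder: str  # key prefix whose contents would be downloaded together
    script: str  # file name of the entry point script


def split_entry_point(uri: str) -> EntryPoint:
    """Split s3://bucket/folder/script.sh into its three parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an Amazon S3 URI: {uri!r}")
    key = parsed.path.lstrip("/")
    if not key or key.endswith("/"):
        raise ValueError("the URI must point to a script file, not a folder")
    # Everything before the last slash is the folder; the rest is the script.
    folder, _, script = key.rpartition("/")
    return EntryPoint(parsed.netloc, folder, script)
```

For the example URI shown earlier, split_entry_point("s3://DOC-EXAMPLE-BUCKET/extensions/run_extensions.sh") yields the bucket DOC-EXAMPLE-BUCKET, the folder extensions, and the script run_extensions.sh, so supporting files belong under the extensions folder.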
Note
Optional node lifecycle configuration is supported only for
Slurm-orchestrated clusters. Amazon EKS-orchestrated clusters and Slurm
clusters using Continuous NodeProvisioningMode continue to
require lifecycle scripts on every instance group.
Note
The None option with an extension script and the
Use custom lifecycle scripts option are
mutually exclusive. You cannot combine AMI-based configuration with
extension script and custom lifecycle scripts on the same instance
group. In the API, this means OnCreate and
OnInitComplete cannot be specified together.
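The mutual exclusivity rule can be expressed as a small check on the two API fields named in this section. The following is a minimal sketch under the assumption that an instance group's lifecycle settings are represented by two optional strings. The function name and the returned labels are hypothetical, and the sketch does not call the SageMaker API.

```python
from typing import Optional


def provisioning_mode(
    on_create: Optional[str], on_init_complete: Optional[str]
) -> str:
    """Classify one instance group's lifecycle settings.

    on_create stands for the OnCreate value and on_init_complete for the
    OnInitComplete value described above. The two cannot be combined.
    """
    if on_create and on_init_complete:
        raise ValueError("OnCreate and OnInitComplete cannot be specified together")
    if on_create:
        # Your scripts own the entire provisioning sequence.
        return "custom lifecycle scripts"
    if on_init_complete:
        # AMI-based configuration runs, then the extension script.
        return "AMI-based configuration with extension script"
    return "AMI-based configuration"
```

For example, provisioning_mode(None, "run_extensions.sh") returns "AMI-based configuration with extension script", and passing both values raises a ValueError.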
Permissions
Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf.
Storage
Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster. FSx configuration is optional for cluster creation but recommended for production ML workloads.
-
For File system, choose an existing FSx for Lustre file system, create a new FSx for Lustre file system, or choose not to provision an FSx for Lustre file system.
-
For Throughput per unit of storage, choose the throughput that will be available per TiB of provisioned storage.
-
For Storage capacity, enter a capacity value in TB.
-
For Data compression type, choose LZ4 to enable data compression.
-
For Lustre version, view the value that's recommended for the new file systems.
Note
When using AMI-based configuration (choosing None under Lifecycle scripts) or an extension script, HyperPod handles FSx for Lustre mounting automatically. When using custom lifecycle scripts, your scripts are responsible for mounting the filesystem.
Tags - optional
For Tags - optional, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see Tagging your AWS resources.
Deploy resources
After you complete the cluster configuration using either Quick setup or Custom setup, choose one of the following options to start resource provisioning and cluster creation.
-
Submit - SageMaker AI will start provisioning the default configuration resources and creating the cluster.
-
Download CloudFormation template parameters - You download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack, which provisions the configuration resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see more instructions in Creating SageMaker HyperPod clusters using CloudFormation templates.
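If you edit the downloaded parameter file, a small script can make the change repeatable. The following is a minimal sketch that assumes the file uses the list layout that the AWS CLI accepts for CloudFormation parameters, where each entry has a ParameterKey and a ParameterValue. The actual layout of the downloaded file might differ, and the function name and the key in the usage note are hypothetical.

```python
import json
from typing import Dict, List


def set_parameter(
    params: List[Dict[str, str]], key: str, value: str
) -> List[Dict[str, str]]:
    """Return a copy of params with the value for key replaced.

    Assumes entries shaped like
    {"ParameterKey": "...", "ParameterValue": "..."}.
    Raises KeyError if the key is not present, so typos are not silently added.
    """
    updated = [dict(entry) for entry in params]  # leave the input untouched
    for entry in updated:
        if entry.get("ParameterKey") == key:
            entry["ParameterValue"] = value
            return updated
    raise KeyError(key)


def rewrite_file(path: str, key: str, value: str) -> None:
    """Load the JSON file at path, change one parameter, and save it."""
    with open(path, encoding="utf-8") as handle:
        params = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(set_parameter(params, key, value), handle, indent=2)
```

For example, rewrite_file("params.json", "ExampleKey", "new-value") updates a single entry, where params.json and ExampleKey are placeholders for your file name and a key that exists in it.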
Delete the cluster and clean resources
After you have successfully tested creating a SageMaker HyperPod cluster, it continues
running in the InService state until you delete the cluster. We
recommend that you delete any clusters created using on-demand SageMaker AI instances when
not in use to avoid incurring continued service charges based on on-demand pricing.
In this tutorial, you have created a cluster that consists of three instance groups.
One of them uses a C5 instance, so make sure you delete the cluster by following the
instructions at Delete a SageMaker HyperPod cluster.
However, if you have created a cluster with reserved compute capacity, the status of the clusters does not affect service billing.
If you used Use default lifecycle scripts or Use custom lifecycle scripts, go to the Amazon S3 bucket you used during cluster creation and remove the lifecycle script files.
If you used None (AMI-based configuration only) without an extension script, no Amazon S3 cleanup is needed for lifecycle scripts.
If you used None with an extension script, clean up the extension script files from the Amazon S3 bucket you specified.
If you have tested running any workloads on the cluster, check whether you uploaded any data or whether your jobs saved any artifacts to other S3 buckets or to file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To avoid incurring charges, delete all artifacts and data from the storage or file system.