Provisioning resources using CloudFormation stacks
To set up multiple controller nodes in a HyperPod Slurm cluster, provision AWS resources through two CloudFormation stacks: Provision basic resources and Provision additional resources to support multiple controller nodes.
Provision basic resources
Follow these steps to provision basic resources for your Amazon SageMaker HyperPod Slurm cluster.
- Download the sagemaker-hyperpod.yaml template file to your machine. This YAML file is a CloudFormation template that defines the following resources to create for your Slurm cluster.
  - An execution IAM role for the compute node instance group
  - An Amazon S3 bucket to store the lifecycle scripts
  - Public and private subnets (private subnets have internet access through NAT gateways)
  - An internet gateway and NAT gateways
  - Two Amazon EC2 security groups
  - An Amazon FSx volume to store configuration files
- Run the following CLI command to create a CloudFormation stack named sagemaker-hyperpod. Define the Availability Zone (AZ) IDs for your cluster in PrimarySubnetAZ and BackupSubnetAZ. For example, use1-az4 is an AZ ID for an Availability Zone in the us-east-1 Region. For more information, see Availability Zone IDs and Setting up SageMaker HyperPod clusters across multiple AZs.

      aws cloudformation deploy \
          --template-file /path_to_template/sagemaker-hyperpod.yaml \
          --stack-name sagemaker-hyperpod \
          --parameter-overrides PrimarySubnetAZ=use1-az4 BackupSubnetAZ=use1-az1 \
          --capabilities CAPABILITY_IAM

  For more information, see deploy from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

      Waiting for changeset to be created..
      Waiting for stack create/update to complete
      Successfully created/updated stack - sagemaker-hyperpod

- (Optional) Verify the stack in the CloudFormation console.
  - From the left navigation, choose Stacks.
  - On the Stacks page, find and choose sagemaker-hyperpod.
  - Choose tabs such as Resources and Outputs to review the resources and outputs.
- Create environment variables from the stack (sagemaker-hyperpod) outputs. You will use values of these variables to Provision additional resources to support multiple controller nodes.

      source .env
      PRIMARY_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`PrimaryPrivateSubnet`].OutputValue' --output text)
      BACKUP_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`BackupPrivateSubnet`].OutputValue' --output text)
      EMAIL=$(bash -c 'read -p "INPUT YOUR SNSSubEmailAddress HERE: " && echo $REPLY')
      DB_USER_NAME=$(bash -c 'read -p "INPUT YOUR DB_USER_NAME HERE: " && echo $REPLY')
      SECURITY_GROUP=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`SecurityGroup`].OutputValue' --output text)
      ROOT_BUCKET_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' --output text)
      SLURM_FSX_DNS_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemDNSname`].OutputValue' --output text)
      SLURM_FSX_MOUNT_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemMountname`].OutputValue' --output text)
      COMPUTE_NODE_ROLE=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonSagemakerClusterExecutionRoleArn`].OutputValue' --output text)

  When you see prompts asking for your email address and database user name, enter values like the following.

      INPUT YOUR SNSSubEmailAddress HERE: Email_address_to_receive_SNS_notifications
      INPUT YOUR DB_USER_NAME HERE: Database_user_name_you_define

  To verify variable values, use the print $variable command.

      print $REGION
      us-east-1
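The commands in this procedure source a .env file and read REGION and SAGEMAKER_STACK_NAME from it, and the second procedure reads MULTI_HEAD_SLURM_STACK, but the file's contents are not shown on this page. The following is a minimal sketch of what such a file might contain, assuming it only needs to define those three variables. The values, and the pairing of each variable with a stack name, are assumptions based on the stack names used here. The require_vars function is a hypothetical helper, not part of any AWS tooling; it reports variables that are unset or empty.

```shell
# Hypothetical .env contents: the values are assumptions; adjust to your setup.
# Note: this overwrites any existing .env in the current directory.
cat > .env <<'EOF'
REGION=us-east-1
SAGEMAKER_STACK_NAME=sagemaker-hyperpod
MULTI_HEAD_SLURM_STACK=sagemaker-hyperpod-mh
EOF

# POSIX spelling of `source .env`.
. ./.env

# require_vars is a hypothetical helper: it prints the name of every listed
# variable that is unset or empty and returns non-zero if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars REGION SAGEMAKER_STACK_NAME MULTI_HEAD_SLURM_STACK
```

After creating the environment variables from the stack outputs, a call such as require_vars PRIMARY_SUBNET BACKUP_SUBNET SECURITY_GROUP EMAIL DB_USER_NAME can catch a variable that ended up empty before it is passed to the second stack.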
Provision additional resources to support multiple controller nodes
Follow these steps to provision additional resources for your Amazon SageMaker HyperPod Slurm cluster with multiple controller nodes.
- Download the sagemaker-hyperpod-slurm-multi-headnode.yaml template file to your machine. This second YAML file is a CloudFormation template that defines the additional resources to create for multiple controller nodes support in your Slurm cluster.
  - An execution IAM role for the controller node instance group
  - An Amazon RDS for MariaDB instance
  - An Amazon SNS topic and subscription
  - AWS Secrets Manager credentials for Amazon RDS for MariaDB
- Run the following CLI command to create a CloudFormation stack named sagemaker-hyperpod-mh. This second stack uses the CloudFormation template to create additional AWS resources to support the multiple controller nodes architecture.

      aws cloudformation deploy \
          --template-file /path_to_template/sagemaker-hyperpod-slurm-multi-headnode.yaml \
          --stack-name sagemaker-hyperpod-mh \
          --parameter-overrides \
              SlurmDBSecurityGroupId=$SECURITY_GROUP \
              SlurmDBSubnetGroupId1=$PRIMARY_SUBNET \
              SlurmDBSubnetGroupId2=$BACKUP_SUBNET \
              SNSSubEmailAddress=$EMAIL \
              SlurmDBUsername=$DB_USER_NAME \
          --capabilities CAPABILITY_NAMED_IAM

  For more information, see deploy from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

      Waiting for changeset to be created..
      Waiting for stack create/update to complete
      Successfully created/updated stack - sagemaker-hyperpod-mh

- (Optional) Verify the stack in the CloudFormation console.
  - From the left navigation, choose Stacks.
  - On the Stacks page, find and choose sagemaker-hyperpod-mh.
  - Choose tabs such as Resources and Outputs to review the resources and outputs.
- Create environment variables from the stack (sagemaker-hyperpod-mh) outputs. You will use values of these variables to update the configuration file (provisioning_parameters.json) in Preparing and uploading lifecycle scripts.

      source .env
      SLURM_DB_ENDPOINT_ADDRESS=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBEndpointAddress`].OutputValue' --output text)
      SLURM_DB_SECRET_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBSecretArn`].OutputValue' --output text)
      SLURM_EXECUTION_ROLE_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmExecutionRoleArn`].OutputValue' --output text)
      SLURM_SNS_FAILOVER_TOPIC_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmFailOverSNSTopicArn`].OutputValue' --output text)
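The describe-stacks calls in both procedures repeat the same options and differ only in the stack name and the output key. As a sketch, two small shell functions can wrap that pattern. The names stack_output and stack_status are hypothetical and not part of the AWS CLI; both assume REGION is already set, for example by sourcing .env. The StackStatus field queried by stack_status is part of the describe-stacks response and offers a command-line alternative to the optional console check.

```shell
# stack_output is a hypothetical helper wrapping the describe-stacks call
# used on this page. Arguments: $1 = stack name, $2 = output key.
# It expects REGION to be set (for example by sourcing .env).
stack_output() {
  aws --region "$REGION" cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey==\`$2\`].OutputValue" \
    --output text
}

# stack_status is a hypothetical helper that prints the stack's current
# status string, such as CREATE_COMPLETE. Argument: $1 = stack name.
stack_status() {
  aws --region "$REGION" cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Example usage (commented out because it calls AWS):
# SLURM_DB_SECRET_ARN=$(stack_output "$MULTI_HEAD_SLURM_STACK" SlurmDBSecretArn)
# stack_status "$MULTI_HEAD_SLURM_STACK"
```

Keeping the query string in one place means a typo in an output key shows up as a single empty variable instead of being hidden among near-identical long lines.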