Base lifecycle scripts provided by HyperPod
This section walks you through every component of the basic flow of setting up Slurm
            on HyperPod in a top-down approach. It starts from preparing a HyperPod cluster
            creation request to run the CreateCluster API, and then dives down the hierarchical
            structure to the lifecycle scripts. Use the sample lifecycle scripts provided in the
            Awsome Distributed Training GitHub repository:
git clone https://github.com/aws-samples/awsome-distributed-training/
The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are
            available at 1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config
cd awsome-distributed-training/1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config
The following flowchart shows a detailed overview of how you should design the base
            lifecycle scripts. The descriptions below the diagram and the procedural guide explain
            how they work during the HyperPod CreateCluster API call.
        Figure: A detailed flow chart of
                HyperPod cluster creation and the structure of lifecycle scripts. (1) The
                dashed arrows point to where the boxes are "called into" and show the flow of
                preparing the configuration files and lifecycle scripts. It starts from preparing
                    provisioning_parameters.json and the lifecycle scripts. These are then
                orchestrated in lifecycle_script.py so that they run collectively in order, and
                lifecycle_script.py is in turn run by the
                    on_create.sh shell script, which runs in the HyperPod
                instance terminal. (2) The solid arrows show the main HyperPod cluster
                creation flow and how the boxes are "called into" or "submitted to".
                    on_create.sh is required for the cluster creation request, either in
                    create_cluster.json or the Create a cluster
                request form in the console UI. After you submit the request, HyperPod runs
                the CreateCluster API based on the given configuration information from
                the request and the lifecycle scripts. (3) The dotted arrow indicates that the
                HyperPod platform creates resource_config.json in the cluster
                instances during cluster resource provisioning. resource_config.json
                contains HyperPod cluster resource information such as the cluster ARN,
                instance types, and IP addresses. It is important to note that you should prepare
                the lifecycle scripts to expect the resource_config.json file during
                cluster creation. For more information, see the procedural guide
            below.
The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.
- create_cluster.json – To submit a HyperPod cluster creation request, you prepare a CreateCluster request file in JSON format. In this best practices example, we assume that the request file is named create_cluster.json. Write create_cluster.json to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to the Slurm nodes you plan to set up.

  Also, you are required to specify the S3 bucket path where you store your entire set of configuration files and lifecycle scripts in the field InstanceGroups.LifeCycleConfig.SourceS3Uri of the CreateCluster request form, and the file name of an entrypoint shell script (assume that it's named on_create.sh) in InstanceGroups.LifeCycleConfig.OnCreate.

  Note: If you are using the Create a cluster submission form in the HyperPod console UI, the console manages filling in and submitting the CreateCluster request on your behalf, and runs the CreateCluster API in the backend. In this case, you don't need to create create_cluster.json; instead, make sure that you specify the correct cluster configuration information in the Create a cluster submission form.
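  For illustration, a minimal create_cluster.json might look like the following sketch. This is not a definitive template: the instance group names and instance types are chosen to match the three-node Slurm example used later in this guide, and the S3 path, IAM role ARN, and instance counts are placeholders that you replace with your own values. Refer to the CreateCluster API reference for the authoritative field list.

  {
      "ClusterName": "your-hyperpod-cluster",
      "InstanceGroups": [
          {
              "InstanceGroupName": "controller-machine",
              "InstanceType": "ml.c5.xlarge",
              "InstanceCount": 1,
              "LifeCycleConfig": {
                  "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                  "OnCreate": "on_create.sh"
              },
              "ExecutionRole": "arn:aws:iam::111122223333:role/your-hyperpod-cluster-role"
          },
          {
              "InstanceGroupName": "login-group",
              "InstanceType": "ml.m5.xlarge",
              "InstanceCount": 1,
              "LifeCycleConfig": {
                  "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                  "OnCreate": "on_create.sh"
              },
              "ExecutionRole": "arn:aws:iam::111122223333:role/your-hyperpod-cluster-role"
          },
          {
              "InstanceGroupName": "compute-nodes",
              "InstanceType": "ml.trn1.32xlarge",
              "InstanceCount": 4,
              "LifeCycleConfig": {
                  "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                  "OnCreate": "on_create.sh"
              },
              "ExecutionRole": "arn:aws:iam::111122223333:role/your-hyperpod-cluster-role"
          }
      ]
  }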
- on_create.sh – For each instance group, you provide an entrypoint shell script, on_create.sh, to run commands, run scripts that install software packages, and set up the HyperPod cluster environment with Slurm. The two things you need to prepare are a provisioning_parameters.json file required by HyperPod for setting up Slurm and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files, as shown in the sample script at on_create.sh.

  Note: Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in create_cluster.json. You should also place your provisioning_parameters.json in the same location.
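  At a high level, the entrypoint locates the provisioning parameters and hands control to lifecycle_script.py. The following is a simplified sketch of that flow, not the exact sample script from the repository; in particular, the -rc and -pp argument names are illustrative assumptions, so check the sample on_create.sh for the real invocation.

  #!/bin/bash
  set -euxo pipefail

  # provisioning_parameters.json is uploaded to S3 alongside this script
  # (see the next item in this list).
  PROVISIONING_PARAMETERS_PATH="provisioning_parameters.json"

  # SAGEMAKER_RESOURCE_CONFIG_PATH is set by the HyperPod platform and points
  # to the auto-generated /opt/ml/config/resource_config.json.
  python3 lifecycle_script.py \
      -rc "${SAGEMAKER_RESOURCE_CONFIG_PATH}" \
      -pp "${PROVISIONING_PARAMETERS_PATH}"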
  - provisioning_parameters.json – This is a configuration form for provisioning Slurm nodes on HyperPod. The on_create.sh script finds this JSON file and defines an environment variable that identifies the path to it. Through this JSON file, you can configure the Slurm nodes and storage options, such as Amazon FSx for Lustre, for Slurm to communicate with. In provisioning_parameters.json, make sure that you assign the HyperPod cluster instance groups, using the names you specified in create_cluster.json, to the Slurm nodes appropriately based on how you plan to set them up.

    The following diagram shows an example of how the two JSON configuration files create_cluster.json and provisioning_parameters.json should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: a controller (management) node, a login node (which is optional), and a compute (worker) node.

    Tip: To help you validate these two JSON files, the HyperPod service team provides a validation script, validate-config.py. To learn more, see Validating the JSON configuration files before creating a Slurm cluster on HyperPod.

    Figure: Direct comparison between create_cluster.json for HyperPod cluster creation and provisioning_parameters.json for Slurm configuration. The number of instance groups in create_cluster.json should match the number of nodes you want to configure as Slurm nodes. In the example in the figure, three Slurm nodes are configured on a HyperPod cluster of three instance groups. You should assign the HyperPod cluster instance groups to Slurm nodes by specifying the instance group names accordingly.
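    As an illustration, a provisioning_parameters.json for this three-node example might look like the following sketch. The layout follows the sample file in the Awsome Distributed Training repository, but treat the field names and the FSx for Lustre values here as assumptions and verify them against the sample in the repository version you clone.

    {
        "version": "1.0.0",
        "workload_manager": "slurm",
        "controller_group": "controller-machine",
        "login_group": "login-group",
        "worker_groups": [
            {
                "instance_group_name": "compute-nodes",
                "partition_name": "dev"
            }
        ],
        "fsx_dns_name": "fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com",
        "fsx_mountname": "abcdefgh"
    }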
  - resource_config.json – During cluster creation, the lifecycle_script.py script is written to expect a resource_config.json file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

    When you run the CreateCluster API, HyperPod creates a resource configuration file at /opt/ml/config/resource_config.json based on the create_cluster.json file. The file path is saved to the environment variable named SAGEMAKER_RESOURCE_CONFIG_PATH.

    Important: The resource_config.json file is auto-generated by the HyperPod platform, and you DO NOT need to create it. The following code shows an example of the resource_config.json that would be created during cluster creation based on the create_cluster.json from the previous step, to help you understand what happens in the backend and how an auto-generated resource_config.json looks.

    {
        "ClusterConfig": {
            "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
            "ClusterName": "your-hyperpod-cluster"
        },
        "InstanceGroups": [
            {
                "Name": "controller-machine",
                "InstanceType": "ml.c5.xlarge",
                "Instances": [
                    {
                        "InstanceName": "controller-machine-1",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    }
                ]
            },
            {
                "Name": "login-group",
                "InstanceType": "ml.m5.xlarge",
                "Instances": [
                    {
                        "InstanceName": "login-group-1",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    }
                ]
            },
            {
                "Name": "compute-nodes",
                "InstanceType": "ml.trn1.32xlarge",
                "Instances": [
                    {
                        "InstanceName": "compute-nodes-1",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    },
                    {
                        "InstanceName": "compute-nodes-2",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    },
                    {
                        "InstanceName": "compute-nodes-3",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    },
                    {
                        "InstanceName": "compute-nodes-4",
                        "AgentIpAddress": "111.222.333.444",
                        "CustomerIpAddress": "111.222.333.444",
                        "InstanceId": "i-12345abcedfg67890"
                    }
                ]
            }
        ]
    }
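    For example, a lifecycle script can read the generated file through the environment variable mentioned above. This one-liner is only an illustration; the fallback path matches the location stated above.

    # Print the auto-generated resource configuration.
    cat "${SAGEMAKER_RESOURCE_CONFIG_PATH:-/opt/ml/config/resource_config.json}"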
  - lifecycle_script.py – This is the main Python script that collectively runs the lifecycle scripts that set up Slurm on the HyperPod cluster while it is being provisioned. This script reads provisioning_parameters.json and resource_config.json from the paths that are specified or identified in on_create.sh, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

    Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample lifecycle_script.py script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons (start_slurm.sh), mounting Amazon FSx for Lustre (mount_fsx.sh), and setting up MariaDB accounting (setup_mariadb_accounting.sh) and RDS accounting (setup_rds_accounting.sh). You can also add more scripts, package them under the same directory, and add code lines to lifecycle_script.py to let HyperPod run the scripts; a hypothetical example of that pattern follows the list below. For more information about the base lifecycle scripts, see also 3.1 Lifecycle scripts in the Awsome Distributed Training GitHub repository.

    Note: HyperPod runs the SageMaker HyperPod DLAMI on each instance of a cluster, and the AMI has pre-installed software packages that are compatible with each other and with HyperPod functionalities. If you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and note that some HyperPod functionalities might not work as expected.

    In addition to the default setups, more scripts for installing the following software are available under the utils folder. The lifecycle_script.py file already includes the code lines for running these installation scripts, so see the following items to find those lines and uncomment them to activate the steps.
    - The following code lines are for installing Docker, Enroot, and Pyxis. These packages are required to run Docker containers on a Slurm cluster. To enable this installation step, set the enable_docker_enroot_pyxis parameter to True in the config.py file.

      # Install Docker/Enroot/Pyxis
      if Config.enable_docker_enroot_pyxis:
          ExecuteBashScript("./utils/install_docker.sh").run()
          ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
    - You can integrate your HyperPod cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the Slurm dashboard, the NVIDIA DCGM Exporter dashboard, and the EFA Metrics dashboard on Amazon Managed Grafana, you need to install the Slurm exporter for Prometheus, the NVIDIA DCGM exporter, and the EFA node exporter. For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see SageMaker HyperPod cluster resources monitoring. To enable this installation step, set the enable_observability parameter to True in the config.py file.

      # Install metric exporting software and Prometheus for observability
      if Config.enable_observability:
          if node_type == SlurmNodeType.COMPUTE_NODE:
              ExecuteBashScript("./utils/install_docker.sh").run()
              ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
              ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

          if node_type == SlurmNodeType.HEAD_NODE:
              wait_for_scontrol()
              ExecuteBashScript("./utils/install_docker.sh").run()
              ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
              ExecuteBashScript("./utils/install_prometheus.sh").run()
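    As mentioned in the description of lifecycle_script.py above, you can also wire in your own setup scripts by following the same pattern. The line below is a hypothetical sketch; install_my_tools.sh is a placeholder name for a script you would package under the directory you upload to S3.

      # Hypothetical example: run a custom setup script packaged with the other
      # lifecycle scripts (the file name is a placeholder).
      ExecuteBashScript("./utils/install_my_tools.sh").run()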
- Make sure that you upload all configuration files and setup scripts from Step 2 to the S3 bucket you provide in the CreateCluster request in Step 1. For example, assume that your create_cluster.json has the following.

  "LifeCycleConfig": {
      "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
      "OnCreate": "on_create.sh"
  }

  Then, your "s3://sagemaker-hyperpod-lifecycle/src" should contain on_create.sh, lifecycle_script.py, provisioning_parameters.json, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

  └── lifecycle_files                  // your local folder
      ├── provisioning_parameters.json
      ├── on_create.sh
      ├── lifecycle_script.py
      └── ...                          // more setup scripts to be fed into lifecycle_script.py

  To upload the files, use the S3 command as follows.

  aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
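  To double-check the upload before submitting the cluster creation request, you can list the S3 prefix. This is an optional sanity check using the same example path.

  # Confirm that on_create.sh, lifecycle_script.py, and
  # provisioning_parameters.json are present at the prefix referenced in
  # create_cluster.json.
  aws s3 ls --recursive s3://sagemaker-hyperpod-lifecycle/src/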