

# Customizing SageMaker HyperPod clusters using lifecycle scripts

SageMaker HyperPod offers always up-and-running compute clusters that are highly customizable: you can write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources. The following topics cover best practices for preparing lifecycle scripts to set up SageMaker HyperPod clusters with open source workload manager tools.

The following topics discuss in-depth best practices for preparing lifecycle scripts to set up Slurm configurations on SageMaker HyperPod.

## High-level overview


The following procedure describes the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps are ordered in a ***bottom-up*** approach.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

1. Prepare Slurm configuration. Choose one of the following approaches:
   + **Option A: API-driven configuration (recommended)** – Define Slurm node types and partitions directly in the `CreateCluster` API payload using `SlurmConfig` within each instance group. With this approach:
     + No `provisioning_parameters.json` file is needed
     + Slurm topology is defined in the API payload alongside instance group definitions
     + FSx filesystems are configured per-instance-group via `InstanceStorageConfigs`
     + Configuration strategy is controlled via `Orchestrator.Slurm.SlurmConfigStrategy`

     Example `SlurmConfig` in an instance group:

     ```
     {
         "InstanceGroupName": "gpu-compute",
         "InstanceType": "ml.p4d.24xlarge",
         "InstanceCount": 8,
         "SlurmConfig": {
             "NodeType": "Compute",
             "PartitionNames": ["gpu-training"]
         }
     }
     ```
   + **Option B: Legacy configuration** – Prepare a `provisioning_parameters.json` file, which is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). `provisioning_parameters.json` should contain Slurm node configuration information to be provisioned on the HyperPod cluster. This should reflect the design of Slurm nodes from Step 1.

1. Prepare a set of lifecycle scripts that set up Slurm on HyperPod, install software packages, and configure an environment in the cluster for your use case. You should structure the lifecycle scripts to collectively run in order from a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) to run the Python script. The entrypoint shell script is what you provide to the HyperPod cluster creation request later in Step 6.

   Also, note that you should write the scripts to expect `resource_config.json` that will be generated by HyperPod during cluster creation. `resource_config.json` contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you need to use for configuring Slurm.
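
As an illustration, a lifecycle script might read `resource_config.json` as follows. This is a minimal sketch: the helper functions are hypothetical, but the default file path, the `SAGEMAKER_RESOURCE_CONFIG_PATH` environment variable, and the JSON keys match those described later in this guide.

```python
import json
import os


def read_resource_config(path=None):
    """Load the resource_config.json file that HyperPod generates.

    HyperPod writes the file to /opt/ml/config/resource_config.json and
    exports its path in SAGEMAKER_RESOURCE_CONFIG_PATH during creation.
    """
    path = path or os.environ.get(
        "SAGEMAKER_RESOURCE_CONFIG_PATH", "/opt/ml/config/resource_config.json"
    )
    with open(path) as f:
        return json.load(f)


def instance_ips(config, group_name):
    """Return the IP addresses of all instances in a named instance group."""
    for group in config["InstanceGroups"]:
        if group["Name"] == group_name:
            return [i["CustomerIpAddress"] for i in group["Instances"]]
    return []
```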

1. Collect all the files from the previous steps into a folder. The folder structure depends on the configuration approach you selected in Step 2.

   If you selected Option A (API-driven configuration):

   Your folder only needs lifecycle scripts for custom setup tasks. Slurm configuration and FSx mounting are handled automatically by HyperPod based on the API payload.

   ```
   └── lifecycle_files // your local folder
   
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```
**Note**  
The `provisioning_parameters.json` file is not required when using API-driven configuration.

   If you selected Option B (legacy configuration):

   Your folder must include `provisioning_parameters.json` and the full set of lifecycle scripts.

   ```
   └── lifecycle_files // your local folder
   
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

1. Upload all the files to an S3 bucket, and copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-`, because you need to choose an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attached with [`AmazonSageMakerClusterInstanceRolePolicy`](security-iam-awsmanpol-AmazonSageMakerClusterInstanceRolePolicy.md), which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following is an example command to upload all the files to an S3 bucket.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

1. Prepare a HyperPod cluster creation request. 
   + Option 1: If you use the AWS CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at [Create a new cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster).
   + Option 2: If you use the SageMaker AI console UI, fill the **Create a cluster** request form in the HyperPod console UI following the instructions at [Create a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster).

   At this stage, make sure that you create instance groups in the same structure that you planned in Steps 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request form.

1. Submit the cluster creation request. HyperPod provisions a cluster based on the request, creates a `resource_config.json` file in the HyperPod cluster instances, and then sets up Slurm on the cluster by running the lifecycle scripts.
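
As a sketch of the last step, the following Python assembles a `CreateCluster` request body in which each instance group points at the same lifecycle scripts in Amazon S3 and uses `on_create.sh` as its entrypoint. The helper function and the role ARN are hypothetical placeholders; consult the `CreateCluster` API reference for the full schema.

```python
def build_create_cluster_request(cluster_name, source_s3_uri, instance_groups):
    """Assemble a CreateCluster request body.

    instance_groups is a list of (name, instance_type, count) tuples.
    Each group runs the same lifecycle scripts from source_s3_uri with
    on_create.sh as the entrypoint, mirroring Steps 4 and 5 above.
    """
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {
                "InstanceGroupName": name,
                "InstanceType": instance_type,
                "InstanceCount": count,
                "LifeCycleConfig": {
                    "SourceS3Uri": source_s3_uri,
                    "OnCreate": "on_create.sh",
                },
                # Placeholder ARN; use your IAM role for SageMaker HyperPod.
                "ExecutionRole": "arn:aws:iam::111122223333:role/my-hyperpod-role",
                "ThreadsPerCore": 1,
            }
            for name, instance_type, count in instance_groups
        ],
    }


request = build_create_cluster_request(
    "my-hyperpod-cluster",
    "s3://sagemaker-hyperpod-lifecycle/src",
    [("controller-machine", "ml.c5.xlarge", 1), ("compute-nodes", "ml.trn1.32xlarge", 4)],
)
# To submit with the AWS SDK for Python:
#   boto3.client("sagemaker").create_cluster(**request)
```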

The following topics walk you through and dive deep into details on how to organize configuration files and lifecycle scripts to work properly during HyperPod cluster creation.

**Topics**
+ [High-level overview](#sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview)
+ [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md)
+ [What particular configurations HyperPod manages in Slurm configuration files](sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf.md)
+ [Slurm log rotations](sagemaker-hyperpod-slurm-log-rotation.md)
+ [Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx.md)
+ [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md)
+ [Validating runtime before running production workloads on a HyperPod Slurm cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime.md)
+ [Developing lifecycle scripts interactively on a HyperPod cluster node](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts.md)

# Base lifecycle scripts provided by HyperPod

This section walks you through every component of the basic flow of setting up Slurm on HyperPod in a ***top-down*** approach. It starts from preparing a HyperPod cluster creation request to run the `CreateCluster` API, and dives deep into the hierarchical structure down to lifecycle scripts. Use the sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). Clone the repository by running the following command.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
```

The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are available at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config).

```
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
```

The following flowchart shows a detailed overview of how you should design the base lifecycle scripts. The descriptions below the diagram and the procedural guide explain how they work during the HyperPod `CreateCluster` API call.

![A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-structure.png)


***Figure:** A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts. (1) The dashed arrows point to where the boxes are "called into" and show the flow of preparing the configuration files and lifecycle scripts. It starts with preparing `provisioning_parameters.json` and the lifecycle scripts. These are then coded in `lifecycle_script.py` for collective execution in order, and the execution of the `lifecycle_script.py` script is done by the `on_create.sh` shell script, which is run in the HyperPod instance terminal. (2) The solid arrows show the main HyperPod cluster creation flow and how the boxes are "called into" or "submitted to". `on_create.sh` is required for the cluster creation request, either in `create_cluster.json` or the **Create a cluster** request form in the console UI. After you submit the request, HyperPod runs the `CreateCluster` API based on the given configuration information from the request and the lifecycle scripts. (3) The dotted arrow indicates that the HyperPod platform creates `resource_config.json` in the cluster instances during cluster resource provisioning. `resource_config.json` contains HyperPod cluster resource information such as the cluster ARN, instance types, and IP addresses. It is important to note that you should prepare the lifecycle scripts to expect the `resource_config.json` file during cluster creation. For more information, see the procedural guide below.*

The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.

1. `create_cluster.json` – To submit a HyperPod cluster creation request, you prepare a `CreateCluster` request file in JSON format. In this best practices example, we assume that the request file is named `create_cluster.json`. Write `create_cluster.json` to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to Slurm nodes you plan to set up.

   Also, you are required to specify an S3 bucket path to store your entire set of configuration files and lifecycle scripts to the field name `InstanceGroups.LifeCycleConfig.SourceS3Uri` in the `CreateCluster` request form, and specify the file name of an entrypoint shell script (assume that it's named `on_create.sh`) to `InstanceGroups.LifeCycleConfig.OnCreate`.
**Note**  
If you are using the **Create a cluster** submission form in the HyperPod console UI, the console manages filling and submitting the `CreateCluster` request on your behalf, and runs the `CreateCluster` API in the backend. In this case, you don't need to create `create_cluster.json`; instead, make sure that you specify the correct cluster configuration information to the **Create a cluster** submission form.
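
   For reference, a minimal `create_cluster.json` with two instance groups might look like the following sketch. The cluster name, instance group names, and role ARN are placeholders; see the linked instructions for the full request schema.

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "controller-machine",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/my-hyperpod-role",
               "ThreadsPerCore": 1
           },
           {
               "InstanceGroupName": "compute-nodes",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 4,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/my-hyperpod-role",
               "ThreadsPerCore": 1
           }
       ]
   }
   ```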

1. `on_create.sh` – For each instance group, you need to provide an entrypoint shell script, `on_create.sh`, that runs commands, runs scripts to install software packages, and sets up the HyperPod cluster environment with Slurm. The two things you need to prepare are a `provisioning_parameters.json` file required by HyperPod for setting up Slurm and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files as shown in the sample script at [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh).
**Note**  
Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in `create_cluster.json`. You should also place your `provisioning_parameters.json` in the same location.

   1. `provisioning_parameters.json` – This is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). The `on_create.sh` script finds this JSON file and defines an environment variable that identifies the path to it. Through this JSON file, you can configure Slurm nodes and storage options, such as Amazon FSx for Lustre, for Slurm to communicate with. In `provisioning_parameters.json`, make sure that you assign the HyperPod cluster instance groups, using the names you specified in `create_cluster.json`, to the Slurm nodes appropriately based on how you plan to set them up.

      The following diagram shows an example of how the two JSON configuration files `create_cluster.json` and `provisioning_parameters.json` should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: controller (management) node, login node (optional), and compute (worker) node.
**Tip**  
To help you validate these two JSON files, the HyperPod service team provides a validation script, [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). To learn more, see [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md).  
![Direct comparison between .json files.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-slurm-config.png)

      ***Figure:** Direct comparison between `create_cluster.json` for HyperPod cluster creation and `provisioning_parameters.json` for Slurm configuration. The number of instance groups in `create_cluster.json` should match the number of nodes you want to configure as Slurm nodes. In the example in the figure, three Slurm nodes are configured on a HyperPod cluster of three instance groups. You should assign the HyperPod cluster instance groups to Slurm nodes by specifying the instance group names accordingly.*

   1. `resource_config.json` – During cluster creation, the `lifecycle_script.py` script is written to expect a `resource_config.json` file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

      When you run the `CreateCluster` API, HyperPod creates a resource configuration file at `/opt/ml/config/resource_config.json` based on the `create_cluster.json` file. The file path is saved to the environment variable named `SAGEMAKER_RESOURCE_CONFIG_PATH`. 
**Important**  
The `resource_config.json` file is auto-generated by the HyperPod platform; you DO NOT need to create it. The following code shows an example of the `resource_config.json` that would be created during cluster creation based on the `create_cluster.json` from the previous step, to help you understand what happens in the backend and how an auto-generated `resource_config.json` looks.

      ```
      {
      
          "ClusterConfig": {
              "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
              "ClusterName": "your-hyperpod-cluster"
          },
          "InstanceGroups": [
              {
                  "Name": "controller-machine",
                  "InstanceType": "ml.c5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "controller-machine-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "login-group",
                  "InstanceType": "ml.m5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "login-group-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "compute-nodes",
                  "InstanceType": "ml.trn1.32xlarge",
                  "Instances": [
                      {
                          "InstanceName": "compute-nodes-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-2",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-3",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-4",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              }
          ]
      }
      ```

   1. `lifecycle_script.py` – This is the main Python script that collectively runs the lifecycle scripts to set up Slurm on the HyperPod cluster while it is being provisioned. This script reads in `provisioning_parameters.json` and `resource_config.json` from the paths that are specified or identified in `on_create.sh`, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

      Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh)), mounting Amazon FSx for Lustre ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh)), and setting up MariaDB accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh)) and RDS accounting
([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh)). You can also add more scripts, package them under the same directory, and add code lines to `lifecycle_script.py` to let HyperPod run the scripts. For more information about the base lifecycle scripts, see also [3.1 Lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#31-lifecycle-scripts) in the *Awsome Distributed Training GitHub repository*.
**Note**  
HyperPod runs the [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) on each instance of a cluster, and the AMI has pre-installed software packages that are tested for compatibility with HyperPod functionalities. If you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and note that some HyperPod functionalities might not work as expected.

      In addition to the default setups, scripts for installing the following software are available under the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The `lifecycle_script.py` file already includes code lines for running these installation scripts; see the following items to locate those lines and learn how to activate them.

      1. The following code lines are for installing [Docker](https://www.docker.com/), [Enroot](https://github.com/NVIDIA/enroot), and [Pyxis](https://github.com/NVIDIA/pyxis). These packages are required to run Docker containers on a Slurm cluster. 

         To enable this installation step, set the `enable_docker_enroot_pyxis` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install Docker/Enroot/Pyxis
         if Config.enable_docker_enroot_pyxis:
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
         ```

      1. You can integrate your HyperPod cluster with [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the [Slurm dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/), the [NVIDIA DCGM Exporter dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/), and the [EFA Metrics dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/) on Amazon Managed Grafana, you need to install the [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter), the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter), and the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). 

         To enable this installation step, set the `enable_observability` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install metric exporting software and Prometheus for observability
         
         if Config.enable_observability:
             if node_type == SlurmNodeType.COMPUTE_NODE:
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()
             
             if node_type == SlurmNodeType.HEAD_NODE:
                 wait_for_scontrol()
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_prometheus.sh").run()
         ```

1. Make sure that you upload all configuration files and setup scripts from **Step 2** to the S3 bucket you provide in the `CreateCluster` request in **Step 1**. For example, assume that your `create_cluster.json` has the following.

   ```
   "LifeCycleConfig": { 
   
       "SourceS3URI": "s3://sagemaker-hyperpod-lifecycle/src",
       "OnCreate": "on_create.sh"
   }
   ```

   Then, your `"s3://sagemaker-hyperpod-lifecycle/src"` should contain `on_create.sh`, `lifecycle_script.py`, `provisioning_parameters.json`, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

   ```
   └── lifecycle_files // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

   To upload the files, use the S3 command as follows.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```
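
The execution pattern that `lifecycle_script.py` follows can be sketched as below. This is a simplified illustration of the runner idea, not the repository's actual implementation; the sample repository's `ExecuteBashScript` class plays the analogous role.

```python
import subprocess


class ExecuteBashScript:
    """Run one lifecycle shell script, aborting the setup if it fails."""

    def __init__(self, script_path):
        self.script_path = script_path

    def run(self, *args):
        # check=True propagates a non-zero exit code as an exception,
        # so a failing lifecycle script stops the whole setup early.
        print(f"Running {self.script_path} {' '.join(args)}".rstrip())
        subprocess.run(["bash", self.script_path, *args], check=True)
```

In `lifecycle_script.py`, such a runner is invoked in order for each setup script, for example `ExecuteBashScript("./utils/install_docker.sh").run()`.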

# What particular configurations HyperPod manages in Slurm configuration files

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you **do not** change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.
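
For illustration, the HyperPod-managed parameters might appear in `slurm.conf` as lines like the following. This is a sketch with placeholder values based on the three-node example earlier in this guide; the actual file that HyperPod generates may differ.

```
ClusterName=your-hyperpod-cluster
SlurmctldHost=controller-machine-1
NodeName=compute-nodes-[1-4] State=UNKNOWN
PartitionName=dev Nodes=compute-nodes-[1-4] Default=YES MaxTime=INFINITE State=UP
TaskPlugin=task/none
SchedulerParameters=permit_job_expansion
```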

# Slurm log rotations


SageMaker HyperPod provides automatic log rotation for Slurm daemon logs to help manage disk space usage and maintain system performance. Log rotation is crucial for preventing logs from consuming excessive disk space and ensuring optimal system operation by automatically archiving and removing old log files while maintaining recent logging information. Slurm log rotations are enabled by default when you create a cluster.

## How log rotation works


When enabled, the log rotation configuration:
+ Monitors all Slurm log files with the `.log` extension in the `/var/log/slurm/` folder on the controller, login, and compute nodes.
+ Rotates logs when they reach 50 MB in size.
+ Keeps up to two rotated log files before deleting the oldest.
+ Sends a SIGUSR2 signal to the Slurm daemons (`slurmctld`, `slurmd`, and `slurmdbd`) after rotation.

## List of log files rotated


Slurm logs are located in the `/var/log/slurm/` directory. Log rotation is enabled for all files that match `/var/log/slurm/*.log`. When rotation occurs, rotated files have numerical suffixes (such as `slurmd.log.1`). The following list is not exhaustive but shows some of the critical log files that rotate automatically:
+ `/var/log/slurm/slurmctld.log`
+ `/var/log/slurm/slurmd.log`
+ `/var/log/slurm/slurmdbd.log`
+ `/var/log/slurm/slurmrestd.log`

## Enable or disable log rotation


You can control the log rotation feature using the `enable_slurm_log_rotation` parameter in the `config.py` script of your cluster's lifecycle scripts, as shown in the following example:

```
class Config:
    # Set false if you want to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value
```

To disable log rotation, set the parameter to `False`, as shown in the following example:

```
enable_slurm_log_rotation = False
```

**Note**  
Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation. They also run on new nodes when added to the cluster. Updating the log rotation configurations must be done manually after cluster creation. The log rotation configuration is stored in `/etc/logrotate.d/sagemaker-hyperpod-slurm`. We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space. To disable log rotation, delete the `sagemaker-hyperpod-slurm` file or comment out its contents by adding `#` at the start of each line in the `sagemaker-hyperpod-slurm` file.

## Default log rotation settings


The following settings are configured automatically for each log file rotated:


| Setting | Value | Description | 
| --- | --- | --- | 
| rotate | 2 | Number of rotated log files to keep | 
| size | 50 MB | Maximum size before rotation | 
| copytruncate | enabled | Copies and truncates the original log file | 
| compress | disabled | Rotated logs are not compressed | 
| missingok | enabled | No error if log file is missing | 
| notifempty | enabled | Doesn't rotate empty files | 
| noolddir | enabled | Rotated files stay in same directory | 
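
Putting the settings above together, the `/etc/logrotate.d/sagemaker-hyperpod-slurm` file is consistent with a logrotate configuration along these lines. This is a sketch for illustration; the file HyperPod actually writes may differ in detail.

```
/var/log/slurm/*.log {
    size 50M
    rotate 2
    copytruncate
    nocompress
    missingok
    notifempty
    noolddir
    postrotate
        pkill -SIGUSR2 -x "slurmctld|slurmd|slurmdbd" || true
    endscript
}
```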

# Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster

To mount an Amazon FSx for Lustre shared file system to your HyperPod cluster, set up the following.

1. Use your Amazon VPC. 

   1. For HyperPod cluster instances to communicate within your VPC, make sure that you attach the permissions described in [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc) to the IAM role for SageMaker HyperPod.

   1. In `create_cluster.json`, include the following VPC information.

      ```
      "VpcConfig": { 
          "SecurityGroupIds": [ "string" ],
          "Subnets": [ "string" ]
      }
      ```

      For more tips about setting up Amazon VPC, see [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

1. To finish configuring Slurm with Amazon FSx, use one of the following approaches. You can find the Amazon FSx information either in the Amazon FSx console in your account or by running the AWS CLI command `aws fsx describe-file-systems`.

   **Option A: API-Driven Configuration (Recommended)**

   Specify the Amazon FSx configuration directly in the `CreateCluster` API payload using `InstanceStorageConfigs` within each instance group. This approach supports both FSx for Lustre and FSx for OpenZFS, and allows per-instance-group FSx configuration.

   ```
   "InstanceStorageConfigs": [
       {
           "FsxLustreConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx",
               "MountName": "1abcdefg"
           }
       }
   ]
   ```

   For FSx for OpenZFS, use `FsxOpenZfsConfig` instead:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx-openzfs"
           }
       }
   ]
   ```

   For more details, see [Getting started with SageMaker HyperPod using the AWS CLI](sagemaker-hyperpod-quickstart.md).

   **Option B: Legacy Configuration**

   Specify the Amazon FSx DNS name and Amazon FSx mount name in `provisioning_parameters.json` as shown in the figure in the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section.

   ```
   "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
   "fsx_mountname": "1abcdefg"
   ```
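Whichever option you choose, the storage configuration can also be assembled programmatically before you submit the cluster creation request. The following Python sketch builds `InstanceStorageConfigs` entries using the field names from the examples above; the helper function names and the instance group name are illustrative, not part of the API.

```python
def fsx_lustre_storage_config(dns_name: str, mount_name: str,
                              mount_path: str = "/fsx") -> dict:
    """Build one InstanceStorageConfigs entry for an FSx for Lustre
    file system, following the field names shown above."""
    return {
        "FsxLustreConfig": {
            "DnsName": dns_name,
            "MountPath": mount_path,
            "MountName": mount_name,
        }
    }

def fsx_openzfs_storage_config(dns_name: str,
                               mount_path: str = "/fsx-openzfs") -> dict:
    """Build one InstanceStorageConfigs entry for an FSx for OpenZFS
    file system (note: no MountName field)."""
    return {
        "FsxOpenZfsConfig": {
            "DnsName": dns_name,
            "MountPath": mount_path,
        }
    }

# Hypothetical instance group entry combining the pieces:
instance_group = {
    "InstanceGroupName": "worker-group-1",
    "InstanceStorageConfigs": [
        fsx_lustre_storage_config(
            "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
            "1abcdefg",
        ),
    ],
}
```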

# Validating the JSON configuration files before creating a Slurm cluster on HyperPod
Validating configuration files

To validate the JSON configuration files before submitting a cluster creation request, use the configuration validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). This script parses and compares your HyperPod cluster configuration JSON file and your Slurm configuration JSON file, and identifies any resource misconfiguration between the two files and across the Amazon EC2, Amazon VPC, and Amazon FSx resources they reference. For example, to validate the `create_cluster.json` and `provisioning_parameters.json` files from the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section, run the validation script as follows.

```
python3 validate-config.py --cluster-config create_cluster.json --provisioning-parameters provisioning_parameters.json
```

The following is an example output of a successful validation.

```
✔️  Validated instance group name worker-group-1 is correct ...

✔️  Validated subnet subnet-012345abcdef67890 ...
✔️  Validated security group sg-012345abcdef67890 ingress rules ...
✔️  Validated security group sg-012345abcdef67890 egress rules ...
✔️  Validated FSx Lustre DNS name fs-012345abcdef67890.fsx.us-east-1.amazonaws.com
✔️  Validated FSx Lustre mount name abcdefgh
✅ Cluster Validation succeeded
```

# Validating runtime before running production workloads on a HyperPod Slurm cluster
Validating runtime

To check the runtime before running any production workloads on a Slurm cluster on HyperPod, use the runtime validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py). This script checks whether the Slurm cluster has all packages installed for running Docker, whether the cluster has a properly mounted FSx for Lustre file system and a user directory sharing the file system, and whether the Slurm daemon is running on all compute nodes.

To run the script on multiple nodes at once, use `srun`, as shown in the following example, which runs the script on a Slurm cluster of 8 nodes.

```
# The following command runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py
```
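The pattern the script follows is simple to extend with your own checks. The sketch below is not the actual `hyperpod-precheck.py`; the function names and the `/fsx` mount path are illustrative assumptions showing the style of check the script performs.

```python
import os
import shutil

def docker_installed() -> bool:
    """Check that the docker CLI is available on PATH."""
    return shutil.which("docker") is not None

def fsx_mounted(mount_path: str = "/fsx") -> bool:
    """Check that the expected FSx mount point exists and is a mount."""
    return os.path.ismount(mount_path)

def run_prechecks() -> dict:
    """Return a name -> pass/fail map for each precheck."""
    return {"docker": docker_installed(), "fsx_mount": fsx_mounted()}
```

Because each check is a plain boolean function, the same file can be dispatched unchanged with `srun` across all nodes, and a failing node is identified by its hostname in the `srun` output.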

**Note**  
To learn more about the validation script such as what runtime validation functions the script provides and guidelines to resolve issues that don't pass the validations, see [Runtime validation before running workloads](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#35-runtime-validation-before-running-workloads) in the *Awsome Distributed Training GitHub repository*.

# Developing lifecycle scripts interactively on a HyperPod cluster node
Developing interactive lifecycle scripts

This section explains how you can interactively develop lifecycle scripts without repeatedly creating and deleting a HyperPod cluster.

1. Create a HyperPod cluster with the base lifecycle scripts.

1. Log in to a cluster node.

1. Develop a script (`configure_xyz.sh`) by editing and running it repeatedly on the node.

   1. HyperPod runs the lifecycle scripts as the root user, so we recommend that you also run `configure_xyz.sh` as the root user during development, to make sure that the script is tested under the same conditions HyperPod uses.

1. Integrate the script into `lifecycle_script.py` by adding a code line similar to the following.

   ```
   ExecuteBashScript("./utils/configure_xyz.sh").run()
   ```

1. Upload the updated lifecycle scripts to the S3 bucket that you initially used for uploading the base lifecycle scripts.

1. Test the integrated version of `lifecycle_script.py` by creating a new HyperPod cluster. You can also test the updated lifecycle scripts by using manual instance replacement, which provisions new instances that run the scripts. For detailed instructions, see [Manually replace a node](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html#sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace). Note that only worker nodes are replaceable.
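Conceptually, the `ExecuteBashScript(...).run()` call in step 4 wraps a subprocess invocation of the shell script. The following is a simplified stand-in, not the actual helper from the base lifecycle scripts, to illustrate what integrating `configure_xyz.sh` amounts to.

```python
import subprocess

class ExecuteBashScriptSketch:
    """Simplified stand-in for the base lifecycle scripts' ExecuteBashScript
    helper: runs a bash script and fails loudly on a nonzero exit code."""

    def __init__(self, script_path: str):
        self.script_path = script_path

    def run(self, *args: str) -> None:
        # check=True raises CalledProcessError if the script exits nonzero,
        # so a failing custom script aborts the lifecycle run visibly.
        subprocess.run(["bash", self.script_path, *args], check=True)
```

Writing your script so that it exits nonzero on failure is what lets HyperPod (and this sketch) detect that the lifecycle step did not complete.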