

# SageMaker HyperPod cluster resiliency

SageMaker HyperPod through Slurm orchestration provides the following cluster resiliency features.

**Topics**
+ [Health monitoring agent](sagemaker-hyperpod-resiliency-slurm-cluster-health-check.md)
+ [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md)
+ [Manually replace or reboot a node using Slurm](sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.md)

# Health monitoring agent


This section describes the set of health checks that SageMaker HyperPod uses to regularly monitor cluster instance health for issues with devices such as accelerators (GPU and Trainium cores) and networking (EFA). The SageMaker HyperPod health-monitoring agent (HMA) continuously monitors the health status of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

SageMaker HyperPod HMA performs the same health checks for both EKS and Slurm orchestrators. For more information about HMA, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).

# Automatic node recovery and auto-resume


**Note**  
As of September 11, 2025, HyperPod with Slurm orchestration now supports health monitoring agents. Run [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) and update to the latest version of the AMI in order to use this functionality.

This section describes Amazon SageMaker HyperPod's two complementary resilience features: automatic node recovery, which replaces faulty infrastructure without manual intervention, and auto-resume, which restarts training jobs from the last checkpoint after hardware failures.

## How automatic node recovery works


During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend the `Automatic` option. By default, clusters are created with automatic node recovery enabled.

Automatic node recovery runs when issues are found by the health-monitoring agent, basic health checks, or deep health checks. If set to `None`, the health monitoring agent labels the instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. We do not recommend this option.
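
If an existing cluster is set to `None`, you can change the setting at the cluster level with the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API. The following AWS CLI sketch assumes your current instance group configuration is saved in a hypothetical `instance-groups.json` file, because the update call expects the instance groups to be passed along with the recovery setting; `my-cluster` is also a placeholder.

```
aws sagemaker update-cluster \
    --cluster-name my-cluster \
    --instance-groups file://instance-groups.json \
    --node-recovery Automatic
```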

## Running a training job with the Amazon SageMaker HyperPod auto-resume functionality


This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure.

With the auto-resume functionality, if a job fails due to a hardware failure or a transient issue during training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced. The following hardware checks are run whenever a job fails while using auto-resume:


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NVIDIA SMI | GPU | The [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) utility is a well-known CLI to manage and monitor GPUs. The built-in health checker parses the output from nvidia-smi to determine the health of the instance. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Network | EFA | GPU and Trainium | To aid in the diagnostics of Elastic Fabric Adapter (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance. | 

**Note**  
When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

**Using the SageMaker HyperPod auto-resume functionality with Slurm**

When you use SageMaker HyperPod auto-resume with Slurm, you should run the job inside an exclusive allocation acquired by using either `salloc` or `sbatch`. In either case, you need to modify the entrypoint script so that all setup steps run in a single `srun` command when resuming the job. Through the entrypoint script, it is important to set up the environment on the replaced node to be consistent with the environment the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and runs as a single `srun` command.

**Tip**  
If you use `sbatch`, you can keep the batch script simple by creating a separate script for setting up the environment and using a single `srun` command.

1. Create a script using the following code example and save it as `train_auto_resume.sh`. This script sets up the training environment, assuming that no manual configuration was previously made on the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on it before the job resumes.
**Note**  
The following code example shows how to discover the Slurm node list associated with the job. Do not use the `$SLURM_JOB_NODELIST` environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The following code example shows how to define a new `NODE_LIST` variable to replace `SLURM_JOB_NODELIST`, and then set up the `MASTER_NODE` and `MASTER_ADDR` variables from the `NODE_LIST` variable.

   ```
   #!/bin/bash
   
   # Filename: train_auto_resume.sh
   # Sample containerized script to launch a training job with a single srun
   # which can be auto-resumed.
   
   # Place your training environment setup here.
   # Example: Install conda, docker, activate virtual env, etc.
   
   # Get the list of nodes for a given job: show the details of the Slurm job,
   # extract the NodeList field, and exclude nodes marked as excluded.
   NODE_LIST=$(scontrol show jobid=$SLURM_JOBID | \
               awk -F= '/NodeList=/{print $2}' | \
               grep -v Exc)
   
   # Determine the master node: convert the node list to hostnames
   # and select the first hostname as the master node.
   MASTER_NODE=$(scontrol show hostname $NODE_LIST | head -n 1)
   
   # Get the master node address: show the node information,
   # extract the NodeAddr field, and keep its first part.
   MASTER_ADDR=$(scontrol show node=$MASTER_NODE | \
                 awk -F= '/NodeAddr=/{print $2}' | \
                 awk '{print $1}')
   
   # Torchrun command to launch the training job
   torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
                          --nproc_per_node=1 \
                          --node_rank=$SLURM_NODEID \
                          --master_addr=$MASTER_ADDR \
                          --master_port=1234 \
                          <your_training_script.py>"
   
   # Execute the torchrun command in the 'pytorch' Conda environment,
   # streaming output live
   /opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
   ```
**Tip**  
You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep the dependency installation scripts to the [set of lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that are used during cluster creation. If you use a virtual environment hosted on a shared directory, you can also utilize this script to activate the virtual environment.

1. Launch the job with SageMaker HyperPod auto-resume enabled by adding the flag `--auto-resume=1` to indicate that the `srun` command should be automatically retried in case of hardware failure. 
**Note**  
If you have set up a resource allocation using `sbatch` or `salloc`, you can run multiple `srun` commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality only operates in the current [job step](https://slurm.schedmd.com/job_launch.html#step_allocation) of the `srun` command with the flag `--auto-resume=1`. In other words, activating auto-resume in an `srun` command doesn't apply to other `srun` commands launched within a resource allocation session.

   The following are `srun` command examples with `auto-resume` enabled.

   **Using sbatch**

   Because most of the logic for setting up the environment is already in `train_auto_resume.sh`, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as `batch.sh`.

   ```
   #!/bin/bash
   #SBATCH --nodes 2
   #SBATCH --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

   Run the preceding batch script using the following command.

   ```
   sbatch batch.sh
   ```

   **Using salloc**

   Start by acquiring an exclusive allocation, and run the `srun` command with the `--auto-resume` flag and the entrypoint script.

   ```
   salloc -N 2 --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```
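
   To confirm afterwards that a failed job step was re-queued and restarted, you can inspect the job's accounting records with the standard Slurm `sacct` utility. In this sketch, `1234` is a placeholder job ID.

   ```
   sacct -j 1234 --format=JobID,JobName,State,ExitCode
   ```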

## How automatic node recovery and auto-resume work together


When both automatic node recovery and auto-resume are active, they follow a coordinated approach to handling failures. If the HMA detects a hardware fault, the node is marked for drain regardless of job-level status. With automatic node recovery enabled, the nodes are automatically replaced once all the jobs running on them exit. In this scenario, for jobs with auto-resume enabled, if a step exits with a non-zero status, auto-resume kicks in and the jobs resume once the nodes are replaced. Jobs without auto-resume enabled simply exit, requiring manual resubmission by administrators or users.

**Note**  
If you use auto-resume, the nodes are always replaced (no reboots) when hardware failures are detected.

# Manually replace or reboot a node using Slurm


This section describes when you should manually reboot or replace a node, with instructions for both.

## When to manually reboot or replace a node


The HyperPod auto-resume functionality monitors whether the state of your Slurm nodes changes to `fail` or `down`. You can check the state of Slurm nodes by running `sinfo`.

If a node remains stuck or unresponsive and the auto-resume process does not recover it, you can manually initiate recovery. The choice between rebooting and replacing a node depends on the nature of the issue. Consider rebooting when facing temporary or software-related problems, such as system hangs, memory leaks, GPU driver issues, kernel updates, or hung processes. However, if you encounter persistent or hardware-related problems like failing GPUs, memory or networking faults, repeated health check failures, or nodes that remain unresponsive after multiple reboot attempts, node replacement is the more appropriate solution.
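
For example, the following sketch narrows the `sinfo` per-node listing to nodes whose state suggests trouble; in the default `sinfo -N` output, the node state is the fourth column.

```
# List nodes whose Slurm state contains down, fail, or drain.
sinfo -N --noheader | awk '$4 ~ /down|fail|drain/ {print $1, $4}'
```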

## Ways to manually reboot or replace nodes


SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod Reboot and Replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use traditional Slurm commands like `scontrol update`, though this legacy method requires direct access to the Slurm controller node. Both methods activate the same SageMaker HyperPod recovery processes.

## Manually reboot a node using reboot API


You can use the **BatchRebootClusterNodes** API to manually reboot a faulty node in your SageMaker HyperPod cluster.

Here is an example of running the reboot operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```
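
After the reboot finishes, you can check the instance status with the [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) API, for example as in the following sketch; the cluster name and node ID are placeholders.

```
aws sagemaker describe-cluster-node \
    --cluster-name test-cluster \
    --node-id i-0123456789abcdef0 \
    --query 'NodeDetails.InstanceStatus.Status'
```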

## Manually replace a node using replace API


You can use the **BatchReplaceClusterNodes** API to manually replace a faulty node in your SageMaker HyperPod cluster.

Here is an example of running the replace operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually reboot a node using Slurm


You can also use the `scontrol` Slurm commands to trigger node recovery. These commands interact directly with the Slurm control plane and invoke the same underlying SageMaker HyperPod recovery mechanisms.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to reboot.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Reboot"
```

This marks the node state as `fail` with the specified reason. SageMaker HyperPod detects this and reboots the instance. Avoid changing the node state or restarting the Slurm controller during the operation.
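
To watch the operation's progress, you can query the node's recorded state and reason with `scontrol show node`, as in the following sketch; `ip-10-1-2-3` is a placeholder node name.

```
# Check the state and the recorded reason for the node.
scontrol show node ip-10-1-2-3 | grep -E 'State=|Reason='
```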

## Manually replace a node using Slurm


You can use the `scontrol update` command as follows to replace a node.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to replace.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Replace"
```

After you run this command, the node goes into the `fail` state, waits for the currently running jobs to finish, is replaced with a healthy instance, and is recovered with the same host name. This process takes time depending on instance availability in your Availability Zone and the time it takes to run your lifecycle scripts. During the update and replacement processes, avoid changing the state of the node manually again or restarting the Slurm controller; doing so can lead to a replacement failure. If the node does not recover or turn to the `idle` state after a long time, contact [AWS Support](https://console.aws.amazon.com/support/).

## Manually force change a node


If the faulty node is continuously stuck in the `fail` state, the last resort you might try is to manually force change the node state to `down`. This requires administrator privileges (sudo permissions).

**Warning**  
Proceed carefully before you run the following command, as it force-kills all jobs and you might lose all unsaved work.

```
scontrol update node=<ip-ipv4> state=down reason="Action:Replace"
```