

# Managing SageMaker HyperPod EKS clusters using the AWS CLI
<a name="sagemaker-hyperpod-eks-operate-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using AWS CLI commands.

**Topics**
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md)
+ [Retrieving SageMaker HyperPod cluster details](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md)
+ [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md)
+ [Updating the SageMaker HyperPod platform software](sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software.md)
+ [Accessing SageMaker HyperPod cluster nodes](sagemaker-hyperpod-eks-operate-access-through-terminal.md)
+ [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-delete-cluster.md)

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-create-cluster"></a>

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.

1. Before creating a SageMaker HyperPod cluster:

   1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the *Amazon EKS User Guide*.

   1. Install the Helm chart as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). If you create an [Amazon Nova SageMaker HyperPod cluster](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html), you need a separate Helm chart.

1. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as `s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/`.

   For a quick start, download the sample script [on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh) from the AWSome Distributed Training GitHub repository and upload it to the S3 bucket. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
**Important**  
If you create an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attaching only the managed [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html), your cluster can only access Amazon S3 buckets whose names begin with the prefix `sagemaker-`.
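   The download-and-upload step can be sketched as follows. The bucket name and prefix below are placeholders; the `curl` and `aws s3 cp` commands are shown as comments because they require network access and configured AWS credentials.

   ```shell
   # Placeholder S3 destination -- must match LifeCycleConfig.SourceS3Uri in
   # your CreateCluster request later.
   S3_DEST="s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/"
   # Raw URL derived from the GitHub repository link above.
   SCRIPT_URL="https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh"

   # Download the sample lifecycle script, then upload it to the bucket:
   #   curl -sO "$SCRIPT_URL"
   #   aws s3 cp on_create.sh "$S3_DEST"
   echo "Lifecycle script destination: ${S3_DEST}on_create.sh"
   ```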

   If you create a restricted instance group, you don't need to download and run the lifecycle script. Instead, you need to run `install_rig_dependencies.sh`. 

   The prerequisites to run the `install_rig_dependencies.sh` script include:
   + AWS Node (CNI) and CoreDNS must both be enabled. These are standard EKS add-ons that are not managed by the standard SageMaker HyperPod Helm chart, but you can easily enable them in the EKS console under **Add-ons**.
   + The standard SageMaker HyperPod Helm chart must be installed before running this script.

   The `install_rig_dependencies.sh` script performs the following actions.
   + `aws-node` (CNI): creates a new `rig-aws-node` DaemonSet and patches the existing `aws-node` DaemonSet so that it avoids RIG nodes.
   + `coredns`: converted to a DaemonSet for RIGs to support multi-RIG use and prevent overloading.
   + `training-operators`: updated with RIG worker taint tolerations and a `nodeAffinity` favoring non-RIG instances.
   + Elastic Fabric Adapter (EFA): updated to tolerate the RIG worker taint and to use the correct container images for each Region.
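   If you prefer the CLI to the console for the first prerequisite, the two add-ons can also be enabled with `aws eks create-addon`. The cluster name below is a placeholder, and the commands are shown as comments because they require configured AWS credentials.

   ```shell
   # Placeholder EKS cluster name -- replace with your own.
   EKS_CLUSTER="my-eks-cluster"

   # AWS Node (CNI) corresponds to the "vpc-cni" add-on, CoreDNS to "coredns":
   #   aws eks create-addon --cluster-name "$EKS_CLUSTER" --addon-name vpc-cni
   #   aws eks create-addon --cluster-name "$EKS_CLUSTER" --addon-name coredns
   echo "add-ons to enable on $EKS_CLUSTER: vpc-cni coredns"
   ```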

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).
**Note**  
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

   ```
   // create_cluster.json
   {
       "ClusterName": "string",
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "RestrictedInstanceGroups": [ 
         { 
            "EnvironmentConfig": { 
               "FSxLustreConfig": { 
                  "PerUnitStorageThroughput": number,
                  "SizeInGiB": number
               }
            },
            "ExecutionRole": "string",
            "InstanceCount": number,
            "InstanceGroupName": "string",
            "InstanceStorageConfigs": [ 
               { ... }
            ],
            "InstanceType": "string",
            "OnStartDeepHealthChecks": [ "string" ],
            "OverrideVpcConfig": { 
               "SecurityGroupIds": [ "string" ],
               "Subnets": [ "string" ]
            },
            "ScheduledUpdateConfig": { 
               "DeploymentConfig": { 
                  "AutoRollbackConfiguration": [ 
                     { 
                        "AlarmName": "string"
                     }
                  ],
                  "RollingUpdatePolicy": { 
                     "MaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     },
                     "RollbackMaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     }
                  },
                  "WaitIntervalInSeconds": number
               },
               "ScheduleExpression": "string"
            },
            "ThreadsPerCore": number,
            "TrainingPlanArn": "string"
         }
      ],
       "VpcConfig": {
           "SecurityGroupIds": ["string"],
           "Subnets": ["string"]
       },
       "Tags": [{
           "Key": "string",
           "Value": "string"
       }],
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "string",
               "KubernetesConfig": {
                   "Labels": {
                       "nvidia.com/mig.config": "all-3g.40gb"
                   }
               }
           }
       },
       "NodeRecovery": "Automatic"
   }
   ```
**Flexible instance groups**  
Instead of specifying a single `InstanceType`, you can use the `InstanceRequirements` parameter to specify multiple instance types for an instance group. Note the following:  
+ `InstanceType` and `InstanceRequirements` are mutually exclusive. You must specify one or the other, but not both.
+ `InstanceRequirements.InstanceTypes` is an ordered list that determines provisioning priority. SageMaker HyperPod attempts to provision the first instance type in the list and falls back to subsequent types if capacity is unavailable. You can specify up to 20 instance types, and the list must not contain duplicates.
+ Flexible instance groups require continuous node provisioning mode.

The following example shows an instance group using `InstanceRequirements`:  

   ```
   {
       "InstanceGroupName": "flexible-ig",
       "InstanceRequirements": {
           "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
       },
       "InstanceCount": 10,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
   }
   ```

   Note the following when configuring a new SageMaker HyperPod cluster associated with an EKS cluster.
   + You can configure up to 20 instance groups under the `InstanceGroups` parameter.
   + For `Orchestrator.Eks.ClusterArn`, specify the ARN of the EKS cluster you want to use as the orchestrator.
   + For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).
   + For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
   + For the `Tags` parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see [Tagging AWS Resources User Guide](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).
   + For the `VpcConfig` parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
   + For `Orchestrator.Eks.KubernetesConfig.Labels`, you can optionally specify Kubernetes labels to apply to the nodes. To enable GPU partitioning with Multi-Instance GPU (MIG), add the `nvidia.com/mig.config` label with the desired MIG profile. For example, `"nvidia.com/mig.config": "all-3g.40gb"` configures all GPUs with the 3g.40gb partition profile. For more information about GPU partitioning and available profiles, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).
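   Pulling these notes together, the following is a minimal sketch of a `create_cluster.json` for a single GPU instance group. All names, IDs, and ARNs are placeholders; replace them with your own values.

   ```
   // create_cluster.json (minimal sketch -- placeholder values)
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [{
           "InstanceGroupName": "worker-group-1",
           "InstanceType": "ml.p5.48xlarge",
           "InstanceCount": 4,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
           "ThreadsPerCore": 1,
           "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
       }],
       "VpcConfig": {
           "SecurityGroupIds": ["sg-1234567890abcdef0"],
           "Subnets": ["subnet-1234567890abcdef0"]
       },
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
           }
       },
       "NodeRecovery": "Automatic"
   }
   ```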

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.
**Important**  
When running the `create-cluster` command with the `--cli-input-json` parameter, you must include the `file://` prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the `file://` prefix results in a parsing parameter error.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.
**Important**  
You can use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) operation to remove a restricted instance group (RIG). When a RIG is scaled down to 0, the FSx for Lustre file system won't be deleted. To completely remove the FSx for Lustre file system, you must remove the RIG entirely.  
Removing a RIG will not delete any artifacts stored in the service-managed Amazon S3 bucket. However, you should ensure all artifacts in the FSx for Lustre file system are fully synchronized to Amazon S3 before removal. We recommend waiting at least 30 minutes after job completion to ensure complete synchronization of all artifacts from the FSx for Lustre file system to the service-managed Amazon S3 bucket.
**Important**  
When using an onboarded On-Demand Capacity Reservation (ODCR), you must map your instance group to the same Availability Zone ID (AZ ID) as the ODCR by setting `OverrideVpcConfig` with a subnet in the matching AZ ID.  
Verify your `OverrideVpcConfig` configuration before deployment to avoid incurring duplicate charges for both the ODCR and On-Demand capacity.

# Retrieving SageMaker HyperPod cluster details
<a name="sagemaker-hyperpod-eks-operate-cli-command-cluster-details"></a>

Learn how to retrieve SageMaker HyperPod cluster details using the AWS CLI.

## Describe a cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster changes to **InService**, proceed to the next step. You can also use this API to retrieve failure messages from other HyperPod API operations.

## List details of cluster nodes
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

The response includes an `InstanceId` for each node, which is what you use to log in to the nodes (using `aws ssm`).
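For scripting, you can pull the instance IDs out of the response with `jq`. The abbreviated sample response below follows the shape of the `ListClusterNodes` response; real responses include more fields per node summary.

```shell
# Abbreviated sample list-cluster-nodes response (real responses include
# more fields, such as InstanceStatus and LaunchTime).
RESPONSE='{"ClusterNodeSummaries":[{"InstanceGroupName":"worker-group","InstanceId":"i-111222333444555aa","InstanceType":"ml.p5.48xlarge"}]}'

# Extract the instance IDs used for SSM sessions:
echo "$RESPONSE" | jq -r '.ClusterNodeSummaries[].InstanceId'
```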

## Describe details of a cluster node
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from the `list-cluster-nodes` output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about the underlying API operation and the additional flags for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.

# Updating SageMaker HyperPod cluster configuration
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
Important considerations:  
You cannot change the EKS cluster that your HyperPod cluster is associated with after the cluster is created.
If deep health checks are running on the cluster, this API will not function as expected. You might encounter an error message stating that deep health checks are in progress. To update the cluster, you should wait until the deep health checks finish.

1. Create an [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. For each instance group, you can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.
**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must specify the same instance type that you initially configured for the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

      For instance groups that use `InstanceRequirements` instead of `InstanceType`, you can add or remove instance types from the `InstanceTypes` list. However, you cannot remove an instance type that has active nodes running on it. You also cannot switch between `InstanceType` and `InstanceRequirements` when updating an existing instance group. `InstanceType` and `InstanceRequirements` are mutually exclusive.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change the values for both `SourceS3Uri` and `OnCreate` as you want to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   1. For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

   1. For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "string",
               "OnCreate": "string"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "NodeRecovery": "Automatic"
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```

# Updating the SageMaker HyperPod platform software
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software"></a>

When you create your SageMaker HyperPod cluster, SageMaker HyperPod selects an Amazon Machine Image (AMI) corresponding to the Kubernetes version of your Amazon EKS cluster.

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.
The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre.
All cluster nodes experience downtime (nodes appear as `<NotReady>` in the output of `kubectl get node`) while the patching is in progress. We recommend that you terminate all workloads before patching and resume them after the patch completes.   
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md#sagemaker-hyperpod-eks-operate-cli-command-describe-cluster).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

**Rolling upgrades with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, rolling upgrades spread each instance type proportionally across batches. For example, if an instance group has 100 instances (10 P5 and 90 G6) and you configure a 10% batch size, each batch contains 1 P5 instance and 9 G6 instances.
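The proportional split in that example can be sanity-checked with quick shell arithmetic (counts taken from the example above):

```shell
TOTAL=100            # total instances in the group
P5=10; G6=90         # per-type counts from the example
BATCH_PERCENT=10     # configured batch size of 10%

BATCH_SIZE=$(( TOTAL * BATCH_PERCENT / 100 ))
P5_PER_BATCH=$(( P5 * BATCH_SIZE / TOTAL ))
G6_PER_BATCH=$(( G6 * BATCH_SIZE / TOTAL ))
echo "batch size: $BATCH_SIZE (P5 per batch: $P5_PER_BATCH, G6 per batch: $G6_PER_BATCH)"
# → batch size: 10 (P5 per batch: 1, G6 per batch: 9)
```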

 When calling the `UpdateClusterSoftware` API, SageMaker HyperPod updates the Kubernetes version of the nodes by selecting the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) based on the Kubernetes version of your Amazon EKS cluster. It then runs the lifecycle scripts in the Amazon S3 bucket that you specified during the cluster creation or update. 

You can verify the kubelet version of a node by running the `kubectl describe node` command.

The Kubernetes version of SageMaker HyperPod cluster nodes does not automatically update when you update your Amazon EKS cluster version. After updating the Kubernetes version for your Amazon EKS cluster, you must use the `UpdateClusterSoftware` API to update your SageMaker HyperPod cluster nodes to the same Kubernetes version.

We recommend that you update your SageMaker HyperPod cluster after updating your Amazon EKS nodes, and that you avoid more than one version difference between the Amazon EKS cluster version and the SageMaker HyperPod cluster node version.

The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami)s to enhance security and improve the user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future SageMaker HyperPod DLAMI security patch updates, follow [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

# Accessing SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-eks-operate-access-through-terminal"></a>

You can directly access the nodes of an in-service SageMaker HyperPod cluster using the AWS CLI commands for AWS Systems Manager (SSM). Run `aws ssm start-session` with the host name of the node in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```
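The `--target` string is assembled from the three IDs described above. A small sketch, using the example values:

```shell
CLUSTER_ID="aa11bbbbb222"           # last segment of the ClusterArn from describe-cluster
INSTANCE_GROUP="controller-group"   # from list-cluster-nodes
INSTANCE_ID="i-111222333444555aa"   # from list-cluster-nodes

TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "$TARGET"
# Then: aws ssm start-session --target "$TARGET" --region us-west-2
```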

# Scaling down a SageMaker HyperPod cluster
<a name="smcluster-scale-down"></a>

You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.

The following page outlines two main approaches to scaling down:
+ **Scale down at the instance group level:** This approach uses the `UpdateCluster` API, with which you can:
  + Scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group. See [Scale down an instance group](#smcluster-scale-down-updatecluster).
  + Completely delete instance groups from your cluster. See [Delete instance groups](#smcluster-remove-instancegroup).
+ **Scale down at the instance level:** This approach uses the `BatchDeleteClusterNodes` API, with which you can specify the individual nodes you want to terminate. See [Scale down at the instance level](#smcluster-scale-down-batchdelete).

**Note**  
When scaling down at the instance level with `BatchDeleteClusterNodes`, you can terminate a maximum of 99 instances at a time. `UpdateCluster` supports terminating any number of instances.
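Because of the 99-node limit, larger instance-level terminations must be split into batches. The following bash sketch shows the chunking; the node IDs are synthetic, and the actual `batch-delete-cluster-nodes` call is left commented out because it requires a real cluster.

```shell
# 250 synthetic node IDs to illustrate chunking.
NODE_IDS=()
for n in $(seq 1 250); do NODE_IDS+=("i-$(printf '%05d' "$n")"); done

BATCH=99
for ((i=0; i<${#NODE_IDS[@]}; i+=BATCH)); do
  CHUNK=("${NODE_IDS[@]:i:BATCH}")
  echo "batch of ${#CHUNK[@]} nodes"
  # aws sagemaker batch-delete-cluster-nodes \
  #     --cluster-name your-hyperpod-cluster \
  #     --node-ids "${CHUNK[@]}"
done
# → batch of 99 nodes / batch of 99 nodes / batch of 52 nodes
```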

## Important considerations
<a name="smcluster-scale-down-considerations"></a>
+ When scaling down a cluster, you should ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is properly handled to avoid disruptions. 
+ Make sure to back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This can help prevent any potential data loss from the instance root volume. For more information about backup, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).
+ To invoke this API on an existing cluster, you must first patch the cluster by running the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. For more information about patching a cluster, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
+ Metering/billing for on-demand instances will automatically be stopped after scale down. To stop metering for scaled-down reserved instances, you should reach out to your AWS account team for support.
+ You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.

## Scale down at the instance group level
<a name="smcluster-scale-down-or-delete"></a>

The [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation allows you to change the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances in an instance group or removing entire instance groups. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.

### Scale down an instance group
<a name="smcluster-scale-down-updatecluster"></a>

Use this approach when you have an instance group that is idle and it's safe to terminate any of the instances for scaling down. When you submit an `UpdateCluster` request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.

**Scale-down behavior with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, HyperPod terminates the lowest-priority instance types first during scale-down. The priority is determined by the order of instance types in the `InstanceTypes` list, where the first type has the highest priority. This protects higher-priority instances, which are typically higher-performance, during scale-down operations.

**Note**  
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.   
Alternatively, you can choose to remove an instance group permanently. For more information, see [Delete instance groups](#smcluster-remove-instancegroup).

**To scale down with `UpdateCluster`**

1. Follow the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md). When you reach step **1.d** where you specify the **InstanceCount** field, enter a number that is smaller than the current number of instances to scale down the cluster.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

The following is an example of an `UpdateCluster` JSON object. Consider the case where your instance group currently has 2 running instances. If you set the **InstanceCount** field to 1, as shown in the example, then HyperPod randomly selects one of the instances and terminates it.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training-instances",
      "InstanceType": "instance-type",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    }
  ],
  "NodeRecovery": "Automatic"
}
```

### Delete instance groups
<a name="smcluster-remove-instancegroup"></a>

You can use the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation to remove entire instance groups from your SageMaker HyperPod cluster when they are no longer needed. This goes beyond simple scaling down, allowing you to completely eliminate specific instance groups from your cluster's configuration.

**Note**  
When removing an instance group:  
+ All instances within the targeted group are terminated.
+ The entire group configuration is deleted from the cluster.
+ Any workloads running on that instance group are stopped.

**To delete instance groups with `UpdateCluster`**

1. When following the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md):

   1. Set the optional `InstanceGroupsToDelete` parameter in your `UpdateCluster` JSON to the list of instance group names that you want to delete.

   1. When you specify the `InstanceGroups` list, make sure that it no longer includes the specifications of the instance groups you are removing.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

**Important**  
+ Your SageMaker HyperPod cluster must always maintain at least one instance group.
+ Ensure that all critical data is backed up before removal.
+ The removal process cannot be undone.

The following is an example of an `UpdateCluster` JSON object. Consider the case where a cluster currently has three instance groups: *training*, *prototype-training*, and *inference-serving*. You want to delete the *prototype-training* group.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training",
      "InstanceType": "instance-type",
      "InstanceCount": number-of-instances,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    },
    {
      "InstanceGroupName": "inference-serving",
      "InstanceType": "instance-type",
      "InstanceCount": 2,
      [...]
    }
  ],
  "InstanceGroupsToDelete": [ "prototype-training" ],
  "NodeRecovery": "Automatic"
}
```
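Because the removed group's specification must not remain in the `InstanceGroups` list, it's worth a quick sanity check that no group name appears in both `InstanceGroups` and `InstanceGroupsToDelete` before you submit. A minimal sketch, using a trimmed-down request file with illustrative group names:

```shell
# Write a trimmed-down request file; names and values are illustrative.
cat > delete-group-request.json <<'EOF'
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    { "InstanceGroupName": "training" },
    { "InstanceGroupName": "inference-serving" }
  ],
  "InstanceGroupsToDelete": [ "prototype-training" ],
  "NodeRecovery": "Automatic"
}
EOF

# Flag any group that is both kept and slated for deletion.
python3 - <<'EOF'
import json

cfg = json.load(open("delete-group-request.json"))
kept = {g["InstanceGroupName"] for g in cfg["InstanceGroups"]}
to_delete = set(cfg["InstanceGroupsToDelete"])
overlap = kept & to_delete
print("overlap:", ", ".join(sorted(overlap)) if overlap else "none")
EOF
```

If the script prints `overlap: none`, you can submit the request with `aws sagemaker update-cluster --cli-input-json file://delete-group-request.json`.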

## Scale down at the instance level
<a name="smcluster-scale-down-batchdelete"></a>

The `BatchDeleteClusterNodes` operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. `BatchDeleteClusterNodes` provides more granular control for targeted node removal and cluster optimization. For example, you might use `BatchDeleteClusterNodes` to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.

**API request and response**

When you submit a `BatchDeleteClusterNodes` request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted. 

The response includes two sections: 
+  `Failed`: A list of errors of type [`BatchDeleteClusterNodesError`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodesError.html) - one entry per instance ID that could not be deleted.
+  `Successful`: The list of instance IDs successfully terminated. 

**Validation and error handling**

The API performs various validations, such as:
+ Verifying the node ID format (prefix of `i-` and Amazon EC2 instance ID structure). 
+ Checking the node list length, with a limit of 99 or fewer node IDs in a single `BatchDeleteClusterNodes` request.
+ Ensuring a valid SageMaker HyperPod cluster with the input cluster-name is present and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
+ Handling cases where instances are not found, have invalid status, or are in use.
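A lightweight client-side pre-check can catch the most common rejections (bad ID format, oversized batch) before the request is sent. The sketch below uses illustrative node IDs and assumes IDs of 8 to 17 hexadecimal characters after the `i-` prefix:

```shell
# Node IDs slated for deletion; these values are illustrative.
NODE_IDS="i-0123456789abcdef0 i-0fedcba9876543210"

# The API accepts at most 99 node IDs per request.
count=$(echo "$NODE_IDS" | wc -w)
if [ "$count" -gt 99 ]; then
  echo "too many node IDs: $count (max 99)"
fi

# Each ID must look like an EC2 instance ID: prefix i- plus hex digits.
for id in $NODE_IDS; do
  if echo "$id" | grep -Eq '^i-[0-9a-f]{8,17}$'; then
    echo "ok: $id"
  else
    echo "invalid: $id"
  fi
done
```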

**API response codes**
+  The API returns a `200` status code for fully successful requests (all input node IDs pass validation) and for partially successful requests (only some input node IDs fail validation). 
+  If the entire request fails (for example, all input node IDs fail validation), the API returns a `400` Bad Request response with the appropriate error messages and error codes. 

**Example**

The following is an example of **scaling down a cluster at the instance level** using the AWS CLI:

```
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-444445555566666"]'
```
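You can then inspect the command's JSON response to confirm which nodes were actually terminated. The sketch below parses a hypothetical saved response; the top-level `Successful` and `Failed` sections follow the API description above, and the per-error fields (`Code`, `Message`, `NodeId`) come from the `BatchDeleteClusterNodesError` shape:

```shell
# A hypothetical response, as captured from the CLI with e.g.:
#   aws sagemaker batch-delete-cluster-nodes ... > response.json
cat > response.json <<'EOF'
{
  "Successful": [ "i-0123456789abcdef0" ],
  "Failed": [
    { "Code": "NodeIdNotFound", "Message": "Node not found", "NodeId": "i-0fedcba9876543210" }
  ]
}
EOF

# Report terminated nodes and per-node failures.
python3 - <<'EOF'
import json

resp = json.load(open("response.json"))
for node in resp.get("Successful", []):
    print("terminated:", node)
for err in resp.get("Failed", []):
    print("failed:", err["NodeId"], "-", err["Code"])
EOF
```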

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

This API only cleans up the SageMaker HyperPod resources. It doesn't delete any resources of the associated EKS cluster, including the Amazon EKS cluster itself, EKS Pod Identities, Amazon FSx volumes, and EKS add-ons, or the initial configuration you added to your EKS cluster. If you want to clean up all resources, make sure that you also clean up the EKS resources separately. 

Make sure that you first delete the SageMaker HyperPod resources, followed by the EKS resources. Performing the deletion in the reverse order may result in lingering resources.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.