BatchAddClusterNodes - Amazon SageMaker

BatchAddClusterNodes

Adds nodes to a HyperPod cluster by incrementing the target count for one or more instance groups. This operation returns a unique NodeLogicalId for each node being added, which can be used to track the provisioning status of the node. This API provides a safer alternative to UpdateCluster for scaling operations by avoiding unintended configuration changes.

Note

This API is only supported for clusters using Continuous as the NodeProvisioningMode.

Request Syntax

{
   "ClientToken": "string",
   "ClusterName": "string",
   "NodesToAdd": [
      {
         "IncrementTargetCountBy": number,
         "InstanceGroupName": "string"
      }
   ]
}

Request Parameters

For information about the parameters that are common to all actions, see Common Parameters.

The request accepts the following data in JSON format.

ClientToken

A unique, case-sensitive identifier that you provide to ensure the idempotency of the request. This token is valid for 8 hours. If you retry the request with the same client token and the same parameters within this timeframe, the API returns the same set of NodeLogicalIds with their latest status.

Type: String

Length Constraints: Minimum length of 0. Maximum length of 64.

Pattern: [\x21-\x7E]+

Required: No
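As a minimal sketch of producing a token that fits the constraints above, the following Python generates a random UUID and checks it against the documented pattern and maximum length. The helper name new_client_token is hypothetical and not part of any SDK; any string that matches the pattern and stays within 64 characters should work equally well.

```python
import re
import uuid

# Pattern and maximum length as documented for ClientToken.
CLIENT_TOKEN_PATTERN = re.compile(r"[\x21-\x7E]+")
CLIENT_TOKEN_MAX_LENGTH = 64


def new_client_token() -> str:
    """Return a random idempotency token (hypothetical helper)."""
    token = str(uuid.uuid4())  # 36 characters: hex digits and hyphens
    if len(token) > CLIENT_TOKEN_MAX_LENGTH:
        raise ValueError("token exceeds the documented maximum length")
    if CLIENT_TOKEN_PATTERN.fullmatch(token) is None:
        raise ValueError("token contains characters outside the documented pattern")
    return token
```

To get the idempotent behavior described above, reuse the same token when retrying the same request, and generate a fresh token for each new request.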

ClusterName

The name of the HyperPod cluster to which you want to add nodes.

Type: String

Length Constraints: Minimum length of 0. Maximum length of 256.

Pattern: (arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})

Required: Yes
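The documented pattern accepts either a cluster ARN or a short cluster name. The sketch below checks a value locally before sending a request; the function name is_valid_cluster_name and the sample values in the comments are illustrative assumptions, and the service remains the final authority on what it accepts.

```python
import re

# Pattern and maximum length as documented for ClusterName.
CLUSTER_NAME_PATTERN = re.compile(
    r"(arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})"
    r"|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})"
)
CLUSTER_NAME_MAX_LENGTH = 256


def is_valid_cluster_name(value: str) -> bool:
    """Return True if value fits the documented length and pattern."""
    if len(value) > CLUSTER_NAME_MAX_LENGTH:
        return False
    # fullmatch requires the whole string to match one of the two alternatives.
    return CLUSTER_NAME_PATTERN.fullmatch(value) is not None
```

Under this pattern, a short name must start and end with an alphanumeric character, so a value such as my-cluster passes while -bad and bad- do not.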

NodesToAdd

A list of instance groups and the number of nodes to add to each. You can specify up to 5 instance groups in a single request, with a maximum of 50 nodes total across all instance groups.

Type: Array of AddClusterNodeSpecification objects

Array Members: Minimum number of 1 item. Maximum number of 5 items.

Required: Yes
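The limits above (1 to 5 instance groups, at most 50 nodes in total) can be checked before the request is sent. The following is a minimal sketch, not an official client: the function name build_request is hypothetical, and the requirement that each increment be a positive integer is an assumption, since the per-item constraints are defined on AddClusterNodeSpecification rather than on this page.

```python
from typing import Dict, Optional

MAX_INSTANCE_GROUPS = 5  # documented maximum number of NodesToAdd items
MAX_TOTAL_NODES = 50     # documented maximum across all instance groups


def build_request(
    cluster_name: str,
    increments: Dict[str, int],
    client_token: Optional[str] = None,
) -> dict:
    """Build a BatchAddClusterNodes request body from {group name: node count}."""
    if not 1 <= len(increments) <= MAX_INSTANCE_GROUPS:
        raise ValueError("NodesToAdd must contain between 1 and 5 instance groups")
    # Assumption: each increment must be a positive integer.
    if any(count < 1 for count in increments.values()):
        raise ValueError("each IncrementTargetCountBy must be at least 1")
    if sum(increments.values()) > MAX_TOTAL_NODES:
        raise ValueError("at most 50 nodes can be added in a single request")

    request = {
        "ClusterName": cluster_name,
        "NodesToAdd": [
            {"InstanceGroupName": name, "IncrementTargetCountBy": count}
            for name, count in increments.items()
        ],
    }
    # ClientToken is optional, so include it only when one is provided.
    if client_token is not None:
        request["ClientToken"] = client_token
    return request
```

Keying the increments by instance group name means each group appears only once in the resulting request.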

Response Syntax

{
   "Failed": [
      {
         "ErrorCode": "string",
         "FailedCount": number,
         "InstanceGroupName": "string",
         "Message": "string"
      }
   ],
   "Successful": [
      {
         "InstanceGroupName": "string",
         "NodeLogicalId": "string",
         "Status": "string"
      }
   ]
}

Response Elements

If the action is successful, the service sends back an HTTP 200 response.

The following data is returned in JSON format by the service.

Failed

A list of errors that occurred during the node addition operation. Each entry includes the instance group name, error code, number of failed additions, and an error message.

Type: Array of BatchAddClusterNodesError objects

Successful

A list of nodes that were successfully added to the cluster. Each entry includes the instance group name, the current status of the node, and a NodeLogicalId that you can use to track the node's provisioning status with DescribeClusterNode. The NodeLogicalId is unique per cluster and does not change between instance replacements.

Type: Array of NodeAdditionResult objects
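Because a single response can contain both Successful and Failed entries, callers usually need to inspect both lists. The sketch below groups the returned NodeLogicalIds by instance group and totals the failed additions per group. The function name summarize_response is hypothetical, and the response dictionary is assumed to have the shape shown in Response Syntax, for example as returned by an SDK call such as boto3's batch_add_cluster_nodes, if your installed SDK version includes this operation.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def summarize_response(
    response: Dict[str, Any],
) -> Tuple[Dict[str, List[str]], Dict[str, int]]:
    """Return (NodeLogicalIds per group, failed node count per group)."""
    added: Dict[str, List[str]] = defaultdict(list)
    for entry in response.get("Successful", []):
        added[entry["InstanceGroupName"]].append(entry["NodeLogicalId"])

    failed: Dict[str, int] = defaultdict(int)
    for entry in response.get("Failed", []):
        failed[entry["InstanceGroupName"]] += entry["FailedCount"]

    return dict(added), dict(failed)


# Illustrative response only: the IDs, status, and error code are placeholders.
example_response = {
    "Successful": [
        {"InstanceGroupName": "workers", "NodeLogicalId": "example-id-1", "Status": "Pending"},
        {"InstanceGroupName": "workers", "NodeLogicalId": "example-id-2", "Status": "Pending"},
    ],
    "Failed": [
        {
            "InstanceGroupName": "gpu",
            "ErrorCode": "ExampleErrorCode",
            "FailedCount": 2,
            "Message": "example message",
        },
    ],
}

added, failed = summarize_response(example_response)
```

The collected NodeLogicalIds can then be passed to DescribeClusterNode to follow each node's provisioning status.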

Errors

For information about the errors that are common to all actions, see Common Errors.

ResourceLimitExceeded

You have exceeded a SageMaker resource limit. For example, you might have too many training jobs created.

HTTP Status Code: 400

ResourceNotFound

The resource being accessed was not found.

HTTP Status Code: 400

See Also

For more information about using this API in one of the language-specific AWS SDKs, see the following: