AWS::SageMaker::Cluster - AWS CloudFormation

AWS::SageMaker::Cluster

Creates an Amazon SageMaker HyperPod cluster. SageMaker HyperPod is a capability of SageMaker for creating and managing persistent clusters for developing large machine learning models, such as large language models (LLMs) and diffusion models. To learn more, see Amazon SageMaker HyperPod in the Amazon SageMaker Developer Guide.

Syntax

To declare this entity in your CloudFormation template, use the following syntax:

JSON

{
  "Type" : "AWS::SageMaker::Cluster",
  "Properties" : {
      "AutoScaling" : ClusterAutoScalingConfig,
      "ClusterName" : String,
      "ClusterRole" : String,
      "InstanceGroups" : [ ClusterInstanceGroup, ... ],
      "NodeProvisioningMode" : String,
      "NodeRecovery" : String,
      "Orchestrator" : Orchestrator,
      "RestrictedInstanceGroups" : [ ClusterRestrictedInstanceGroup, ... ],
      "Tags" : [ Tag, ... ],
      "TieredStorageConfig" : TieredStorageConfig,
      "VpcConfig" : VpcConfig
    }
}

Properties

AutoScaling

The autoscaling configuration for the cluster. Enables automatic scaling of cluster nodes based on workload demand using a Karpenter-based system.

Required: No

Type: ClusterAutoScalingConfig

Update requires: No interruption

ClusterName

The name of the SageMaker HyperPod cluster.

Required: No

Type: String

Pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$

Minimum: 1

Maximum: 63

Update requires: Replacement
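
Because changing ClusterName replaces the resource, it can help to check a candidate name before deploying. The following is a minimal sketch that applies the pattern and length limits listed above. The function name is illustrative and is not part of any AWS SDK.

```python
import re

# Pattern and length bounds copied from the ClusterName property above.
CLUSTER_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MIN_LENGTH = 1
MAX_LENGTH = 63


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name meets the documented pattern and length limits."""
    if not MIN_LENGTH <= len(name) <= MAX_LENGTH:
        return False
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None
```

The explicit length check matters: the pattern alone limits the number of alphanumeric groups, not the total number of characters, so a name with many hyphens can match the pattern and still exceed 63 characters.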

ClusterRole

The Amazon Resource Name (ARN) of the IAM role that HyperPod assumes to perform cluster autoscaling operations. This role must have permissions for sagemaker:BatchAddClusterNodes and sagemaker:BatchDeleteClusterNodes. This is only required when autoscaling is enabled and when HyperPod is performing autoscaling operations.

Required: No

Type: String

Pattern: ^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$

Minimum: 20

Maximum: 2048

Update requires: No interruption
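
The ClusterRole constraints above can be checked locally before a template is submitted. This sketch only tests the string format; it does not confirm that the role exists or that it grants sagemaker:BatchAddClusterNodes and sagemaker:BatchDeleteClusterNodes. The function name and the sample ARNs in the comments are hypothetical.

```python
import re

# Pattern and length bounds copied from the ClusterRole property above.
CLUSTER_ROLE_PATTERN = re.compile(
    r"^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$"
)
MIN_LENGTH = 20
MAX_LENGTH = 2048


def is_valid_cluster_role_arn(arn: str) -> bool:
    """Return True if arn meets the documented pattern and length limits."""
    if not MIN_LENGTH <= len(arn) <= MAX_LENGTH:
        return False
    return CLUSTER_ROLE_PATTERN.fullmatch(arn) is not None


# Hypothetical values for illustration:
#   "arn:aws:iam::123456789012:role/HyperPodAutoscalingRole"  -> accepted
#   "arn:aws:iam::12345:role/TooShortAccountId"               -> rejected
```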

InstanceGroups

The instance groups of the SageMaker HyperPod cluster. To delete an instance group, remove it from the array.

Required: No

Type: Array of ClusterInstanceGroup

Minimum: 1

Update requires: No interruption

NodeProvisioningMode

The mode for provisioning nodes in the cluster. You can specify the following modes:

  • Continuous: Scaling behavior that enables 1) concurrent operation execution within instance groups, 2) continuous retry mechanisms for failed operations, 3) enhanced customer visibility into cluster events through detailed event streams, 4) partial provisioning capabilities. Your clusters and instance groups remain InService while scaling. This mode is only supported for EKS orchestrated clusters.

Required: No

Type: String

Allowed values: Continuous

Update requires: No interruption

NodeRecovery

Specifies whether to enable or disable the automatic node recovery feature of SageMaker HyperPod. Specify Automatic to enable the feature or None to disable it.

Required: No

Type: String

Allowed values: Automatic | None

Update requires: No interruption
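
NodeProvisioningMode and NodeRecovery both accept only the listed string values. The sketch below flags any other value in a Properties mapping. It assumes matching is case-sensitive, and the helper name is illustrative. Note that the allowed value None is the string "None", not a null value.

```python
# Allowed values copied from the NodeProvisioningMode and NodeRecovery
# properties above.
ALLOWED_VALUES = {
    "NodeProvisioningMode": {"Continuous"},
    "NodeRecovery": {"Automatic", "None"},
}


def find_invalid_values(properties: dict) -> dict:
    """Return {property: value} for each enumerated property whose value
    is not one of the documented strings. Other properties are ignored."""
    return {
        key: value
        for key, value in properties.items()
        if key in ALLOWED_VALUES
        and (not isinstance(value, str) or value not in ALLOWED_VALUES[key])
    }
```

For example, passing {"NodeRecovery": None} from Python reports the property as invalid, because the template needs the literal string "None".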

Orchestrator

The orchestrator type for the SageMaker HyperPod cluster. Currently, 'eks' is the only available option.

Required: No

Type: Orchestrator

Update requires: No interruption

RestrictedInstanceGroups

The specialized instance groups to create in the SageMaker HyperPod cluster for training models such as Amazon Nova.

Required: No

Type: Array of ClusterRestrictedInstanceGroup

Minimum: 1

Update requires: No interruption

Tags

The tags to apply to the cluster. Each tag consists of a key and an optional value and is used to manage metadata for SageMaker resources.

You can add tags to notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, models, labeling jobs, work teams, endpoint configurations, and endpoints. For more information on adding tags to SageMaker resources, see AddTags.

For more information on adding metadata to your AWS resources with tagging, see Tagging AWS resources. For advice on best practices for managing AWS resources with tagging, see Tagging Best Practices: Implement an Effective AWS Resource Tagging Strategy.

Required: No

Type: Array of Tag

Maximum: 50

Update requires: No interruption
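
The sketch below converts a plain mapping into the array form that Tags expects and enforces the maximum of 50 entries. The Key and Value field names are assumed from the Tag property type, which is not shown on this page, and the helper name is illustrative.

```python
MAX_TAGS = 50  # "Maximum: 50" from the Tags property above


def to_tag_list(tags: dict) -> list:
    """Convert {key: value} into a list of {"Key": ..., "Value": ...} objects.

    Raises ValueError if the mapping has more entries than the documented
    maximum.
    """
    if len(tags) > MAX_TAGS:
        raise ValueError(
            f"{len(tags)} tags exceeds the documented maximum of {MAX_TAGS}"
        )
    return [{"Key": key, "Value": value} for key, value in tags.items()]
```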

TieredStorageConfig

The configuration for managed tier checkpointing on the HyperPod cluster. When enabled, this feature uses a multi-tier storage approach for storing model checkpoints, providing faster checkpoint operations and improved fault tolerance across cluster nodes.

Required: No

Type: TieredStorageConfig

Update requires: No interruption

VpcConfig

Specifies an Amazon Virtual Private Cloud (VPC) that your SageMaker jobs, hosted models, and compute resources have access to. You can control access to and from your resources by configuring a VPC. For more information, see Give SageMaker Access to Resources in your Amazon VPC.

Required: No

Type: VpcConfig

Update requires: Replacement

Return values

Ref

Fn::GetAtt

The Fn::GetAtt intrinsic function returns a value for a specified attribute of this type. The following are the available attributes and sample return values.

For more information about using the Fn::GetAtt intrinsic function, see Fn::GetAtt.

ClusterArn

The Amazon Resource Name (ARN) of the SageMaker HyperPod cluster.

ClusterStatus

The status of the SageMaker HyperPod cluster.

CreationTime

The time when the SageMaker HyperPod cluster was created.

FailureMessage

The failure message of the SageMaker HyperPod cluster.
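
As an illustration of how these attributes might be exposed, the following sketch builds a template as a Python dictionary and adds stack outputs that read ClusterArn and ClusterStatus through Fn::GetAtt. CreationTime and FailureMessage can be referenced the same way. The logical ID and cluster name are hypothetical, and the resource sets only simple string properties; a deployable cluster will likely also need instance groups, whose nested types are documented separately.

```python
import json

LOGICAL_ID = "HyperPodCluster"  # hypothetical logical ID for the resource

template = {
    "Resources": {
        LOGICAL_ID: {
            "Type": "AWS::SageMaker::Cluster",
            "Properties": {
                "ClusterName": "example-cluster",  # hypothetical name
                "NodeRecovery": "Automatic",
            },
        }
    },
    # One output per attribute, each read with Fn::GetAtt.
    "Outputs": {
        attribute: {"Value": {"Fn::GetAtt": [LOGICAL_ID, attribute]}}
        for attribute in ("ClusterArn", "ClusterStatus")
    },
}

# Serialize to the JSON form used in the Syntax section.
print(json.dumps(template, indent=2))
```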