SageMaker HyperPod cluster management - Amazon SageMaker AI

SageMaker HyperPod cluster management

The following topics discuss logging and managing SageMaker HyperPod clusters.

Logging SageMaker HyperPod events

All events and logs from SageMaker HyperPod are saved to Amazon CloudWatch under the log group name /aws/sagemaker/Clusters/[ClusterName]/[ClusterID]. Every call to the CreateCluster API creates a new log group. The following table lists the log streams available in each log group.

Log Group Name | Log Stream Name
/aws/sagemaker/Clusters/[ClusterName]/[ClusterID] | LifecycleConfig/[instance-group-name]/[instance-id]
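
As a minimal sketch of the naming scheme above, the following Python helpers assemble the log group and log stream names from their parts. The function names and the sample values are hypothetical and are not part of any AWS SDK.

```python
def log_group_name(cluster_name: str, cluster_id: str) -> str:
    """Return the CloudWatch log group name for a HyperPod cluster."""
    return f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"


def log_stream_name(instance_group_name: str, instance_id: str) -> str:
    """Return the lifecycle script log stream name for one instance."""
    return f"LifecycleConfig/{instance_group_name}/{instance_id}"


# Hypothetical values, for illustration only.
print(log_group_name("my-cluster", "abcde12345"))
print(log_stream_name("worker-group", "i-0123456789abcdef0"))
```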

Logging SageMaker HyperPod at instance level

You can access the LifecycleScript logs published to CloudWatch during cluster instance configuration. Every instance within the created cluster generates a separate log stream, distinguishable by the LifecycleConfig/[instance-group-name]/[instance-id] format.

All logs written to /var/log/provision/provisioning.log are uploaded to the preceding CloudWatch stream. The sample LifecycleScripts at 1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config redirect their stdout and stderr to this location. If you use custom scripts, write their logs to /var/log/provision/provisioning.log so that they are available in CloudWatch.
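
If a custom lifecycle script is written in Python, one way to send its messages to that file is the standard logging module. The sketch below is an illustration under assumptions: the helper name get_provisioning_logger and the log format are hypothetical, and the script is assumed to run with permission to write under /var/log/provision.

```python
import logging
import os

# Path that is uploaded to CloudWatch, as described above.
PROVISIONING_LOG = "/var/log/provision/provisioning.log"


def get_provisioning_logger(path: str = PROVISIONING_LOG) -> logging.Logger:
    """Return a logger that appends to the provisioning log file."""
    logger = logging.getLogger(f"lifecycle:{path}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        handler = logging.FileHandler(path)  # opens the file in append mode
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Inside a custom script, a call such as get_provisioning_logger().info("Installing packages") would then append a line to the provisioning log.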

Lifecycle script log markers

CloudWatch logs for lifecycle scripts include specific markers to help you track execution progress and identify issues:

Marker | Description
START | Indicates the beginning of lifecycle script execution on the instance
[SageMaker] Lifecycle scripts were provided, with S3 uri: [s3://bucket-name/] and entrypoint script: [script-name.sh] | Indicates the S3 location and entrypoint script that will be used
[SageMaker] Downloading lifecycle scripts | Indicates scripts are being downloaded from the specified S3 location
[SageMaker] Lifecycle scripts have been downloaded | Indicates scripts have been successfully downloaded from S3
[SageMaker] The lifecycle scripts succeeded | Indicates successful completion of all lifecycle scripts
[SageMaker] The lifecycle scripts failed | Indicates failed execution of lifecycle scripts

These markers help you quickly identify where in the lifecycle script execution process an issue occurred. When troubleshooting failures, review the log entries to identify where the process stopped or failed.
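
To illustrate how the markers can be used, the following Python sketch scans log lines that you have already retrieved from a log stream and reports the last marker reached. The stage labels and the function name are hypothetical. The exact layout of each log line, such as any timestamp prefix, is an assumption, so the markers are matched anywhere in the line and START is matched as a whole word.

```python
import re

# Markers from the table above, paired with hypothetical stage labels.
# The "were provided" marker is matched by its fixed prefix because the
# full line also embeds the S3 URI and the entrypoint script name.
MARKERS = [
    (re.compile(r"\bSTART\b"), "started"),
    (re.compile(re.escape("[SageMaker] Lifecycle scripts were provided")), "scripts_provided"),
    (re.compile(re.escape("[SageMaker] Downloading lifecycle scripts")), "downloading"),
    (re.compile(re.escape("[SageMaker] Lifecycle scripts have been downloaded")), "downloaded"),
    (re.compile(re.escape("[SageMaker] The lifecycle scripts succeeded")), "succeeded"),
    (re.compile(re.escape("[SageMaker] The lifecycle scripts failed")), "failed"),
]


def last_stage(log_lines):
    """Return the label of the last marker found in the log lines, or None."""
    stage = None
    for line in log_lines:
        for pattern, label in MARKERS:
            if pattern.search(line):
                stage = label
    return stage
```

A result of "downloading", for example, would suggest that the process stopped before the download from S3 completed.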

Lifecycle script failure messages

If the lifecycle script exists but fails during execution, you receive an error message that includes the CloudWatch log group name and log stream name. If lifecycle scripts fail on multiple instances, the error message reports only one failed instance, but the log group should contain streams for all instances.

You can view the error message by calling the DescribeCluster API or by opening the cluster details page in the SageMaker console. In the console, the View lifecycle script logs button takes you directly to the CloudWatch log stream. The error message has the following format:

Instance [instance-id] failed to provision with the following error: "Lifecycle scripts did not run successfully. To view lifecycle script logs, visit log group '/aws/sagemaker/Clusters/[cluster-name]/[cluster-id]' and log stream 'LifecycleConfig/[instance-group-name]/[instance-id]'. If you cannot find corresponding lifecycle script logs in CloudWatch, please make sure you follow one of the options here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-faq-slurm.html#hyperpod-faqs-q1." Note that multiple instances may be impacted.
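
If you process this message programmatically, the log group and log stream names can be pulled out with a regular expression. The following Python sketch assumes the message follows the format shown above. The function name is hypothetical, and the pattern accepts both straight and curly single quotes because the quoting style in the actual message may vary.

```python
import re

_Q = "['\u2018\u2019]"       # a straight or curly single quote
_NAME = "[^'\u2018\u2019]+"  # everything up to the next single quote
_PATTERN = re.compile(
    f"log group {_Q}(?P<group>{_NAME}){_Q} "
    f"and log stream {_Q}(?P<stream>{_NAME}){_Q}"
)


def parse_log_location(message: str):
    """Return (log_group, log_stream) from a failure message, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return match["group"], match["stream"]
```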

Tagging resources

The AWS tagging system helps you manage, identify, organize, search for, and filter resources. SageMaker HyperPod supports tagging, so you can manage clusters as AWS resources. You can add or edit tags when you create a cluster or edit an existing one. To learn more about tagging in general, see Tagging your AWS resources.

Using the SageMaker HyperPod console UI

When you create a new cluster or edit an existing cluster, you can add, remove, or edit tags.

Using the SageMaker HyperPod APIs

When you write a CreateCluster or UpdateCluster API request file in JSON format, edit the Tags section.
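
As an illustration of the Tags section, the following Python sketch converts a simple key-value mapping into the list of Key and Value pairs that the Tags field uses and prints a request fragment as JSON. The helper name, cluster name, and tag values are hypothetical, and other required request fields, such as InstanceGroups, are omitted.

```python
import json


def tags_section(tags: dict) -> list:
    """Convert a {key: value} mapping into a list of Key/Value pairs."""
    return [{"Key": key, "Value": value} for key, value in tags.items()]


# Hypothetical request fragment; required fields such as InstanceGroups
# are left out to keep the focus on the Tags section.
request = {
    "ClusterName": "my-cluster",
    "Tags": tags_section({"project": "llm-training", "team": "ml-platform"}),
}
print(json.dumps(request, indent=2))
```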

Using the AWS CLI tagging commands for SageMaker AI

To tag a cluster

Use aws sagemaker add-tags as follows.

aws sagemaker add-tags --resource-arn cluster_ARN --tags Key=string,Value=string

To untag a cluster

Use aws sagemaker delete-tags as follows.

aws sagemaker delete-tags --resource-arn cluster_ARN --tag-keys "tag_key"

To list tags for a resource

Use aws sagemaker list-tags as follows.

aws sagemaker list-tags --resource-arn cluster_ARN
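
If you drive these commands from a script, the following Python sketch shows one way to assemble their argument lists. The function names are hypothetical. The sketch only builds the arguments and does not run them, and it assumes that tag keys and values contain no commas or other characters that need escaping in the AWS CLI shorthand syntax.

```python
def add_tags_command(cluster_arn: str, tags: dict) -> list:
    """Build the argument list for `aws sagemaker add-tags`."""
    args = ["aws", "sagemaker", "add-tags", "--resource-arn", cluster_arn, "--tags"]
    # One shorthand item per tag, for example Key=team,Value=ml
    args += [f"Key={key},Value={value}" for key, value in tags.items()]
    return args


def delete_tags_command(cluster_arn: str, tag_keys: list) -> list:
    """Build the argument list for `aws sagemaker delete-tags`."""
    return ["aws", "sagemaker", "delete-tags",
            "--resource-arn", cluster_arn, "--tag-keys", *tag_keys]


def list_tags_command(cluster_arn: str) -> list:
    """Build the argument list for `aws sagemaker list-tags`."""
    return ["aws", "sagemaker", "list-tags", "--resource-arn", cluster_arn]
```

Each returned list can be passed to subprocess.run(command, check=True) on a machine where the AWS CLI is installed and configured.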