start_cluster_health_check

SageMaker.Client.start_cluster_health_check(**kwargs)

Starts deep health checks for a SageMaker HyperPod cluster. Use the DescribeClusterNode API to track the progress of the deep health checks. Unhealthy nodes are automatically rebooted or replaced. For details, see Resilience-related Kubernetes labels by SageMaker HyperPod.

See also: AWS API Documentation
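
The description above mentions tracking progress with DescribeClusterNode. The following is a minimal polling sketch, not a definitive implementation. It assumes that the node status is reported under NodeDetails.InstanceStatus.Status and that the value DeepHealthCheckInProgress marks a node whose checks are still running; both should be confirmed against the describe_cluster_node reference. The helper name wait_for_node and its parameters are hypothetical.

```python
import time
from typing import Any, Callable

# Assumption: this status value marks a node whose deep health checks are still running.
IN_PROGRESS_STATUS = "DeepHealthCheckInProgress"


def wait_for_node(
    client: Any,
    cluster_name: str,
    node_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe_cluster_node until the node leaves the in-progress status.

    Hypothetical helper: returns the first status that is not IN_PROGRESS_STATUS,
    or raises TimeoutError after max_polls attempts.
    """
    for _ in range(max_polls):
        response = client.describe_cluster_node(
            ClusterName=cluster_name,
            NodeId=node_id,
        )
        # Assumed response shape: NodeDetails -> InstanceStatus -> Status.
        status = response["NodeDetails"]["InstanceStatus"]["Status"]
        if status != IN_PROGRESS_STATUS:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Node {node_id} still in {IN_PROGRESS_STATUS} after {max_polls} polls")
```

With a real client this might be called as wait_for_node(boto3.client('sagemaker'), 'my-hyperpod-cluster', 'i-0123456789abcdef0'), where the cluster name and instance ID are placeholders. A node polled immediately after the checks are requested may not yet report the in-progress status, and a replaced node may no longer be found under its old ID, so production code would likely need extra handling for both cases.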

Request Syntax

response = client.start_cluster_health_check(
    ClusterName='string',
    DeepHealthCheckConfigurations=[
        {
            'InstanceGroupName': 'string',
            'InstanceIds': [
                'string',
            ],
            'DeepHealthChecks': [
                'InstanceStress'|'InstanceConnectivity',
            ]
        },
    ]
)

Parameters:
  • ClusterName (string) –

    [REQUIRED]

    The string name or the Amazon Resource Name (ARN) of the SageMaker HyperPod cluster.

  • DeepHealthCheckConfigurations (list) –

    [REQUIRED]

    A list of configurations containing instance group names, EC2 instance IDs, and deep health checks to perform.

    • (dict) –

      The configuration of deep health checks for an instance group.

      Note

      Overlapping deep health check configurations will be merged into a single operation.

      • InstanceGroupName (string) – [REQUIRED]

        The name of the instance group.

      • InstanceIds (list) –

        A list of Amazon Elastic Compute Cloud (EC2) instance IDs on which to perform deep health checks.

        Note

        Leave this field blank to perform deep health checks on the entire instance group.

        • (string) –

      • DeepHealthChecks (list) – [REQUIRED]

        A list of deep health checks to be performed.

        • (string) –
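
As an illustration of the parameters above, the following sketch assembles the keyword arguments for a single instance group. The helper build_health_check_request is hypothetical, and the cluster name, group name, and instance ID are placeholder values. The two allowed check names are taken from the request syntax.

```python
from typing import Any, Dict, List, Optional

# Check names as listed in the request syntax.
VALID_CHECKS = {"InstanceStress", "InstanceConnectivity"}


def build_health_check_request(
    cluster_name: str,
    instance_group_name: str,
    checks: List[str],
    instance_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for start_cluster_health_check (hypothetical helper)."""
    unknown = set(checks) - VALID_CHECKS
    if not checks or unknown:
        raise ValueError(f"checks must be a non-empty subset of {sorted(VALID_CHECKS)}")

    config: Dict[str, Any] = {
        "InstanceGroupName": instance_group_name,  # required
        "DeepHealthChecks": list(checks),          # required
    }
    # InstanceIds is optional. It is left out when no IDs are given, since the
    # note above says a blank field covers the entire instance group.
    if instance_ids:
        config["InstanceIds"] = list(instance_ids)

    return {
        "ClusterName": cluster_name,
        "DeepHealthCheckConfigurations": [config],
    }


# Placeholder values for illustration only.
request = build_health_check_request(
    "my-hyperpod-cluster",
    "worker-group",
    ["InstanceStress"],
    ["i-0123456789abcdef0"],
)
# With a real client: response = client.start_cluster_health_check(**request)
```

Validating the check names locally is a design choice of this sketch; the service performs its own validation regardless.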

Return type:

dict

Returns:

Response Syntax

{
    'ClusterArn': 'string'
}

Response Structure

  • (dict) –

    • ClusterArn (string) –

      The Amazon Resource Name (ARN) of the SageMaker HyperPod cluster on which the deep health checks were initiated.

Exceptions

  • SageMaker.Client.exceptions.ResourceNotFound
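
A minimal sketch of handling the exception listed above. The helper start_checks_if_cluster_exists is hypothetical; it takes an already constructed SageMaker client, returns the ClusterArn from the response on success, and returns None when the service raises ResourceNotFound.

```python
from typing import Any, Dict, List, Optional


def start_checks_if_cluster_exists(
    client: Any,
    cluster_name: str,
    configurations: List[Dict[str, Any]],
) -> Optional[str]:
    """Start deep health checks and return the cluster ARN, or None if not found.

    Hypothetical helper wrapping start_cluster_health_check.
    """
    try:
        response = client.start_cluster_health_check(
            ClusterName=cluster_name,
            DeepHealthCheckConfigurations=configurations,
        )
    except client.exceptions.ResourceNotFound:
        # The named cluster (or a referenced resource) does not exist.
        return None
    return response["ClusterArn"]
```

With boto3 installed and credentials configured, the client argument would typically be boto3.client('sagemaker'). Other failures, such as validation or throttling errors, are deliberately not caught here and propagate to the caller.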