Node Lifecycle and Labels - Amazon SageMaker AI

Node Lifecycle and Labels

Amazon SageMaker HyperPod performs deep health checks on cluster instances when a HyperPod cluster is created or updated, before GPU partitioning begins. After partitioning, the HyperPod health-monitoring agent continuously monitors the health status of the GPU-partitioned instances.

MIG Configuration States

Nodes with GPU partition configuration go through several states:

  • Pending - Node is being configured with a MIG profile

  • Configuring - GPU Operator is applying MIG partitioning

  • Success - GPU partitioning completed successfully

  • Failed - GPU partitioning encountered an error
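The four states above can be mapped to a next step in a small shell helper. This is a minimal sketch, not part of HyperPod: the function name `mig_state_action` and the returned action words are hypothetical, and it assumes the state strings match the names listed above, compared without regard to case.

```shell
# Hypothetical helper: map a MIG configuration state to a suggested next step.
# Assumes the state names from the list above; the comparison ignores case.
mig_state_action() {
  state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$state" in
    pending|configuring) echo "wait" ;;         # partitioning still in progress
    success)             echo "schedule" ;;     # node is ready for workloads
    failed)              echo "investigate" ;;  # check node conditions and events
    *)                   echo "unknown" ;;      # empty or unrecognized value
  esac
}
```

For example, `mig_state_action Configuring` prints `wait`.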

Monitoring Node States

    # Check node health status
    kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable

    # Monitor MIG configuration progress
    kubectl get node NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

    # Check for configuration errors
    kubectl describe node NODE_NAME | grep -A 5 "Conditions:"
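To wait for partitioning to finish instead of re-running the label query by hand, the jsonpath query can be wrapped in a polling loop. The following is a sketch under stated assumptions: `get_mig_state` and `wait_for_mig` are hypothetical names, the attempt count and interval are arbitrary defaults, and the label is assumed to report `success` or `failed` (compared without regard to case) when partitioning ends.

```shell
# Hypothetical wrapper around the label query for a single node.
get_mig_state() {
  kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Poll until the state is success (returns 0) or failed (returns 1).
# Returns 2 if neither state is seen within the allowed attempts.
# Usage: wait_for_mig NODE_NAME [attempts] [interval_seconds]
wait_for_mig() {
  node=$1
  attempts=${2:-30}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(get_mig_state "$node" | tr '[:upper:]' '[:lower:]')
    case "$state" in
      success) return 0 ;;
      failed)  return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 2
}
```

A call such as `wait_for_mig NODE_NAME 60 5` would check the node every 5 seconds for up to 60 attempts.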

Custom Labels and Taints

You can manage MIG configuration with custom labels and taints, which let you label your GPU partitions and apply the same configuration across instances:

    {
      "KubernetesConfig": {
        "Labels": {
          "nvidia.com/mig.config": "all-2g.10gb",
          "task-type": "inference",
          "environment": "production"
        },
        "Taints": [
          {
            "Key": "gpu-task",
            "Value": "mig-enabled",
            "Effect": "NoSchedule"
          }
        ]
      }
    }
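In Kubernetes, a node carrying a `NoSchedule` taint such as `gpu-task=mig-enabled` only accepts pods that tolerate that taint. The sketch below prints a pod manifest that tolerates the taint and selects nodes by the `task-type` label from the example. The function name `render_mig_pod`, the default pod name, the `IMAGE_URI` placeholder, and the `nvidia.com/mig-2g.10gb` resource name are all assumptions for illustration; the resource name in particular depends on the MIG strategy configured for the GPU Operator.

```shell
# Hypothetical helper: print a pod manifest that can land on the tainted,
# labeled MIG nodes from the example configuration.
# Usage: render_mig_pod [pod_name]
render_mig_pod() {
  pod_name=${1:-mig-inference-example}
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
spec:
  nodeSelector:
    task-type: inference
  tolerations:
    - key: gpu-task
      operator: Equal
      value: mig-enabled
      effect: NoSchedule
  containers:
    - name: main
      image: IMAGE_URI   # placeholder, replace with your image
      resources:
        limits:
          nvidia.com/mig-2g.10gb: 1   # assumed resource name
EOF
}
```

The output could then be piped to the cluster, for example `render_mig_pod my-pod | kubectl apply -f -`.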