Manually quarantine, replace, or reboot a node
Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.
To quarantine a node and force delete a training pod
kubectl cordon <node-name>
After quarantining the node, force delete the pod. This is useful when a pod
has been stuck in termination for more than 30 minutes, or when
kubectl describe pod shows 'Node is not ready' under Events.
kubectl delete pods <pod-name> --grace-period=0 --force
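The two steps above can be combined into one helper. This is a minimal sketch, not part of the HyperPod tooling: the function name quarantine_and_evict and its optional namespace argument are hypothetical, and it assumes kubectl is already configured for the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cordon a node, then force delete one pod on it.
# Usage: quarantine_and_evict <node-name> <pod-name> [namespace]
quarantine_and_evict() {
  local node="$1" pod="$2" namespace="${3:-}"
  local ns_args=()

  # Only pass --namespace when one was given; otherwise kubectl uses
  # the namespace of the current context, as in the original command.
  if [ -n "$namespace" ]; then
    ns_args=(--namespace "$namespace")
  fi

  # Stop here if the node cannot be cordoned.
  kubectl cordon "$node" || return 1

  # Skip graceful termination for the stuck pod.
  kubectl delete pods "$pod" "${ns_args[@]}" --grace-period=0 --force
}
```

For example, quarantine_and_evict <node-name> <pod-name> runs both commands in order and skips the delete if the cordon fails.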
To replace a node
Label the node to replace with
sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement,
which triggers SageMaker HyperPod automatic node recovery. Note that
automatic node recovery must also be activated during cluster creation or
update.
kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
To reboot a node
Label the node to reboot with
sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot,
which triggers SageMaker HyperPod automatic node recovery. Note that
automatic node recovery must also be activated during cluster creation or
update.
kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
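Because the replace and reboot procedures differ only in the label value, they can share one wrapper. The sketch below is illustrative: mark_node_for_recovery is a hypothetical name, and the --overwrite flag is an assumption, added so the command still succeeds if the node already carries a node-health-status label with another value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a node for replacement or reboot.
# Usage: mark_node_for_recovery <node-name> replace|reboot
mark_node_for_recovery() {
  local node="$1" action="$2" status

  # Map the requested action to the label value from the procedures above.
  case "$action" in
    replace) status="UnschedulablePendingReplacement" ;;
    reboot)  status="UnschedulablePendingReboot" ;;
    *)
      echo "usage: mark_node_for_recovery <node-name> replace|reboot" >&2
      return 2
      ;;
  esac

  # --overwrite is an assumption: it replaces any existing value of the label.
  kubectl label nodes "$node" \
    "sagemaker.amazonaws.com/node-health-status=${status}" --overwrite
}
```

Rejecting any action other than replace or reboot guards against typos in the label value, which would otherwise be applied without triggering recovery.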
After the UnschedulablePendingReplacement or UnschedulablePendingReboot
label is applied, you should see the node terminated or rebooted within a
few minutes.
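One way to confirm that the label was applied and to follow the node while recovery runs is to list the node with the health-status label shown as a column. This is a sketch under the assumption that kubectl is configured for the cluster; show_node_health is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show a node's status with its health-status label.
# Usage: show_node_health <node-name>
show_node_health() {
  local node="$1"

  # -L (--label-columns) adds the label's value as an extra output column.
  kubectl get node "$node" -L sagemaker.amazonaws.com/node-health-status
}
```

Adding --watch to the kubectl get command keeps it running and prints an updated row each time the node changes.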