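Manually quarantine, replace, or reboot a node

Learn how to manually quarantine, replace, and reboot a faulty node in a SageMaker HyperPod cluster orchestrated by Amazon EKS.

To quarantine a node and force delete a training pod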


kubectl cordon <node-name>

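After cordoning the node, force delete the pod. Do this when a pod has been stuck in the Terminating state for more than 30 minutes, or when kubectl describe pod shows "Node is not ready" in its Events.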


kubectl delete pods <pod-name> --grace-period=0 --force
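The two commands above can be wrapped into a small helper. The following is a minimal sketch rather than an official tool: the function name `quarantine_and_force_delete` and its argument order are hypothetical, and it assumes `kubectl` is already configured for the HyperPod EKS cluster. Between the two steps it also lists the pods scheduled on the node, so you can confirm which pod is stuck before deleting it.

```shell
# Hypothetical helper: cordon a node, show its pods, then force delete one pod.
# Usage: quarantine_and_force_delete <node-name> <pod-name> [namespace]
quarantine_and_force_delete() {
  local node="$1" pod="$2" namespace="${3:-default}"

  if [ -z "$node" ] || [ -z "$pod" ]; then
    echo "usage: quarantine_and_force_delete <node-name> <pod-name> [namespace]" >&2
    return 2
  fi

  # Mark the node unschedulable so that no new pods land on it.
  kubectl cordon "$node" || return 1

  # List every pod currently assigned to the node, across all namespaces.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$node" -o wide

  # Skip the graceful termination period and remove the pod immediately.
  kubectl delete pods "$pod" --namespace "$namespace" --grace-period=0 --force
}
```

For example, `quarantine_and_force_delete <node-name> <pod-name> <namespace>` runs the three steps in order and stops early if the cordon fails.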

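SageMaker HyperPod offers two methods for manual node recovery. The preferred method is to use the SageMaker HyperPod Reboot and Replace APIs, which provide a faster and more transparent recovery process and work with all orchestrators. Alternatively, you can use kubectl commands to label nodes for reboot and replace operations. Both methods activate the same SageMaker HyperPod recovery process.

Reboot a node using the Reboot API

To reboot a node, use the BatchRebootClusterNodes API.

The following example uses the AWS Command Line Interface (AWS CLI) to run the reboot operation on two instances of a cluster: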


aws sagemaker batch-reboot-cluster-nodes \
   --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
   --node-ids i-abc123 i-def456

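Replace a node using the Replace API

To replace a node, use the BatchReplaceClusterNodes API.

The following example uses the AWS Command Line Interface (AWS CLI) to run the replace operation on two instances of a cluster: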


aws sagemaker batch-replace-cluster-nodes \
   --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
   --node-ids i-abc123 i-def456
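After requesting a reboot, you may want to confirm that the node comes back. The sketch below polls the existing `aws sagemaker describe-cluster-node` command and reads `NodeDetails.InstanceStatus.Status` from its response. The helper names `node_status` and `wait_until_running`, the default of 20 attempts, and the 30-second delay are illustrative assumptions, not part of the HyperPod APIs. A replaced node is likely to return under a different instance ID, so this polling is mainly useful after a reboot.

```shell
# Hypothetical helper: print the status of one HyperPod cluster node.
# Usage: node_status <cluster-name-or-arn> <node-id>
node_status() {
  aws sagemaker describe-cluster-node \
    --cluster-name "$1" \
    --node-id "$2" \
    --query 'NodeDetails.InstanceStatus.Status' \
    --output text
}

# Hypothetical helper: poll until the node reports "Running".
# Usage: wait_until_running <cluster> <node-id> [max-attempts] [delay-seconds]
wait_until_running() {
  local cluster="$1" node_id="$2" attempts="${3:-20}" delay="${4:-30}"
  local i=1 status
  while [ "$i" -le "$attempts" ]; do
    status="$(node_status "$cluster" "$node_id")"
    echo "attempt $i: $node_id is ${status:-unknown}"
    if [ "$status" = "Running" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # The node never reported "Running" within the allowed attempts.
  return 1
}
```

Calling `wait_until_running test-cluster i-abc123` returns 0 once the node reports `Running` and returns 1 if it never does within the allowed attempts.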

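Replace a node using kubectl

Label the node to replace with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement, which triggers SageMaker HyperPod automatic node recovery. Note that you must also activate automatic node recovery when you create or update the cluster.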


kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement

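Reboot a node using kubectl

Label the node to reboot with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot, which triggers SageMaker HyperPod automatic node recovery. Note that you must also activate automatic node recovery when you create or update the cluster.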


kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot

After you apply the UnschedulablePendingReplacement or UnschedulablePendingReboot label, you should see the node terminated or rebooted within a few minutes.
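The two labeling commands differ only in the label value, so they can share one helper. The following is a minimal sketch under stated assumptions: the function names `mark_node_for_recovery` and `show_node_health` are hypothetical, and the `--overwrite` flag is an addition that assumes the node may already carry a value for this label key, in which case `kubectl label` would otherwise refuse the change.

```shell
# Label key that SageMaker HyperPod watches, taken from the commands above.
HEALTH_LABEL="sagemaker.amazonaws.com/node-health-status"

# Hypothetical helper: label a node for "reboot" or "replace".
# Usage: mark_node_for_recovery <node-name> <reboot|replace>
mark_node_for_recovery() {
  local node="$1" action="$2" value
  case "$action" in
    reboot)  value="UnschedulablePendingReboot" ;;
    replace) value="UnschedulablePendingReplacement" ;;
    *)
      echo "usage: mark_node_for_recovery <node-name> <reboot|replace>" >&2
      return 2
      ;;
  esac
  # --overwrite lets kubectl replace a value the label key already holds.
  kubectl label nodes "$node" "$HEALTH_LABEL=$value" --overwrite
}

# Hypothetical helper: show the node with its health label as an extra column.
# Usage: show_node_health <node-name>
show_node_health() {
  kubectl get nodes "$1" --label-columns "$HEALTH_LABEL"
}
```

After `mark_node_for_recovery <node-name> reboot`, running `show_node_health <node-name>` every so often shows the label value and the node's readiness. For a replaced node, a "not found" error from kubectl indicates that the old node has been removed.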
