Create a NodeClass

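Important

You must start with 0 nodes in your instance group and let Karpenter handle autoscaling. If you start with more than 0 nodes, Karpenter scales them down to 0.

A node class (NodeClass) defines infrastructure-level settings that apply to groups of nodes in your Amazon EKS cluster, including network configuration, storage settings, and resource tagging. A HyperpodNodeClass is a custom NodeClass that maps to pre-created instance groups in SageMaker HyperPod. It constrains which instance types and Availability Zones Karpenter can use in its autoscaling decisions.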

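Considerations for creating a node class

You can specify up to 10 instance groups in a NodeClass.
When you use GPU partitioning with MIG (Multi-Instance GPU), Karpenter can automatically provision nodes from MIG-enabled instance groups. Make sure that your instance groups include MIG-supported instance types (ml.p4d.24xlarge, ml.p5.48xlarge, or ml.p5e/p5en.48xlarge) and that you configure the appropriate MIG labels during cluster creation. For more information about configuring GPU partitioning, see Using GPU partitions in Amazon SageMaker HyperPod.
If custom labels are applied to an instance group, you can view them in the desiredLabels field when you query the HyperpodNodeClass status. This includes MIG configuration labels such as nvidia.com/mig.config. When incoming jobs request MIG resources, Karpenter automatically scales instances that have the appropriate MIG labels applied.
If you choose to delete an instance group, we recommend removing it from your NodeClass before deleting it from your HyperPod cluster. If an instance group is deleted while a NodeClass still references it, the NodeClass is marked as not Ready for provisioning and is not used for subsequent scaling operations until the instance group is removed from the NodeClass.
When you remove an instance group from a NodeClass, Karpenter detects drift on the Karpenter-managed nodes in that instance group and disrupts those nodes according to your disruption budget controls.
The subnets used by an instance group should belong to the same Availability Zone. Subnets are specified either at the instance group level using OverrideVpcConfig or at the cluster level. The cluster-level VpcConfig is used by default.
Only on-demand capacity is currently supported. Instance groups with training plans or reserved capacity are not supported.
Instance groups with DeepHealthChecks (DHC) are not supported. A DHC takes roughly 60-90 minutes to complete, and pods remain in the Pending state during that time, which can cause over-provisioning.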
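
The MIG considerations above mean that Karpenter scales a MIG-labeled instance group only when a pending pod requests a MIG resource. The following is a minimal sketch of such a pod, not a definitive manifest. It assumes that the NVIDIA device plugin exposes MIG slices under the mixed strategy with resource names of the form nvidia.com/mig-<profile>, and that the instance group carries the nvidia.com/mig.config label value all-1g.5gb. The pod name, container name, and image placeholder are hypothetical.

```
# Hypothetical pod that requests a single MIG slice.
apiVersion: v1
kind: Pod
metadata:
  name: mig-sample-pod              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-test               # hypothetical name
      image: <your-cuda-image>      # placeholder, supply your own image
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          # Assumed resource name for the 1g.5gb profile (mixed MIG strategy)
          nvidia.com/mig-1g.5gb: 1
```

If your cluster uses a different MIG profile or strategy, the resource name in the limits section would need to change to match.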

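The following steps cover how to create a NodeClass.

Create a YAML file (for example, nodeclass.yaml) with your NodeClass configuration.
Apply the configuration to your cluster using kubectl.
Reference the NodeClass in your NodePool configuration.

The following is an example NodeClass that uses the ml.c5.xlarge and ml.c5.4xlarge instance types: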


```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: sample-nc
spec:
  instanceGroups:
    # Name of an instance group in the HyperPod cluster. The instance group must be pre-created.
    # MaxItems: 10
    - auto-c5-az1
    - auto-c5-4xaz2
```

Apply the configuration:
```
kubectl apply -f nodeclass.yaml
```

Monitor the NodeClass status to make sure that the Ready condition in the status is set to True:


```
kubectl get hyperpodnodeclass sample-nc -o yaml
```


```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  creationTimestamp: "<timestamp>"
  name: sample-nc
  uid: <resource-uid>
spec:
  instanceGroups:
  - auto-c5-az1
  - auto-c5-4xaz2
status:
  conditions:
  # True when all instance groups in the spec are present in the SageMaker cluster, False otherwise
  - lastTransitionTime: "<timestamp>"
    message: ""
    observedGeneration: 3
    reason: InstanceGroupReady
    status: "True"
    type: InstanceGroupReady
  # True if the subnets of the instance groups are discoverable, False otherwise
  - lastTransitionTime: "<timestamp>"
    message: ""
    observedGeneration: 3
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  # True when all dependent resources are Ready [InstanceGroup, Subnets]
  - lastTransitionTime: "<timestamp>"
    message: ""
    observedGeneration: 3
    reason: Ready
    status: "True"
    type: Ready
  instanceGroups:
  - desiredLabels:
    - key: <custom_label_key>
      value: <custom_label_value>
    - key: nvidia.com/mig.config
      value: all-1g.5gb
    instanceTypes:
    - ml.c5.xlarge
    name: auto-c5-az1
    subnets:
    - id: <subnet-id>
      zone: <availability-zone-a>
      zoneId: <zone-id-a>
  - instanceTypes:
    - ml.c5.4xlarge
    name: auto-c5-4xaz2
    subnets:
    - id: <subnet-id>
      zone: <availability-zone-b>
      zoneId: <zone-id-b>
```
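
The last step listed earlier, referencing the NodeClass from a NodePool, is not shown in the examples above. The following is a minimal sketch of what that reference could look like. It assumes the upstream Karpenter NodePool schema (karpenter.sh/v1) with a nodeClassRef made up of group, kind, and name. The NodePool name sample-np and the instance-type requirement are illustrative assumptions, so check them against your cluster's NodePool documentation before use.

```
# Sketch of a NodePool that points at the sample-nc NodeClass.
apiVersion: karpenter.sh/v1          # assumed Karpenter API version
kind: NodePool
metadata:
  name: sample-np                    # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.sagemaker.amazonaws.com
        kind: HyperpodNodeClass
        name: sample-nc              # must match metadata.name of the NodeClass
      requirements:
        # Assumed requirement: accept any instance type that the NodeClass offers
        - key: node.kubernetes.io/instance-type
          operator: Exists
```

You would apply this file with kubectl apply in the same way as the NodeClass.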
