
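Amazon SageMaker HyperPod essential commands guide

Amazon SageMaker HyperPod provides extensive command-line functionality for managing training workflows. This guide covers the essential commands for common operations, from connecting to your cluster to monitoring job progress.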

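Prerequisites

Before using these commands, make sure you have completed the following setup:

A SageMaker HyperPod cluster created with a restricted instance group (RIG), typically in us-east-1
An output Amazon S3 bucket created for training artifacts
IAM roles configured with the appropriate permissions
Training data uploaded in the correct JSONL format
FSx for Lustre sync completed (verify in the cluster logs on the first job)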

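Install the recipe CLI

Navigate to the root of your recipe repository before running the installation commands.

If you are using a non-Forge customization technique, use the HyperPod recipes repository. For Forge-based customization, refer to the Forge-specific recipe repository.

Run the following commands to install the SageMaker HyperPod CLI: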

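Note

Make sure you are not inside an active conda / anaconda / miniconda environment or any other virtual environment.

If you are, exit the environment using:

conda deactivate for conda / anaconda / miniconda environments
deactivate for Python virtual environments

If you are using a non-Forge customization technique, download sagemaker-hyperpod-recipes (bundled in the sagemaker-hyperpod-cli repository) as follows: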


git clone -b release_v2 https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install -e .
cd ..
root_dir=$(pwd)
export PYTHONPATH=${root_dir}/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

If you are a Forge subscriber, download the recipes using the following procedure.


mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
pip install -e .

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

Tip

To use a new virtual environment, run the following before running pip install -e .:

python -m venv nova_forge
source nova_forge/bin/activate

Your command line now shows (nova_forge) at the start of the prompt. This ensures there are no competing dependencies when you use the CLI.

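Purpose: why run pip install -e . ?

This command installs the SageMaker HyperPod CLI in editable mode, so you can use updated recipes without reinstalling each time. It also lets you add new recipes that the CLI picks up automatically.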

Connect to your cluster

Before running any jobs, connect the SageMaker HyperPod CLI to your cluster:


export AWS_REGION=us-east-1 &&  hyperpod connect-cluster --cluster-name <your-cluster-name> --region us-east-1

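Important

This command creates a context file (/tmp/hyperpod_current_context.json) that subsequent commands require. If you see an error saying this file was not found, re-run the connect command.

Pro tip: add the --namespace kubeflow argument to the command to configure the cluster connection to always use the kubeflow namespace, as follows: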


export AWS_REGION=us-east-1 && \
hyperpod connect-cluster \
--cluster-name <your-cluster-name> \
--region us-east-1 \
--namespace kubeflow

This saves you from adding -n kubeflow to every command when you interact with your jobs.

Start a training job

Note

If you run a PPO/RFT job, make sure you add the label selector setting to src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/cluster/k8s.yaml so that all pods are scheduled on the same node.


label_selector:
  required:
    sagemaker.amazonaws.com/instance-group-name:
      - <rig_group>

Launch a training job using a recipe with optional parameter overrides:


hyperpod start-job -n kubeflow \
--recipe fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft \
--override-parameters '{
    "instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest"
  }'

Expected output (the paths, recipe name, and container tag reflect your own invocation):


Final command: python3 <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=fine-tuning/nova/nova_micro_p5_gpu_sft cluster_type=k8s cluster=k8s base_results_dir=/local/home/<username>/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" container="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:HP-SFT-DATAMIX-latest"

Prepared output directory at /local/home/<username>/results/<job-name>/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/<username>/results/<job-name>/<job-name>_launch.sh
Running Helm script: /local/home/<username>/results/<job-name>/<job-name>_launch.sh

NAME: <job-name>
LAST DEPLOYED: Mon Sep 15 20:56:50 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/<your-cluster-name>"
}
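
Hand-writing the JSON passed to --override-parameters inside shell quotes is easy to get wrong. The following is a minimal sketch, not part of the HyperPod CLI, that assembles the same start-job command from a Python dictionary so the JSON is always well formed. The helper name build_start_job_command is hypothetical; the recipe path, instance type, and container URI are copied from the command above.

```python
import json
import shlex


def build_start_job_command(recipe, overrides, namespace="kubeflow"):
    """Return the argv list for `hyperpod start-job` with JSON-encoded overrides.

    Hypothetical helper: it only assembles arguments shown in this guide.
    """
    return [
        "hyperpod", "start-job",
        "-n", namespace,
        "--recipe", recipe,
        "--override-parameters", json.dumps(overrides),
    ]


argv = build_start_job_command(
    "fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft",
    {
        "instance_type": "ml.p5.48xlarge",
        "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest",
    },
)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(argv))
```

To run the command directly instead of printing it, the same list can be passed to subprocess.run(argv, check=True).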

Check job status

Monitor your running job using kubectl:


kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep <your-job-name>)

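Understanding pod statuses

The following table describes common pod statuses:

Status	Description
`Pending`	The pod has been accepted but is not yet scheduled onto a node, or is waiting for container images to be pulled
`Running`	The pod is bound to a node and at least one container is running or starting
`Succeeded`	All containers completed successfully and will not be restarted
`Failed`	All containers have terminated, and at least one ended in failure
`Unknown`	The pod state cannot be determined (usually because of node communication issues)
`CrashLoopBackOff`	A container is failing repeatedly; Kubernetes is backing off between restart attempts
`ImagePullBackOff` / `ErrImagePull`	The container image cannot be pulled from the registry
`OOMKilled`	A container was terminated for exceeding its memory limit
`Completed`	The job or pod finished successfully (batch job completion)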

Tip

Use the -w flag to watch pod status updates in real time. Press Ctrl+C to stop watching.
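
If you need the status and node of a job's pods in a script rather than on screen, the structured output of kubectl get pods -n kubeflow -o json is easier to process than the table. The sketch below assumes that output has been captured as a string. The helper name pods_for_job and the sample pod names are hypothetical. Note that status.phase only takes the values Pending, Running, Succeeded, Failed, and Unknown; container-level reasons such as CrashLoopBackOff are not covered here.

```python
import json


def pods_for_job(pod_list_json, job_name):
    """Return (pod name, phase, node) for pods whose name contains job_name."""
    pods = json.loads(pod_list_json)["items"]
    return [
        (
            pod["metadata"]["name"],
            pod["status"].get("phase", "Unknown"),
            pod["spec"].get("nodeName"),  # None until the pod is scheduled
        )
        for pod in pods
        if job_name in pod["metadata"]["name"]
    ]


# Hypothetical sample shaped like `kubectl get pods -n kubeflow -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "my-cpt-run-master-0"},
         "spec": {"nodeName": "hyperpod-i-00b3d8a1bf25714e4"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "other-job-worker-0"},
         "spec": {},
         "status": {"phase": "Pending"}},
    ]
})

for name, phase, node in pods_for_job(sample, "my-cpt-run"):
    print(name, phase, node)
```

The node value is the same string that appears in the NODE column of the table output.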

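Monitor job logs

You can view your logs in one of three ways:

Using CloudWatch

Your logs are available in CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, navigate to the CloudWatch home page in your account and search for your cluster name. For example, if your cluster is named my-hyperpod-rig, the log group has the following prefix:

Log group: /aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}
Once you are in the log group, you can find your specific log by using the node instance ID, for example hyperpod-i-00b3d8a1bf25714e4.
- Here, i-00b3d8a1bf25714e4 is the HyperPod-friendly name of the machine on which the training job is running. It is the value shown in the NODE column of the output of the earlier command kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-cpt-run).
- In this case, the "master" node run is on hyperpod-i-00b3d8a1bf25714e4, so use that string to select the log stream to view. Select the one labeled SagemakerHyperPodTrainingJob/rig-group/[NODE]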

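Using CloudWatch Insights

If you have your job name handy and don't want to go through all the steps above, you can query all logs under /aws/sagemaker/Clusters/my-hyperpod-rig/{UUID} directly to find the individual log.

CPT: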


fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting CPT Job/
| sort @timestamp desc
| limit 100

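To find job completion instead, replace Starting CPT Job with CPT Job completed.

You can then click through the results and pick the one that says "Epoch 0", because that is your master node.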
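
Because the start and completion queries differ only in the search phrase, the query text can be generated. The following is a small sketch under that assumption. The function name build_insights_query is hypothetical, and the phrase is inserted into the regular expression as is, so it must not contain a forward slash.

```python
def build_insights_query(phrase, limit=100):
    """Return the CloudWatch Logs Insights query from this guide for a given phrase."""
    if "/" in phrase:
        # A slash would end the /.../ regex literal early.
        raise ValueError("phrase must not contain '/'")
    return "\n".join([
        "fields @timestamp, @message, @logStream, @log",
        f"| filter @message like /(?i){phrase}/",
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])


print(build_insights_query("Starting CPT Job"))
print()
print(build_insights_query("CPT Job completed"))
```

The generated text can be pasted into the Logs Insights query editor after you select the cluster log group.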

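Using the AWS CLI

You can also tail your logs using the AWS CLI. Before doing so, check your AWS CLI version with aws --version. Using a utility script for live log tracking in your terminal is also recommended.

For AWS CLI v1: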


aws logs get-log-events \
--log-group-name /aws/sagemaker/YourLogGroupName \
--log-stream-name YourLogStream \
--start-from-head | jq -r '.events[].message'

For AWS CLI v2:


aws logs tail /aws/sagemaker/YourLogGroupName \
--log-stream-names YourLogStream \
--since 10m \
--follow

List active jobs

View all jobs running in your cluster:


hyperpod list-jobs -n kubeflow

Example output:


{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}
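
To pick job names out of this output in a script, for example before calling cancel-job, the JSON can be parsed directly. The sketch below assumes the command prints only the JSON document shown above to standard output. The function name job_names_in_state is hypothetical.

```python
import json


def job_names_in_state(list_jobs_output, state="Running"):
    """Return the Name of every job whose State matches the given state."""
    jobs = json.loads(list_jobs_output).get("jobs", [])
    return [job["Name"] for job in jobs if job.get("State") == state]


# Sample copied from the example output in this guide.
sample_output = """
{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}
"""

print(job_names_in_state(sample_output))
```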

Cancel a job

Stop a running job at any time:


hyperpod cancel-job --job-name <job-name> -n kubeflow

Finding your job name

Option 1: From your recipe

The job name is specified in the run block of your recipe:


run:
  name: "my-test-run"                        # This is your job name
  model_type: "amazon.nova-micro-v1:0:128k"
  ...

Option 2: From the list-jobs command

Run hyperpod list-jobs -n kubeflow and copy the Name field from the output.

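Run an evaluation job

Evaluate a trained model or a base model using an evaluation recipe.

Prerequisites

Before running an evaluation job, make sure you have:

The checkpoint Amazon S3 URI from your training job's manifest.json file (for trained models)
An evaluation dataset uploaded to Amazon S3 in the correct format
An output Amazon S3 path for the evaluation results

Command

Run the following command to start an evaluation job: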


hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'

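Parameter descriptions:

recipes.run.name: Unique name for your evaluation job
recipes.run.model_name_or_path: Amazon S3 URI from manifest.json, or a base model path (for example, nova-micro/prod)
recipes.run.output_s3_path: Amazon S3 location for the evaluation results
recipes.run.data_s3_path: Amazon S3 location of your evaluation dataset

Tips:

Model-specific recipes: Each model size (micro, lite, pro) has its own evaluation recipe
Base model evaluation: To evaluate a base model, use the base model path (for example, nova-micro/prod) instead of a checkpoint URI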
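
A quick pre-flight check of the override values can catch unfilled placeholders before a job is submitted. The following is a minimal sketch, not a feature of the CLI. The function name check_eval_overrides is hypothetical, and the rules it applies are assumptions drawn from the example command above: all four recipes.run.* keys are present, no value is still an angle-bracket placeholder, and the two data paths are s3:// URIs.

```python
REQUIRED_KEYS = (
    "recipes.run.name",
    "recipes.run.model_name_or_path",
    "recipes.run.output_s3_path",
    "recipes.run.data_s3_path",
)
S3_KEYS = ("recipes.run.output_s3_path", "recipes.run.data_s3_path")


def check_eval_overrides(overrides):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_KEYS:
        value = overrides.get(key, "")
        if not value or value.startswith("<"):
            problems.append(f"{key} is missing or still a placeholder")
    for key in S3_KEYS:
        value = overrides.get(key, "")
        # Only flag real values; placeholders are already reported above.
        if value and not value.startswith("<") and not value.startswith("s3://"):
            problems.append(f"{key} should be an s3:// URI")
    return problems


# Hypothetical values for illustration only.
overrides = {
    "recipes.run.name": "my-eval-job",
    "recipes.run.model_name_or_path": "nova-micro/prod",
    "recipes.run.output_s3_path": "s3://my-bucket/eval-results/",
    "recipes.run.data_s3_path": "/tmp/eval-data.jsonl",
}
for problem in check_eval_overrides(overrides):
    print(problem)
```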

Evaluation data format

Input format (JSONL):


{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
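
The record above is shown across several lines for readability, but a JSONL file holds one complete JSON object per line. The sketch below serializes records in that form. The function name to_jsonl is hypothetical, and treating query and response as required string fields is an assumption based on the example, not a documented rule.

```python
import json


def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    lines = []
    for index, record in enumerate(records):
        for field in ("query", "response"):  # assumed to be required
            if not isinstance(record.get(field), str):
                raise ValueError(f"record {index}: '{field}' must be a string")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


records = [
    {
        "metadata": "{key:4, category:'apple'}",
        "system": "arithmetic-patterns, please answer the following with no other words: ",
        "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
        "response": "32",
    },
]

jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

The resulting text can be written to a file such as eval-data.jsonl and uploaded to the Amazon S3 location given in recipes.run.data_s3_path.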

Output format:


{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}

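Field descriptions:

prompt: Formatted input sent to the model
inference: Response generated by the model
gold: Expected correct answer from the input dataset
metadata: Optional metadata passed through from the input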
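
As one way to summarize these results yourself, the sketch below computes exact-match accuracy by comparing the first entry of inference with gold. It assumes that inference is always a string holding a Python-style list of strings, as in the example above; the function names exact_match and accuracy are hypothetical, and exact match is only one possible metric.

```python
import ast


def exact_match(record):
    """True when the first generated answer equals the gold answer."""
    predictions = ast.literal_eval(record["inference"])  # "['32']" -> ['32']
    return bool(predictions) and str(predictions[0]).strip() == record["gold"].strip()


def accuracy(records):
    """Fraction of records with an exact match; 0.0 for an empty input."""
    records = list(records)
    if not records:
        return 0.0
    return sum(exact_match(record) for record in records) / len(records)


# The first record mirrors the example above; the second is a made-up miss.
results = [
    {"inference": "['32']", "gold": "32"},
    {"inference": "['31']", "gold": "32"},
]
print(accuracy(results))
```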

Common issues

ModuleNotFoundError: No module named 'nemo_launcher': you might have to add nemo_launcher to your Python path, depending on where hyperpod_cli is installed. Sample command:


export PYTHONPATH=<path_to_hyperpod_cli>/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
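
To confirm whether the module is visible to the interpreter before and after changing PYTHONPATH, a short check like the following can help. It is a sketch only; the helper names are hypothetical, /home/user stands in for your own installation root, and the subpath is copied from the export command above.

```python
import importlib.util
from pathlib import PurePosixPath

# Subpath copied from the export command in this guide.
LAUNCHER_SUBPATH = (
    "sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/"
    "launcher/nemo/nemo_framework_launcher/launcher_scripts"
)


def launcher_scripts_dir(install_root):
    """Return the directory that should appear on PYTHONPATH."""
    return str(PurePosixPath(install_root) / LAUNCHER_SUBPATH)


def is_importable(module_name):
    """True when the interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("nemo_launcher importable:", is_importable("nemo_launcher"))
print("expected PYTHONPATH entry:", launcher_scripts_dir("/home/user"))
```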

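FileNotFoundError: [Errno 2] No such file or directory: '/tmp/hyperpod_current_context.json' means that you have not run the hyperpod connect-cluster command.
If you do not see your job scheduled, double-check that the output of the SageMaker HyperPod CLI includes the section with the job name and other metadata (NAME, LAST DEPLOYED, NAMESPACE, STATUS). If it does not, reinstall Helm by running: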
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```
