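Amazon SageMaker HyperPod essential commands guide

Amazon SageMaker HyperPod provides extensive command-line functionality for managing training workflows. This guide covers the essential commands for common operations, from connecting to your cluster to monitoring job progress.

Prerequisites

Before using these commands, make sure you have completed the following setup:

A SageMaker HyperPod cluster with a RIG has been created (typically in us-east-1)
An Amazon S3 output bucket has been created for storing training artifacts
IAM roles have been configured with the appropriate permissions
Training data has been uploaded in the correct JSONL format
The FSx for Lustre sync has completed (for your first job, verify this in the cluster logs)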

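Install the recipe CLI

Navigate to the root of the recipe repository before running the installation commands.

If you use a non-Forge customization technique, use the Hyperpodrecipes repository. If you customize with Forge, refer to the Forge-specific recipe repository.

Run the following commands to install the SageMaker HyperPod CLI:

Note

Make sure you are not inside an active conda/anaconda/miniconda environment or any other virtual environment.

If you are, exit the environment with one of the following commands:

conda/anaconda/miniconda environment: conda deactivate
Python virtual environment: deactivate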
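
The check described in the note can be scripted. The following is a minimal sketch, assuming that conda exports CONDA_DEFAULT_ENV and that Python venv/virtualenv activation exports VIRTUAL_ENV; the function name active_python_env is hypothetical and not part of the HyperPod CLI.

```shell
# Report which Python environment, if any, is active in the current shell.
# Assumption: conda sets CONDA_DEFAULT_ENV; venv/virtualenv set VIRTUAL_ENV.
active_python_env() {
  if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
    echo "conda:${CONDA_DEFAULT_ENV}"
  elif [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "venv:${VIRTUAL_ENV}"
  else
    echo "none"
  fi
}
```

If active_python_env prints anything other than none, run the matching deactivate command from the note before installing.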

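If you use a non-Forge customization technique, download the recipes by cloning the sagemaker-hyperpod-cli repository, which contains sagemaker_hyperpod_recipes, as follows: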


git clone -b release_v2 https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install -e .
cd ..
root_dir=$(pwd)
export PYTHONPATH=${root_dir}/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

If you are a Forge subscriber, download the recipes using the following process.


mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
pip install -e .

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

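Tip

To use a new virtual environment before running pip install -e ., run:

python -m venv nova_forge
source nova_forge/bin/activate
Your command-line prompt now shows (nova_forge) at the beginning
This ensures that no conflicting dependencies are present when you use the CLI

Purpose: Why run pip install -e .?

This command installs the SageMaker HyperPod CLI in editable mode, so you can use updated recipes without reinstalling each time. It also means that new recipes you add are picked up by the CLI automatically.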

Connect to your cluster

Connect to your cluster with the SageMaker HyperPod CLI before running any jobs:


export AWS_REGION=us-east-1 && hyperpod connect-cluster --cluster-name <your-cluster-name> --region us-east-1

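Important

This command creates a context file (/tmp/hyperpod_context.json) that subsequent commands require. If you see an error saying that this file does not exist, run the connect command again.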
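
A script can check for the context file before it issues further commands. The following is a minimal sketch; the function name hyperpod_context_present is hypothetical. It takes candidate paths as arguments because this guide mentions /tmp/hyperpod_context.json here and /tmp/hyperpod_current_context.json in the FileNotFoundError message under common issues, and which one your CLI version writes is not confirmed.

```shell
# Print the first candidate context file that exists and return success;
# return failure if none of the candidates exists.
hyperpod_context_present() {
  for candidate in "$@"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Example use (commented out so that defining the function has no side effects):
# hyperpod_context_present /tmp/hyperpod_context.json /tmp/hyperpod_current_context.json \
#   || echo "No context file found; run hyperpod connect-cluster first" >&2
```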

Pro tip: You can further configure your cluster to always use the kubeflow namespace by adding the --namespace kubeflow argument to the command, as follows:


export AWS_REGION=us-east-1 && \
hyperpod connect-cluster \
--cluster-name <your-cluster-name> \
--region us-east-1 \
--namespace kubeflow

This removes the need to add -n kubeflow to every job command.

Start a training job

Note

If you run PPO/RFT jobs, make sure to add label selector settings to src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/cluster/k8s.yaml so that all pods are scheduled on the same node.


label_selector:
  required:
    sagemaker.amazonaws.com/instance-group-name:
      - <rig_group>

Launch a training job from a recipe, with optional parameter overrides:


hyperpod start-job -n kubeflow \
  --recipe fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft \
  --override-parameters '{
    "instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest"
  }'

Expected output:


Final command: python3 <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=fine-tuning/nova/nova_micro_p5_gpu_sft cluster_type=k8s cluster=k8s base_results_dir=/local/home/<username>/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" container="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:HP-SFT-DATAMIX-latest"

Prepared output directory at /local/home/<username>/results/<job-name>/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/<username>/results/<job-name>/<job-name>_launch.sh
Running Helm script: /local/home/<username>/results/<job-name>/<job-name>_launch.sh

NAME: <job-name>
LAST DEPLOYED: Mon Sep 15 20:56:50 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/<your-cluster-name>"
}

Check job status

Use kubectl to monitor your running job:


kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep <your-job-name>)

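Understanding pod statuses

The following table explains common pod statuses:

Status	Description
`Pending`	The pod has been accepted but has not yet been scheduled onto a node, or is waiting for its container images to be pulled
`Running`	The pod is bound to a node and at least one container is running or starting
`Succeeded`	All containers completed successfully and will not be restarted
`Failed`	All containers have terminated and at least one container failed
`Unknown`	The pod state cannot be determined (usually because of a node communication problem)
`CrashLoopBackOff`	A container repeatedly fails to start; Kubernetes is backing off before the next restart attempt
`ImagePullBackOff` / `ErrImagePull`	The container image cannot be pulled from the registry
`OOMKilled`	A container was terminated for exceeding its memory limit
`Completed`	The job or pod finished successfully (batch job completion)

Tip

Use the -w flag to watch pod status updates in real time. Press Ctrl+C to stop watching.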
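
When scripting around the status column, it can help to reduce the statuses above to a next action. The following is a minimal sketch; the function name pod_status_action and the three groupings (done, wait, investigate) are assumptions made for illustration, not behavior of kubectl or the HyperPod CLI.

```shell
# Map a pod status string from `kubectl get pods` to a suggested next action.
# The groupings are illustrative assumptions based on the status table.
pod_status_action() {
  case "$1" in
    Succeeded|Completed)
      echo "done" ;;
    Pending|Running)
      echo "wait" ;;
    Failed|Unknown|CrashLoopBackOff|ImagePullBackOff|ErrImagePull|OOMKilled)
      echo "investigate" ;;
    *)
      echo "unrecognized" ;;
  esac
}
```

For example, pod_status_action OOMKilled prints investigate, which would be a cue to check the job logs.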

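Monitor job logs

You can view your logs in any of the following three ways:

Using CloudWatch

Your logs are stored in CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, go to the CloudWatch home page in your account and search for your cluster name. For example, if your cluster is named my-hyperpod-rig, the log group has the following prefix:

Log group: /aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}
Once you are in the log group, you can find a specific log by node instance ID, such as hyperpod-i-00b3d8a1bf25714e4.
- Here, i-00b3d8a1bf25714e4 is the HyperPod friendly name of the node where your training job is running. Recall that the output of the earlier command kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-cpt-run) included a column named NODE.
- In this example, the "master" node of the job runs on hyperpod-i-00b3d8a1bf25714e4, so use that string to filter the log group. Select the log stream named SagemakerHyperPodTrainingJob/rig-group/[NODE] to view the logs.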
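
The naming pattern described above can be assembled in a script once you know the cluster name, the log group UUID, and the NODE value. The following is a minimal sketch; both function names are hypothetical. The rig-group segment is copied literally from the text, and exposing it as an optional second argument assumes that it may differ between clusters.

```shell
# Build the CloudWatch log group name from a cluster name and log group UUID.
hyperpod_log_group() {
  printf '/aws/sagemaker/Clusters/%s/%s\n' "$1" "$2"
}

# Build the log stream name from the NODE column value.
# The middle segment defaults to "rig-group" as written in this guide.
hyperpod_log_stream() {
  printf 'SagemakerHyperPodTrainingJob/%s/%s\n' "${2:-rig-group}" "$1"
}
```

For example, hyperpod_log_stream hyperpod-i-00b3d8a1bf25714e4 prints SagemakerHyperPodTrainingJob/rig-group/hyperpod-i-00b3d8a1bf25714e4.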

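Using CloudWatch Insights

If you already know your job name and don't want to go through all of the steps above, you can query all logs under /aws/sagemaker/Clusters/my-hyperpod-rig/{UUID} directly to find the relevant log.

CPT: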


fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting CPT Job/
| sort @timestamp desc
| limit 100

To query for job completion, replace Starting CPT Job with CPT Job completed.

You can then click through the results and pick the one that shows "Epoch 0", since that is your master node.
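
Because the start and completion queries differ only in the search phrase, the query text can be generated. The following is a minimal sketch; the function name insights_query_for is hypothetical, and the phrase is inserted into the regular expression without escaping, so it is assumed to contain no regex metacharacters or slashes.

```shell
# Print a CloudWatch Logs Insights query that filters messages by a phrase.
# $1: phrase to search for (case-insensitive); $2: optional result limit.
insights_query_for() {
  phrase=$1
  limit=${2:-100}
  printf '%s\n' \
    'fields @timestamp, @message, @logStream, @log' \
    "| filter @message like /(?i)${phrase}/" \
    '| sort @timestamp desc' \
    "| limit ${limit}"
}
```

Running insights_query_for "CPT Job completed" prints the completion variant of the query, ready to paste into the query editor.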

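Using the AWS CLI

You can also tail your logs with the AWS CLI. Before you do, check your AWS CLI version with aws --version. We also recommend using the provided utility script to follow logs live in your terminal.

AWS CLI v1: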


aws logs get-log-events \
--log-group-name /aws/sagemaker/YourLogGroupName \
--log-stream-name YourLogStream \
--start-from-head | jq -r '.events[].message'

AWS CLI v2:


aws logs tail /aws/sagemaker/YourLogGroupName \
--log-stream-names YourLogStream \
--since 10m \
--follow

List all active jobs

View all jobs running in your cluster:


hyperpod list-jobs -n kubeflow

Example output:


{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}

Cancel a job

Stop a running job at any time:


hyperpod cancel-job --job-name <job-name> -n kubeflow

Finding your job name

Option 1: From your recipe

The job name is specified in the run block of your recipe:


run:
  name: "my-test-run"                        # This is your job name
  model_type: "amazon.nova-micro-v1:0:128k"
  ...

Option 2: From the list-jobs command

Run hyperpod list-jobs -n kubeflow and copy the Name field from the output.
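
Copying the Name field can be automated with standard text tools. The following is a minimal sketch, assuming the list-jobs output has the same shape as the example output in this guide; the function name job_names_from_list is hypothetical. If jq is installed, a jq filter is a more robust choice than pattern matching on the text.

```shell
# Read list-jobs JSON on stdin and print each job's Name value, one per line.
# Matches the exact key "Name", so "Namespace" entries are ignored.
job_names_from_list() {
  grep -o '"Name"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/^.*:[[:space:]]*"\(.*\)"$/\1/'
}

# Example use (commented out because it needs a connected cluster):
# hyperpod list-jobs -n kubeflow | job_names_from_list
```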

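Run an evaluation job

Use an evaluation recipe to evaluate a trained model or a base model.

Prerequisites

Before running an evaluation job, make sure you have the following:

The checkpoint Amazon S3 URI from the manifest.json file of your training job (for trained models)
An evaluation dataset uploaded to Amazon S3 in the correct format
An Amazon S3 output path for the evaluation results

Command

Run the following command to start an evaluation job: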


hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'

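Parameter descriptions:

recipes.run.name: a unique name for your evaluation job
recipes.run.model_name_or_path: the Amazon S3 URI from manifest.json, or a base model path (for example, nova-micro/prod)
recipes.run.output_s3_path: the Amazon S3 location for the evaluation results
recipes.run.data_s3_path: the Amazon S3 location of your evaluation dataset

Tips:

Model-specific recipes: each model size (micro, lite, pro) has its own evaluation recipe
Base model evaluation: to evaluate a base model, use the base model path (for example, nova-micro/prod) instead of a checkpoint URI

Evaluation data format

Input format (JSONL):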


{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}

Output format:


{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}

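Field descriptions:

prompt: the formatted input sent to the model
inference: the response generated by the model
gold: the expected correct answer from the input dataset
metadata: optional metadata passed through from the input

Common issues

ModuleNotFoundError: No module named 'nemo_launcher': you might need to add nemo_launcher to your Python path, depending on where hyperpod_cli is installed. Example command: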


export PYTHONPATH=<path_to_hyperpod_cli>/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH

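FileNotFoundError: [Errno 2] No such file or directory: '/tmp/hyperpod_current_context.json' indicates that you have not run the hyperpod connect-cluster command.
If you don't see your job scheduled, check whether the SageMaker HyperPod CLI output includes the job name and other metadata. If it does not, reinstall Helm by running the following commands: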
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```