
Kubernetes cluster pre-training tutorial (GPU)

There are two ways to launch a training job in a GPU Kubernetes cluster:

- (Recommended) HyperPod command-line tool
- NeMo-style launcher

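Prerequisites

Before you start setting up your environment, make sure you have the following:

- A HyperPod GPU Kubernetes cluster that is set up correctly.
- A shared storage location. It can be an Amazon FSx file system or an NFS system that is accessible from the cluster nodes.
- Data in one of the following formats:
  - JSON
  - JSONGZ (compressed JSON)
  - ARROW
- (Optional) A HuggingFace token. You must get one if you use model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see User access tokens.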

GPU Kubernetes environment setup

To set up a GPU Kubernetes environment, do the following:

Set up a virtual environment. Make sure you're using Python 3.9 or later.
```
python3 -m venv ${PWD}/venv
source venv/bin/activate
```
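The Python 3.9 requirement can be checked before creating the virtual environment. The following is a minimal sketch, not part of the HyperPod tooling: `version_ge` and `python_version` are hypothetical helper names, and the comparison assumes version strings that contain at least a `major.minor` pair.

```shell
# Succeed when the major.minor of version $1 is at least that of version $2.
# Runs in a subshell (parentheses body) so the helper variables stay local.
version_ge() (
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    if [ "$have_major" -gt "$want_major" ]; then exit 0; fi
    if [ "$have_major" -lt "$want_major" ]; then exit 1; fi
    [ "$have_minor" -ge "$want_minor" ]
)

# Print the major.minor version of the given interpreter (default: python3).
python_version() {
    "${1:-python3}" -c 'import sys; print("%d.%d" % sys.version_info[:2])'
}
```

For example, `version_ge "$(python_version)" 3.9 || echo "Python 3.9 or later is required"` prints a message when the interpreter is too old.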

Install dependencies using one of the following methods:

(Recommended) HyperPod command-line tool method:

```
# Install the HyperPod command-line tools
git clone https://github.com/aws/sagemaker-hyperpod-cli
cd sagemaker-hyperpod-cli
pip3 install .
```

SageMaker HyperPod recipes method:

```
# Install SageMaker HyperPod recipes
git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
```

Set up kubectl and eksctl.
Install Helm.

Connect to your Kubernetes cluster:

```
aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
```
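The setup steps above depend on several command-line tools being available on `PATH`. As a rough sketch, a small helper can report which ones are missing. `missing_tools` is a hypothetical name introduced here for illustration, and the list of tools to pass is up to you.

```shell
# Print the name of every listed command that is not found on PATH, one per line.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools aws kubectl eksctl helm hyperpod` should print nothing once the environment is fully set up.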

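Launch the training job with the SageMaker HyperPod CLI

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the hf_llama3_8b_seq16k_gpu_p5x16_pretrain model.

- your_training_container: A Deep Learning container. To find the most recent release of the SMP container, see Release notes for the SageMaker model parallelism library.
- (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: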
```
"recipes.model.hf_access_token": "<your_hf_token>"
```


```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-8b",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s"
}'
```
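If you submit jobs repeatedly, it may be convenient to generate the `--override-parameters` JSON from shell variables rather than editing it by hand. The following is a minimal sketch under that assumption. `build_overrides` is a hypothetical helper, not part of the HyperPod CLI, and it inserts values verbatim, so they must not contain double quotes or backslashes.

```shell
# Print the override JSON used by the start-job example above.
# Arguments: run name, experiment dir, container image URI,
#            training data dir, validation data dir.
# Assumption: no argument contains a double quote or a backslash.
build_overrides() {
    printf '{\n'
    printf '  "recipes.run.name": "%s",\n' "$1"
    printf '  "recipes.exp_manager.exp_dir": "%s",\n' "$2"
    printf '  "container": "%s",\n' "$3"
    printf '  "recipes.model.data.train_dir": "%s",\n' "$4"
    printf '  "recipes.model.data.val_dir": "%s",\n' "$5"
    printf '  "cluster": "k8s",\n'
    printf '  "cluster_type": "k8s"\n'
    printf '}\n'
}
```

The result can then be passed as `--override-parameters "$(build_overrides hf-llama3-8b /data/<your_exp_dir> <image_uri> <your_train_data_dir> <your_val_data_dir>)"`, with the placeholders replaced by your own values.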

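After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job STATUS changes to Running, you can examine the logs by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the STATUS shown by kubectl get pods becomes Completed.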
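When scripting around these checks, the STATUS column can be pulled out of the `kubectl get pods` table. The sketch below assumes the default five-column output shown above. `pod_status` is a hypothetical helper name introduced for illustration.

```shell
# Read `kubectl get pods` output on stdin and print the STATUS column
# (third field) of the row whose NAME equals $1.
# Prints nothing when no row matches.
pod_status() {
    awk -v pod="$1" '$1 == pod { print $3 }'
}
```

For example, `kubectl get pods | pod_status <name-of-pod>` prints a single word such as Pending, Running, or Completed, which a script can compare against before fetching logs.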

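Launch the training job with the recipes launcher

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, and then running the launch script.

In k8s.yaml, update persistent_volume_claims. This setting mounts the Amazon FSx claim to the /data directory of each compute pod.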
```
persistent_volume_claims:
  - claimName: fsx-claim
    mountPath: data
```

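In config.yaml, update repo_url_or_path under git.

```
git:
  repo_url_or_path: <training_adapter_repo>
  branch: null
  commit: null
  entry_script: null
  token: null
```

Update launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh:

- your_container: A Deep Learning container. To find the most recent release of the SMP container, see Release notes for the SageMaker model parallelism library.
- (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: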
```
recipes.model.hf_access_token=<your_hf_token>
```


```
#!/bin/bash
# Users should set up their cluster type in /recipes_collection/config.yaml
REGION="<region>"
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
EXP_DIR="<your_exp_dir>" # Location to save experiment info, including logs and checkpoints
TRAIN_DIR="<your_training_data_dir>" # Location of the training dataset
VAL_DIR="<your_val_data_dir>" # Location of the validation dataset

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR"
```

Launch the training job:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```
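Because the launch script ships with placeholders such as `<region>` and `<your_exp_dir>`, it can help to confirm that none remain before launching. The following is a minimal sketch. `check_placeholders` is a hypothetical helper, and its pattern treats any `<word>` token as a placeholder, which is an assumption that may flag legitimate text in other files.

```shell
# Print every line (with its line number) of file $1 that still contains a
# <placeholder> token. Succeeds when none remain, fails otherwise.
check_placeholders() {
    if grep -n '<[A-Za-z_][A-Za-z0-9_-]*>' "$1"; then
        return 1
    fi
    return 0
}
```

For example, `check_placeholders launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` lists the lines that still need editing and returns a nonzero status until all of them are filled in.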

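After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job STATUS changes to Running, you can examine the logs by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the STATUS shown by kubectl get pods becomes Completed.

For more information about the k8s cluster configuration, see Running a training job on HyperPod k8s.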
