
# SageMaker HyperPod recipes
<a name="sagemaker-hyperpod-recipes"></a>

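Amazon SageMaker HyperPod recipes are pre-configured training stacks provided by AWS that help you quickly start training and fine-tuning publicly available foundation models (FMs) from model families such as Llama, Mistral, Mixtral, and DeepSeek. The recipes automate the end-to-end training loop, including loading datasets, applying distributed training techniques, and managing checkpoints for faster recovery from failures.

SageMaker HyperPod recipes are particularly useful for users who might not have deep machine learning expertise, because they abstract away much of the complexity involved in training large models.

You can run recipes within SageMaker HyperPod or as SageMaker training jobs.

The following tables are maintained in the SageMaker HyperPod GitHub repository and provide the most up-to-date information on the models supported for pre-training and fine-tuning, their respective recipes and launch scripts, and the supported instance types.
+ For the most current list of supported models, recipes, and launch scripts for pre-training, see the [pre-training table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#pre-training).
+ For the most current list of supported models, recipes, and launch scripts for fine-tuning, see the [fine-tuning table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#fine-tuning).

For SageMaker HyperPod users, the automation of the end-to-end training workflow comes from the integration of the training adapter with SageMaker HyperPod recipes. The training adapter is built on the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) and the [Neuronx Distributed Training package](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). If you're familiar with NeMo, the process of using the training adapter is the same. The training adapter runs the recipe on your cluster.

![\[Diagram showing the SageMaker HyperPod recipe workflow. A "Recipe" icon at the top feeds into a "HyperPod recipe launcher" box. This box connects to a larger section labeled "Cluster: Slurm, K8s, ..." that contains three GPU icons with associated recipe files. The bottom of the cluster section is labeled "Train with HyperPod Training Adapter".\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/sagemaker-hyperpod-recipes-overview.png)

You can also train your own model by defining your own custom recipe.

To get started with a tutorial, see [Tutorials](sagemaker-hyperpod-recipes-tutorials.md).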

**Topics**
+ [Tutorials](sagemaker-hyperpod-recipes-tutorials.md)
+ [Default configurations](default-configurations.md)
+ [Cluster-specific configurations](cluster-specific-configurations.md)
+ [Considerations](cluster-specific-configurations-special-considerations.md)
+ [Advanced settings](cluster-specific-configurations-advanced-settings.md)
+ [Appendix](appendix.md)

# Tutorials
<a name="sagemaker-hyperpod-recipes-tutorials"></a>

The following quick-start tutorials help you get started with using the recipes for training:
+ SageMaker HyperPod with Slurm orchestration
  + Pre-training
    + [HyperPod Slurm cluster pre-training tutorial (GPU)](hyperpod-gpu-slurm-pretrain-tutorial.md)
    + [Trainium Slurm cluster pre-training tutorial](hyperpod-trainium-slurm-cluster-pretrain-tutorial.md)
  + Fine-tuning
    + [HyperPod Slurm cluster PEFT-LoRA tutorial (GPU)](hyperpod-gpu-slurm-peft-lora-tutorial.md)
    + [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md)
+ SageMaker HyperPod with K8s orchestration
  + Pre-training
    + [Kubernetes cluster pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial.md)
    + [Trainium Kubernetes cluster pre-training tutorial](sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial.md)
+ SageMaker training jobs
  + Pre-training
    + [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)

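# HyperPod Slurm cluster pre-training tutorial (GPU)
<a name="hyperpod-gpu-slurm-pretrain-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a training job on a Llama 8-billion-parameter model.

**Prerequisites**  
Before you start setting up your environment to run the recipe, make sure that you have the following:
+ A HyperPod GPU Slurm cluster that is set up.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).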
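
As a quick pre-flight check on the data prerequisite, the following Python sketch flags file names that don't look like one of the supported formats. The extensions (`.json`, `.json.gz`, `.arrow`) and the helper names are assumptions made for illustration; the tutorial names only the formats, not the file extensions that the recipes expect.

```python
# Assumed file extensions for the supported formats (JSON, JSONGZ, ARROW).
SUPPORTED_SUFFIXES = (".json", ".json.gz", ".arrow")


def is_supported_data_file(filename):
    """Return True if the file name ends with one of the assumed suffixes."""
    return filename.lower().endswith(SUPPORTED_SUFFIXES)


def unsupported_files(filenames):
    """Return the file names that do not match any assumed suffix."""
    return [name for name in filenames if not is_supported_data_file(name)]


# Hypothetical directory listing used only to show the call.
print(unsupported_files(["train.json", "val.json.gz", "shard-0.arrow", "notes.csv"]))
```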

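## HyperPod GPU Slurm environment setup
<a name="hyperpod-gpu-slurm-environment-setup"></a>

To start a training job on a HyperPod GPU Slurm cluster, do the following:

1. SSH into the head node of your Slurm cluster.

1. After you log in, set up a virtual environment. Make sure that you're using Python 3.9 or later.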

   ```
   #set up a virtual environment
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
   git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
   cd sagemaker-hyperpod-recipes
   pip3 install -r requirements.txt
   ```

1. Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). To learn more about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

   ```
   REGION="<region>"
   IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
   aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
   enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
   mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
   ```

1. To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

   ```
   container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
   ```
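
The image URI in the Enroot step follows a fixed pattern that depends only on the Region. The following Python sketch composes the registry host and the image URI from the values shown in that step. The function and constant names are hypothetical, and the tag is the one used in this tutorial, which might not be the most recent SMP release.

```python
# Values copied from the shell snippet in the Enroot step.
SMP_ACCOUNT_ID = "658645717510"
SMP_REPOSITORY = "smdistributed-modelparallel"
SMP_TAG = "2.4.1-gpu-py311-cu121"


def smp_registry(region):
    """Return the ECR registry host that the `docker login` command targets."""
    return f"{SMP_ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com"


def smp_image_uri(region, tag=SMP_TAG):
    """Return the full image URI that `enroot import` pulls from."""
    return f"{smp_registry(region)}/{SMP_REPOSITORY}:{tag}"


print(smp_image_uri("us-west-2"))
```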

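## Launch the training job
<a name="hyperpod-gpu-slurm-launch-training-job"></a>

After you install the dependencies, start a training job from the `sagemaker-hyperpod-recipes/launcher_scripts` directory. You get the dependencies by cloning the [SageMaker HyperPod recipes repository](https://github.com/aws/sagemaker-hyperpod-recipes).

First, pick your training recipe from GitHub. The model name is specified as part of the recipe. The following example uses the `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` script to launch the Llama 8b pre-training recipe with sequence length 16384, `llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain`. Set the following in the script:
+ `IMAGE`: The container from the environment setup section.
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: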

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf_llama3_8b" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  container="${IMAGE}" \
  +cluster.container_mounts.0="/fsx:/fsx"
```
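
Every argument after `main.py` in the launcher script is a `key=value` override. The following Python sketch shows how such an argument list can be assembled from a mapping, which can help when you generate launcher commands programmatically. The helper name and the `/fsx/data/train` path are hypothetical; the keys are the ones used in the script above.

```python
import shlex


def build_launcher_command(main_py, overrides):
    """Build the argument list for main.py from a mapping of key=value overrides."""
    argv = ["python3", main_py]
    for key, value in overrides.items():
        argv.append(f"{key}={value}")
    return argv


overrides = {
    "recipes": "training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain",
    "recipes.run.name": "hf_llama3_8b",
    "recipes.model.data.train_dir": "/fsx/data/train",  # hypothetical path
    "container": "/fsx/path/to/your/smdistributed-modelparallel.sqsh",
    "+cluster.container_mounts.0": "/fsx:/fsx",
}
command = build_launcher_command("main.py", overrides)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```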

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

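# Trainium Slurm cluster pre-training tutorial
<a name="hyperpod-trainium-slurm-cluster-pretrain-tutorial"></a>

The following tutorial sets up a Trainium environment on a Slurm cluster and starts a training job on a Llama 8-billion-parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A SageMaker HyperPod Trainium Slurm cluster that is set up.
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).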

## Set up the Trainium environment on the Slurm cluster
<a name="hyperpod-trainium-slurm-cluster-pretrain-setup-trainium-environment"></a>

To start a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the Neuron environment. For information about setting up Neuron, see [Neuron setup steps](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html#setting-up-the-environment). We recommend relying on the Deep Learning AMIs that come with the Neuron drivers pre-installed, such as [Ubuntu 20 with DLAMI Pytorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html#setup-torch-neuronx-ubuntu20-dlami-pytorch).
+ Clone the SageMaker HyperPod recipes repository to a shared storage location in the cluster. The shared storage location can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.

  ```
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
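+ Complete the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#).
+ Prepare a model configuration. Model configurations are available in the Neuron repository. For the model configuration used in this tutorial, see the [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).

## Launch the training job in Trainium
<a name="hyperpod-trainium-slurm-cluster-pretrain-launch-training-job-trainium"></a>

To launch a training job in Trainium, specify a cluster configuration and a Neuron recipe. For example, to launch a llama3 8b pre-training job in Trainium, set the following in the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`:
+ `MODEL_CONFIG`: The model configuration from the environment setup section.
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: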

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash

#Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

COMPILE=0
TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
MODEL_CONFIG="${MODEL_CONFIG}" # Location of config.json for the model

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    instance_type="trn1.32xlarge" \
    recipes.run.compile="$COMPILE" \
    recipes.run.name="hf-llama3-8b" \
    recipes.trainer.num_nodes=4 \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.model.model_config="$MODEL_CONFIG"
```

To launch the training job, run the following command:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

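# HyperPod Slurm cluster DPO tutorial (GPU)
<a name="hyperpod-gpu-slurm-dpo-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a direct preference optimization (DPO) job on a Llama 8-billion-parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A HyperPod GPU Slurm cluster that is set up.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ A tokenized binary preference dataset in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model. You must get the token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).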

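## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-dpo-hyperpod-gpu-slurm-environment"></a>

To start a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up a virtual environment. Make sure that you're using Python 3.9 or later.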

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
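+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.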

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

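## Launch the training job
<a name="hyperpod-gpu-slurm-dpo-launch-training-job"></a>

To launch a DPO job for the Llama 8-billion-parameter model with a sequence length of 8192 on a single Slurm compute node, set the following in the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh`:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: The name or path of the pre-trained weights, passed to the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: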

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

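**Note**  
The reference model used for DPO in this setup is automatically derived from the base model being trained (no separate reference model is explicitly defined). The DPO-specific hyperparameters are pre-configured with the following default values:
+ `beta`: 0.1 (controls the strength of the KL divergence regularization)
+ `label_smoothing`: 0.0 (no smoothing is applied to the preference labels)

To change these defaults, set the following key-value pairs: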

```
recipes.dpo.beta=${BETA}
recipes.dpo.label_smoothing=${LABEL_SMOOTHING}
```
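
To make the role of `beta` and `label_smoothing` concrete, the following Python sketch computes the DPO loss for a single preference pair, using the commonly published formulation with the label-smoothed ("conservative") variant. This is an illustration under that assumption; the recipe's actual implementation isn't shown in this tutorial and might differ. The function names and the log-probability inputs are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more strongly the policy prefers the chosen response
    # than the reference model does.
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    # beta scales the margin; label_smoothing mixes in the loss for the
    # flipped preference to account for noisy labels.
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * margin)
            - label_smoothing * log_sigmoid(-beta * margin))
```

With a zero margin the loss equals `log(2)` regardless of `beta`. A larger `beta` makes the loss react more sharply to the same margin, which corresponds to a stronger pull toward or away from the reference model.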

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}"
BETA="${BETA:-0.1}" # Falls back to the preconfigured default
LABEL_SMOOTHING="${LABEL_SMOOTHING:-0.0}" # Falls back to the preconfigured default

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_dpo \
base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
recipes.run.name="hf_llama3_dpo" \
recipes.exp_manager.exp_dir="$EXP_DIR" \
recipes.model.data.train_dir="$TRAIN_DIR" \
recipes.model.data.val_dir="$VAL_DIR" \
recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
container="${IMAGE}" \
+cluster.container_mounts.0="/fsx:/fsx" \
recipes.model.hf_access_token="${HF_ACCESS_TOKEN}" \
recipes.dpo.enabled=true \
recipes.dpo.beta="${BETA}" \
recipes.dpo.label_smoothing="${LABEL_SMOOTHING}"
```

After you've configured all the required parameters in the preceding script, you can launch the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

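# HyperPod Slurm cluster PEFT-LoRA tutorial (GPU)
<a name="hyperpod-gpu-slurm-peft-lora-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a parameter-efficient fine-tuning (PEFT) job on a Llama 8-billion-parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A HyperPod GPU Slurm cluster that is set up.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model. You must get the token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).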

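## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-peft-lora-setup-hyperpod-gpu-slurm-environment"></a>

To start a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up a virtual environment. Make sure that you're using Python 3.9 or later.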

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
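+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.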

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

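## Launch the training job
<a name="hyperpod-gpu-slurm-peft-lora-launch-training-job"></a>

To launch a PEFT job for the Llama 8-billion-parameter model with a sequence length of 8192 on a single Slurm compute node, set the following in the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh`:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: The name or path of the pre-trained weights, passed to the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: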

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${YOUR_HF_MODEL_NAME_OR_PATH}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}"
```

After you've configured all the required parameters in the preceding script, you can launch the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

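# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job in a GPU Kubernetes cluster:
+ (Recommended) The [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A HyperPod GPU Kubernetes cluster that is properly set up.
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).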

## GPU Kubernetes environment setup
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up a virtual environment. Make sure that you're using Python 3.9 or later.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install the dependencies using one of the following methods:
  + (Recommended) The [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + The SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```

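## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` model.
+ `your_training_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: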

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
```
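
The `--override-parameters` value is a JSON object passed as a single shell argument, which is easy to get wrong when you edit it by hand. The following Python sketch builds the command line from a dictionary and handles the quoting. It uses only the flags shown in the example above; the helper name and the default claim value are assumptions taken from this tutorial's example.

```python
import json
import shlex


def start_job_command(recipe, overrides, pvc="fsx-claim:data"):
    """Return a shell-safe `hyperpod start-job` command line."""
    payload = json.dumps(overrides)  # one JSON object, as in the example above
    return " ".join([
        "hyperpod", "start-job",
        "--recipe", shlex.quote(recipe),
        "--persistent-volume-claims", shlex.quote(pvc),
        "--override-parameters", shlex.quote(payload),
    ])


overrides = {
    "recipes.run.name": "hf-llama3-8b",
    "cluster": "k8s",
    "cluster_type": "k8s",
}
print(start_job_command("training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain", overrides))
```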

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.
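
If you want to poll the job from a script, the following Python sketch extracts the `STATUS` column from the text that `kubectl get pods` prints. It's a minimal illustration with a hypothetical helper name; for anything beyond a quick check, parsing `kubectl get pods -o json` is more robust than splitting columns.

```python
def pod_statuses(kubectl_output):
    """Map pod name to STATUS from the table printed by `kubectl get pods`."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        # NAME and STATUS come before RESTARTS, so extra words in the
        # RESTARTS column do not shift the fields read here.
        statuses[fields[name_col]] = fields[status_col]
    return statuses


# Hypothetical output used only to show the call.
sample = """NAME                         READY   STATUS    RESTARTS   AGE
hf-llama3-alias-worker-0     0/1     Running   0          36s
"""
print(pod_statuses(sample))
```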

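## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. It mounts the Amazon FSx claim to the `/data` directory of each computing pod.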

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
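+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`
  + `IMAGE`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: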

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  #Users should setup their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info, including logs and checkpoints
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir=$TRAIN_DIR \
      recipes.model.data.val_dir=$VAL_DIR
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

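# Trainium Kubernetes cluster pre-training tutorial
<a name="sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial"></a>

You can use one of the following methods to start a training job in a Trainium Kubernetes cluster:
+ (Recommended) The [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A HyperPod Trainium Kubernetes cluster that is set up.
+ A shared storage location, which can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).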

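## Set up your Trainium Kubernetes environment
<a name="sagemaker-hyperpod-trainium-setup-trainium-kubernetes-environment"></a>

To set up the Trainium Kubernetes environment, do the following:

1. Complete the steps in the following tutorial, starting from **Download the dataset**: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#download-the-dataset).

1. Prepare a model configuration. Model configurations are available in the Neuron repository. For this tutorial, you can use the llama3 8b model config.

1. Set up a virtual environment. Make sure that you're using Python 3.9 or later.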

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Install the dependencies.
   + (Recommended) Use the following HyperPod command-line tool:

     ```
     # install HyperPod command line tools
     git clone https://github.com/aws/sagemaker-hyperpod-cli
     cd sagemaker-hyperpod-cli
     pip3 install .
     ```
   + If you're using SageMaker HyperPod recipes, run the following:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   hyperpod connect-cluster --cluster-name "${CLUSTER_NAME}" [--region "${CLUSTER_REGION}"] [--namespace <namespace>]
   ```

1. Container: Use the [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).

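## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-trainium-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq8k_trn1x4_pretrain` Trainium model.
+ `your_neuron_container`: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).
+ `your_model_config`: The model configuration from the environment setup section.
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: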

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
 "cluster": "k8s",
 "cluster_type": "k8s",
 "container": "<your_neuron_container>",
 "recipes.run.name": "hf-llama3",
 "recipes.run.compile": 0,
 "recipes.model.model_config": "<your_model_config>",
 "instance_type": "trn1.32xlarge",
 "recipes.data.train_dir": "<your_train_data_dir>"
}'
```

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.

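## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-launch-training-job-recipes"></a>

Alternatively, use SageMaker HyperPod recipes to submit your training job. To submit the training job using a recipe, update `k8s.yaml` and `config.yaml`, and then run the bash script for the model to launch it.
+ In `k8s.yaml`, update `persistent_volume_claims` to mount the Amazon FSx claim to the `/data` directory in the compute nodes.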

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
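+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`
  + `your_neuron_container`: The container from the environment setup section.
  + `your_model_config`: The model configuration from the environment setup section.

  (Optional) If you need pre-trained weights from HuggingFace, you can provide the HuggingFace token by setting the following key-value pair: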

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

  ```
  #!/bin/bash
  #Users should set up their cluster type in /recipes_collection/config.yaml
  IMAGE="<your_neuron_container>"
  MODEL_CONFIG="<your_model_config>"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3-8b" \
    instance_type=trn1.32xlarge \
    recipes.model.model_config="$MODEL_CONFIG" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.data.train_dir=$TRAIN_DIR \
    recipes.data.val_dir=$VAL_DIR
  ```
+ Launch the job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
  ```

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

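# SageMaker training jobs pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through setting up and running a pre-training job using SageMaker training jobs with GPU instances. It covers the following:
+ Setting up your environment
+ Launching a training job using SageMaker HyperPod recipes

**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A service quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to AWS services.
  1. Choose **Amazon SageMaker AI**.
  1. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies, which give SageMaker AI permissions to run the examples:
  + AmazonSageMakerFullAccess
  + AmazonEC2FullAccess
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).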

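## GPU SageMaker training jobs environment setup
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, configure your AWS credentials and preferred Region by running the `aws configure` command. As an alternative to the configure command, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure that you're using Python 3.9 or later.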

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate

  # clone the recipes repository, then install its dependencies
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt

  # set the AWS Region
  aws configure set region <your_region>
  ```
+ Install the SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
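+ `Container`: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or later.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK directly. For example, append it if you're launching the job from a notebook in SageMaker AI JupyterLab.

  If you're using HyperPod recipes to launch the job with cluster type `sm_jobs`, this is done automatically.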
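
The version requirement in the note can be checked, and the pin appended, with a few lines of Python. The following is a minimal sketch with hypothetical helper names. It assumes plain `X.Y.Z` release strings and uses a simple prefix match to detect an existing `transformers` entry, so it isn't a full requirements parser.

```python
MINIMUM_TRANSFORMERS = (4, 45, 2)


def parse_version(version):
    """Parse a plain 'X.Y.Z' release string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(version, minimum=MINIMUM_TRANSFORMERS):
    """Return True if the version is at least the required minimum."""
    return parse_version(version) >= minimum


def with_transformers_pin(requirements_text, pin="transformers==4.45.2"):
    """Return the requirements text with the pin appended, unless a
    line starting with 'transformers' is already present."""
    lines = requirements_text.splitlines()
    if any(line.strip().lower().startswith("transformers") for line in lines):
        return requirements_text
    return "\n".join(lines + [pin]) + "\n"


print(with_transformers_pin("torch\nnumpy\n"))
```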

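## Launch the training job using a Jupyter notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI training platform.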

```
import os
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket() 
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

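The preceding code creates a PyTorch estimator object with the training recipe and then trains the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe that you want to use for training.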

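**Note**  
If you're running a Llama 3.2 multimodal training job, the `transformers` version must be `4.45.2` or later.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you use the SageMaker AI Python SDK directly. For example, you must append the version to the text file when you use a Jupyter notebook.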

部署 SageMaker 訓練任務的端點時，您必須指定正在使用的映像 URI。如果未提供映像 URI，估算器會使用訓練映像做為用於部署的映像。SageMaker HyperPod 提供的訓練映像不包含推論和部署所需的相依性。以下是如何使用推論映像進行部署的範例：

```
from sagemaker import image_uris
container=image_uris.retrieve(framework='pytorch',region='us-west-2',version='2.0',py_version='py310',image_scope='inference', instance_type='ml.p4d.24xlarge')
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.p4d.24xlarge',image_uri=container)
```

**注意**  
Running the preceding code on a SageMaker notebook instance might require more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run out of space, create a new notebook instance with increased storage.

## 使用配方啟動器啟動訓練任務
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file so that it looks like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```

更新 `./recipes_collection/config.yaml` 以在 `cluster` 和 `cluster_type` 中指定 `sm_jobs`。

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```

使用以下命令啟動任務。

```
python3 main.py --config-path recipes_collection --config-name config
```

如需設定 SageMaker 訓練任務的詳細資訊，請參閱「在 SageMaker 訓練任務上執行訓練任務」。

# Trainium SageMaker 訓練任務預先訓練教學課程
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
+ 設定您的環境
+ 啟動訓練任務

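**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ An Amazon FSx file system or S3 bucket where you can load the data and output the training artifacts.
+ A service quota for the `ml.trn1.32xlarge` instance on Amazon SageMaker AI. To request a service quota increase, do the following:

  1. Navigate to the AWS Service Quotas console.

  1. Choose AWS services.

  1. Select JupyterLab.

  1. Specify one instance for `ml.trn1.32xlarge`.
+ An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess` and `AmazonEC2FullAccess` managed policies. These policies give Amazon SageMaker AI the permissions to run the examples.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace, or if you're training a Llama 3.2 model, you must get a HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).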

## 為 Trainium SageMaker 訓練任務設定您的環境
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-environment-setup"></a>

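Before you run a SageMaker training job, configure your AWS credentials and preferred Region using the `aws configure` command. Alternatively, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).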

我們強烈建議在 SageMaker AI JupyterLab 中使用 SageMaker AI Jupyter 筆記本，來啟動 SageMaker 訓練任務。如需詳細資訊，請參閱[SageMaker JupyterLab](studio-updated-jl.md)。
+ (Optional) If you use a Jupyter notebook in Amazon SageMaker Studio, you can skip the following commands. Make sure that you use Python 3.9 or later.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ 安裝 SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
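+ If you're running a Llama 3.2 multimodal training job, the `transformers` version must be `4.45.2` or later.
  + Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you use the SageMaker AI Python SDK.
  + If you launch HyperPod recipes with `sm_jobs` as the cluster type, you don't need to specify the `transformers` version.
+ `Container`: The SageMaker AI Python SDK sets the Neuron container automatically.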

## 使用 Jupyter 筆記本啟動訓練任務
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-notebook"></a>

您可以使用下列 Python 程式碼，以使用您的配方執行 SageMaker 訓練任務。它利用來自 [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) 的 PyTorch 估算器來提交配方。下列範例會將 llama3-8b 配方啟動為 SageMaker AI 訓練任務。
+ `compiler_cache_url`：用來儲存編譯成品的快取，例如 Amazon S3 成品。

```
import os
import sagemaker,boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 locations for the TensorBoard logs and the training artifacts.
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>"
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

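The preceding code creates a PyTorch estimator object with the training recipe and then trains the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe that you want to use for training.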

## 使用配方啟動器啟動訓練任務
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-recipes"></a>
+ 更新 `./recipes_collection/cluster/sm_jobs.yaml`
  + `compiler_cache_url`: The URL used to store the artifacts. It can be an Amazon S3 URL.

  ```
  sm_jobs_config:
    output_path: <s3_output_path>
    tensorboard_config:
      output_path: <s3_output_path>
      container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
    wait: True  # Whether to wait for training job to finish
    inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
      s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
        train: <s3_train_data_path>
        val: null
    additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
      max_run: 180000
      image_uri: <your_image_uri>
      enable_remote_debug: True
      py_version: py39
    recipe_overrides:
      model:
        exp_manager:
          exp_dir: <exp_dir>
        data:
          train_dir: /opt/ml/input/data/train
          val_dir: /opt/ml/input/data/val
  ```
+ 更新 `./recipes_collection/config.yaml`

  ```
  defaults:
    - _self_
    - cluster: sm_jobs
    - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
  cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
  
  instance_type: ml.trn1.32xlarge
  base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
  ```
+ 使用 `main.py` 啟動任務

  ```
  python3 main.py --config-path recipes_collection --config-name config
  ```

如需設定 SageMaker 訓練任務的詳細資訊，請參閱[SageMaker 訓練任務預先訓練教學課程 (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)。

# 預設組態
<a name="default-configurations"></a>

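This section outlines the essential components and settings required to launch and customize your large language model (LLM) training processes using SageMaker HyperPod. It covers the key repositories, configuration files, and recipe structures that form the foundation of your training jobs. Understanding these default configurations is crucial for setting up and managing your LLM training workflows, whether you use the pre-defined recipes or customize them to suit your specific needs.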

**Topics**
+ [GitHub repositories](github-repositories.md)
+ [General configuration](sagemaker-hyperpod-recipes-general-configuration.md)

# GitHub 儲存庫
<a name="github-repositories"></a>

若要啟動訓練任務，您可以使用來自兩個不同 GitHub 儲存庫的檔案：
+ [SageMaker HyperPod 配方](https://github.com/aws/sagemaker-hyperpod-recipes)
+ [適用於 NeMo 的 SageMaker HyperPod 訓練轉接器](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo)

這些儲存庫包含啟動、管理和自訂大型語言模型 (LLM) 訓練程序的必要元件。您可以使用儲存庫中的指令碼，來設定和執行 LLM 的訓練任務。

## HyperPod 配方儲存庫
<a name="sagemaker-hyperpod-recipe-repository"></a>

使用 [SageMaker HyperPod 配方](https://github.com/aws/sagemaker-hyperpod-recipes)儲存庫來取得配方。

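1. `main.py`: This file serves as the primary entry point for submitting a training job to either a cluster or a SageMaker training job.

1. `launcher_scripts`: This directory contains a collection of commonly used scripts designed to facilitate the training process for various large language models (LLMs).

1. `recipes_collection`: This folder contains a collection of pre-defined LLM recipes provided by the developers. You can use these recipes together with your custom data to train LLMs tailored to your specific requirements.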

您可以使用 SageMaker HyperPod 配方來啟動訓練或微調任務。無論您使用的叢集為何，提交任務的程序都相同。例如，您可以使用相同的指令碼，將任務提交至 Slurm 或 Kubernetes 叢集。啟動器會根據三個組態檔案來分派訓練任務：

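1. General configuration (`config.yaml`): Includes common settings such as the default parameters or environment variables used in the training job.

1. Cluster configuration (`cluster`): For training jobs that use clusters only. If you're submitting a training job to a Kubernetes cluster, you might need to specify information such as the volume, label, or restart policy. For Slurm clusters, you might need to specify the Slurm job name. All the parameters relate to the specific cluster that you're using.

1. Recipe (`recipes`): Recipes contain the settings for your training job, such as the model type, sharding degree, or dataset paths. For example, you can specify Llama as your training model and train it using model or data parallelism techniques such as Fully Sharded Data Parallel (FSDP) across eight machines. You can also specify different checkpoint frequencies or paths for your training job.

After you've specified a recipe, you run the launcher script to launch an end-to-end training job on the cluster, based on the configurations, through the `main.py` entry point. For each recipe that you use, there are accompanying shell scripts in the `launcher_scripts` folder. These examples guide you through submitting and launching training jobs. The following figure illustrates how the SageMaker HyperPod recipe launcher submits a training job to a cluster based on the preceding configurations. Currently, the SageMaker HyperPod recipe launcher is built on top of the NVIDIA NeMo Framework Launcher. For more information, see the [NeMo Launcher Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).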

![\[說明 HyperPod 配方啟動器工作流程的圖表。左側的虛線方塊內有三個檔案圖示，分別標示為 "Recipe"、"config.yaml" 和 "slurm.yaml or k8s.yaml or sm_job.yaml (Cluster config)"。箭頭從這個方塊指向標示為 "HyperPod recipe Launcher" 的中央方塊。從這個中央方塊中，另一個箭頭指向 "Training Job"，其中 "main.py" 寫在箭頭上方。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/sagemaker-hyperpod-recipe-launcher.png)
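
To make the three-file dispatch concrete, the following sketch assembles a launcher command line in Python. It assumes that `main.py` accepts Hydra-style `key=value` overrides on the command line (the `defaults` list in `config.yaml` suggests a Hydra configuration, but verify this against the repository). The helper name `build_launcher_command` is hypothetical.

```python
import shlex


def build_launcher_command(cluster, recipe, extra_overrides=None):
    """Return the argv list for launching main.py with Hydra-style overrides."""
    argv = [
        "python3", "main.py",
        "--config-path", "recipes_collection",
        "--config-name", "config",
        f"cluster={cluster}",        # selects the cluster config, for example slurm
        f"cluster_type={cluster}",   # must match the cluster selection
        f"recipes={recipe}",         # selects the recipe
    ]
    # Assumption: any other top-level key in config.yaml can be overridden the same way.
    for key, value in (extra_overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


cmd = build_launcher_command(
    "slurm",
    "training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    {"instance_type": "p5.48xlarge"},
)
print(shlex.join(cmd))
```

Building the command as a list keeps each override a separate argument, which avoids shell quoting problems if you later pass it to `subprocess.run`.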


## HyperPod 配方轉接器儲存庫
<a name="hyperpod-recipe-adapter"></a>

SageMaker HyperPod 訓練轉接器是一種訓練架構。您可以使用它來管理訓練任務的整個生命週期。使用轉接器將您模型的預先訓練或微調分散到多部電腦。轉接器使用不同的平行化技術來分散訓練。它也會處理儲存檢查點的實作和管理。如需詳細資訊，請參閱[進階設定](cluster-specific-configurations-advanced-settings.md)。

使用 [SageMaker HyperPod 配方轉接器儲存庫](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo)來使用配方轉接器。

1. `src`：此目錄包含大規模語言模型 (LLM) 訓練的實作，涵蓋模型平行化、混合精準度訓練和檢查點管理等各種功能。

1. `examples`：此資料夾提供範例集合，示範如何建立用於訓練 LLM 模型的進入點，做為使用者的實際指南。

# 一般組態
<a name="sagemaker-hyperpod-recipes-general-configuration"></a>

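The `config.yaml` file specifies the training recipe and the cluster. It also includes runtime configurations such as environment variables for the training job.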

```
defaults:
  - _self_
  - cluster: slurm 
  - recipes: training/llama/hf_llama3_8b_seq8192_gpu
instance_type: p5.48xlarge
git:
  repo_url_or_path: null
  branch: null
  commit: null
  entry_script: null
  token: null
env_vars:
  NCCL_DEBUG: WARN
```

您可以修改 `config.yaml` 中的下列參數：

1. `defaults`：指定您的預設設定，例如預設叢集或預設配方。

1. `instance_type`：修改 Amazon EC2 執行個體類型，以符合您正在使用的執行個體類型。

1. `git`：指定訓練任務的 SageMaker HyperPod 配方轉接器儲存庫位置。

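1. `env_vars`: You can specify the environment variables to pass to your runtime training job. For example, you can adjust the logging level of NCCL by specifying the `NCCL_DEBUG` environment variable.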

配方是定義訓練任務架構的核心組態。此檔案包含許多對您訓練任務至關重要的資訊，例如：
+ 是否使用模型平行化
+ 資料集的來源
+ 混合精確度訓練
+ 檢查點相關組態

您可以依原狀使用配方。您也可以使用下列資訊來修改它們。

## run
<a name="run"></a>

以下是用於執行訓練任務的基本執行資訊。

```
run:
  name: llama-8b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  model_type: hf
```

1. `name`：在組態檔案中指定訓練任務的名稱。

1. `results_dir`：您可以指定訓練任務結果存放所在的目錄。

1. `time_limit`：您可以為訓練任務設定訓練時間上限，以防止其佔用硬體資源太長時間。

1. `model_type`：您可以指定正在使用的模型類型。例如，如果您的模型來自 HuggingFace，您可以指定 `hf`。
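
The two less obvious values in the `run` section are the interpolated `results_dir` and the `time_limit` string. The sketch below shows one way to reason about them. It assumes that `${.name}` refers to the sibling `name` key (the OmegaConf relative-interpolation convention) and that `time_limit` uses the Slurm-style `D-HH:MM:SS` format. Both helper functions are illustrative, and the base directory in the comment is a hypothetical path.

```python
from datetime import timedelta


def parse_time_limit(value):
    """Parse a 'D-HH:MM:SS' string into a timedelta (only this form is handled)."""
    days, _, clock = value.partition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


def resolve_results_dir(base_results_dir, run):
    """Mimic how ${base_results_dir}/${.name} expands for the run section."""
    return (
        run["results_dir"]
        .replace("${base_results_dir}", base_results_dir)
        .replace("${.name}", run["name"])
    )


run = {
    "name": "llama-8b",
    "results_dir": "${base_results_dir}/${.name}",
    "time_limit": "6-00:00:00",
}

limit = parse_time_limit(run["time_limit"])               # six days
results_dir = resolve_results_dir("/fsx/results", run)    # /fsx/results/llama-8b
```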

## exp_manager
<a name="exp-manager"></a>

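`exp_manager` configures the experiment. With `exp_manager`, you can specify fields such as the output directory or checkpoint settings. The following is an example of how you can configure `exp_manager`.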

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

1. `exp_dir`：實驗目錄包含訓練任務的標準輸出和標準錯誤檔案。根據預設，它會使用您的目前目錄。

1. `name`: The experiment name used to identify your experiment under `exp_dir`.

1. `create_tensorboard_logger`：指定 `True` 或 `False` 以啟用或停用 TensorBoard 記錄器。

## 檢查點
<a name="checkpointing"></a>

以下是我們支援的三種檢查點類型：
+ 自動檢查點
+ 手動檢查點
+ 完整檢查點

### 自動檢查點
<a name="auto-checkpointing"></a>

如果您要儲存或載入由 SageMaker HyperPod 配方轉接器自動管理的檢查點，您可以啟用 `auto_checkpoint`。若要啟用 `auto_checkpoint`，請將 `enabled` 設定為 `True`。您可以使用自動檢查點進行訓練和微調。您可以針對共用檔案系統和 Amazon S3 使用自動檢查點。

```
exp_manager:
  checkpoint_dir: ${recipes.exp_manager.exp_dir}/checkpoints/
  auto_checkpoint:
    enabled: True
```

Auto checkpointing saves the `local_state_dict` asynchronously, using an automatically computed optimal saving interval.

**注意**  
在此檢查點模式下，自動儲存的檢查點不支援在訓練執行之間重新碎片。若要從最新自動儲存的檢查點繼續，您必須保留相同的碎片度。您不需要指定額外資訊即可自動繼續。

### 手動檢查點
<a name="manual-checkpointing"></a>

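You can modify `checkpoint_callback_params` to asynchronously save an intermediate checkpoint in `shared_state_dict`. For example, you can specify the following configuration to enable sharded checkpointing every 10 steps and keep the latest 3 checkpoints.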

碎片檢查點可讓您在訓練執行之間變更碎片度，並透過設定 `resume_from_checkpoint` 載入檢查點。

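**Note**  
+ If you're doing PEFT fine-tuning, sharded checkpointing doesn't support Amazon S3.
+ Auto and manual checkpointing are mutually exclusive.
+ Only FSDP shard degree and replication degree changes are allowed.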

```
exp_manager:
  checkpoint_callback_params:
    # Set save_top_k = 0 to disable sharded checkpointing
    save_top_k: 3
    every_n_train_steps: 10
    monitor: "step"
    mode: "max"
    save_last: False
  resume_from_checkpoint: ${recipes.exp_manager.exp_dir}/checkpoints/
```
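
To see which checkpoints remain on disk with these settings, you can simulate the retention rule. The following is a minimal sketch, assuming the callback behaves like a PyTorch Lightning `ModelCheckpoint` with `monitor: "step"` and `mode: "max"`, so the checkpoints with the highest step numbers are kept. The function name is hypothetical.

```python
def retained_checkpoints(total_steps, every_n_train_steps, save_top_k):
    """Return the training steps whose checkpoints remain after training."""
    if save_top_k == 0 or every_n_train_steps <= 0:
        return []  # save_top_k = 0 disables sharded checkpointing
    saved = list(range(every_n_train_steps, total_steps + 1, every_n_train_steps))
    if save_top_k < 0:
        return saved  # Lightning convention: -1 keeps every checkpoint
    # Ranking by step with mode "max" means the newest checkpoints win.
    return saved[-save_top_k:]


kept = retained_checkpoints(total_steps=55, every_n_train_steps=10, save_top_k=3)
print(kept)  # [30, 40, 50]
```

With `every_n_train_steps: 10` and `save_top_k: 3`, a 55-step run saves at steps 10 through 50 and keeps only the last three.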

若要進一步了解檢查點，請參閱[使用 SMP 進行檢查點](model-parallel-core-features-v2-checkpoints.md)。

### 完整檢查點
<a name="full-checkpointing"></a>

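The exported `full_state_dict` checkpoint can be used for inference or fine-tuning. You can load a full checkpoint through `hf_model_name_or_path`. In this mode, only the model weights are saved.

To export the `full_state_dict` model, you can set the following parameters.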

**注意**  
目前，Amazon S3 檢查點不支援完整檢查點。如果您啟用完整檢查點，則無法設定 `exp_manager.checkpoint_dir` 的 S3 路徑。不過，您可以將 `exp_manager.export_full_model.final_export_dir` 設定為本機檔案系統上的特定目錄，同時將 `exp_manager.checkpoint_dir` 設定為 Amazon S3 路徑。

```
exp_manager:
  export_full_model:
    # Set every_n_train_steps = 0 to disable full checkpointing
    every_n_train_steps: 0
    save_last: True
    final_export_dir: null
```

## 模型
<a name="model"></a>

定義模型架構和訓練程序的各個層面。這包括模型平行化、精確度和資料處理的設定。以下是您可以在模型區段內設定的關鍵元件：

### 模型平行化
<a name="model-parallelism"></a>

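After you've specified a recipe, you define the model that you're training. You can also define the model parallelism. For example, you can define `tensor_model_parallel_degree`. You can enable other features, such as training with FP8 precision. For example, you can train a model with tensor parallelism and context parallelism: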

```
model:
  model_type: llama_v3
  # Base configs
  train_batch_size: 4
  val_batch_size: 1
  seed: 12345
  grad_clip: 1.0

  # Model parallelism
  tensor_model_parallel_degree: 4
  expert_model_parallel_degree: 1
  context_parallel_degree: 2
```
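
A quick arithmetic check helps confirm that the parallelism degrees fit your cluster. The sketch below assumes the common convention that the total number of accelerators factors into data-parallel, tensor-parallel, and context-parallel groups. Whether the adapter partitions ranks in exactly this way is an assumption, so treat the result as a sanity check rather than a statement of the library's behavior.

```python
def data_parallel_degree(num_nodes, gpus_per_node,
                         tensor_parallel_degree, context_parallel_degree):
    """Return the data-parallel degree left after tensor and context parallelism."""
    world_size = num_nodes * gpus_per_node
    model_parallel_size = tensor_parallel_degree * context_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor x context = {model_parallel_size}"
        )
    return world_size // model_parallel_size


# Two ml.p5.48xlarge nodes (8 GPUs each) with the degrees from the example above.
dp = data_parallel_degree(num_nodes=2, gpus_per_node=8,
                          tensor_parallel_degree=4, context_parallel_degree=2)
print(dp)  # 16 // (4 * 2) = 2
```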

若要更好地了解不同類型的模型平行化技術，您可以參考下列方法：

1. [張量平行化](model-parallel-core-features-v2-tensor-parallelism.md)

1. [專家平行化](model-parallel-core-features-v2-expert-parallelism.md)

1. [內容平行化](model-parallel-core-features-v2-context-parallelism.md)

1. [混合碎片資料平行化](model-parallel-core-features-v2-sharded-data-parallelism.md)

### FP8
<a name="fp8"></a>

To enable FP8 (8-bit floating-point precision), specify the FP8-related configuration as shown in the following example:

```
model:
  # FP8 config
  fp8: True
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
```

請務必注意，目前僅對 P5 執行個體類型支援 FP8 資料格式。如果您使用的是較舊的執行個體類型，例如 P4，請對您的模型訓練程序停用 FP8 功能。如需 FP8 的詳細資訊，請參閱[混合精確度訓練](model-parallel-core-features-v2-mixed-precision.md)。

### data
<a name="data"></a>

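You can specify your own custom dataset for your training job by adding the data paths under `data`. The data module in our system supports the following data formats: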

1. JSON

1. JSONGZ (壓縮 JSON)

1. ARROW

不過，您負責準備自己的預先記號化資料集。如果您是具有特定要求的進階使用者，也可以選擇實作和整合自訂的資料模組。如需 HuggingFace 資料集的詳細資訊，請參閱[資料集](https://huggingface.co/docs/datasets/v3.1.0/en/index)。

```
model:
  data:
    train_dir: /path/to/your/train/data
    val_dir: /path/to/your/val/data
    dataset_type: hf
    use_synthetic_data: False
```

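You can specify how you train your model. By default, the recipe uses pre-training instead of fine-tuning. The following example configures the recipe to run a fine-tuning job with LoRA (Low-Rank Adaptation).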

```
model:
  # Fine tuning config
  do_finetune: True
  # The path to resume from, needs to be HF compatible
  hf_model_name_or_path: null
  hf_access_token: null
  # PEFT config
  peft:
    peft_type: lora
    rank: 32
    alpha: 16
    dropout: 0.1
```
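
The `rank` and `alpha` values determine how large and how strongly weighted the LoRA update is. The following sketch works through the standard LoRA arithmetic, where the update is scaled by `alpha / rank` and each adapted weight matrix gains `rank * (d_in + d_out)` trainable parameters. It assumes that the adapter follows this standard formulation, and the 4096 x 4096 projection size is only an illustrative value.

```python
def lora_scaling(alpha, rank):
    """Scaling factor applied to the low-rank update (standard LoRA: alpha / rank)."""
    return alpha / rank


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


scaling = lora_scaling(alpha=16, rank=32)               # 0.5
added = lora_trainable_params(4096, 4096, rank=32)      # 262144
fraction = added / (4096 * 4096)                        # share of the full matrix

print(scaling, added, f"{fraction:.4%}")
```

For this illustrative matrix, the LoRA factors hold about 1.56% of the parameters of the full weight, which is why LoRA fine-tuning needs far less optimizer memory.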

如需配方的相關資訊，請參閱 [SageMaker HyperPod 配方](https://github.com/aws/sagemaker-hyperpod-recipes)。

# 叢集特定的組態
<a name="cluster-specific-configurations"></a>

SageMaker HyperPod 提供跨不同叢集環境執行訓練任務的彈性。每個環境都有自己的組態要求和設定程序。本節概述在 SageMaker HyperPod Slurm、SageMaker HyperPod k8s 和 SageMaker 訓練任務中執行訓練任務所需的步驟和組態。了解這些組態對於有效利用所選環境中分散式訓練的強大能力至關重要。

您可以在下列叢集環境中使用配方：
+ SageMaker HyperPod Slurm 協同運作
+ SageMaker HyperPod Amazon Elastic Kubernetes Service 協同運作
+ SageMaker 訓練任務

若要在叢集中啟動訓練任務，請設定並安裝對應的叢集組態和環境。

**Topics**
+ [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md)
+ [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md)
+ [Running a SageMaker training job](cluster-specific-configurations-run-sagemaker-training-job.md)

# 在 HyperPod Slurm 上執行訓練任務
<a name="cluster-specific-configurations-run-training-job-hyperpod-slurm"></a>

SageMaker HyperPod 配方支援將訓練任務提交至 GPU/Trainium slurm 叢集。提交訓練任務之前，請更新叢集組態。使用下列其中一種方法來更新叢集組態：
+ 修改 `slurm.yaml`
+ 透過命令列將其覆寫

更新了叢集組態後，請安裝環境。

## 設定叢集
<a name="cluster-specific-configurations-configure-cluster-slurm-yaml"></a>

若要將訓練任務提交至 Slurm 叢集，請指定 Slurm 特定的組態。修改 `slurm.yaml` 以設定 Slurm 叢集。下列是 Slurm 叢集組態的範例。您可以針對自己的訓練需求修改此檔案：

```
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False 
stderr_to_stdout: True
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia" 
  post_launch_commands: 
container_mounts: 
  - "/fsx:/fsx"
```

1. `job_name_prefix`：指定任務名稱字首，以輕鬆識別您提交至 Slurm 叢集的任務。

1. `slurm_create_submission_file_only`：將此組態設定為 True 以進行試轉，以協助您偵錯。

1. `stderr_to_stdout`：指定是否將標準錯誤 (stderr) 重新導向至標準輸出 (stdout)。

1. `srun_args`：自訂其他 Srun 組態，例如排除特定運算節點。如需詳細資訊，請參閱 Srun 文件。

1. `slurm_docker_cfg`：SageMaker HyperPod 配方啟動器會啟動 Docker 容器來執行您的訓練任務。您可以在此參數內指定其他 Docker 引數。

1. `container_mounts`：指定您要掛載至配方啟動器容器的磁碟區，讓您的訓練任務存取這些磁碟區中的檔案。

# 在 HyperPod k8s 上執行訓練任務
<a name="cluster-specific-configurations-run-training-job-hyperpod-k8s"></a>

SageMaker HyperPod 配方支援將訓練任務提交至 GPU/Trainium Kubernetes 叢集。在您提交訓練任務之前，請執行下列其中一個動作：
+ 修改 `k8s.yaml` 叢集組態檔案
+ 透過命令列覆寫叢集組態

完成上述任一步驟後，請安裝對應環境。

## 使用 `k8s.yaml` 設定叢集
<a name="cluster-specific-configurations-configure-cluster-k8s-yaml"></a>

若要將訓練任務提交至 Kubernetes 叢集，您可以指定 Kubernetes 特定的組態。這些組態包括叢集命名空間或持久性磁碟區的位置。

```
pullPolicy: Always
restartPolicy: Never
namespace: default
persistent_volume_claims:
  - null
```

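1. `pullPolicy`: You can specify the pull policy when you submit a training job. If you specify `Always`, the Kubernetes cluster always pulls your image from the repository. For more information, see [Image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy).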

1. `restartPolicy`：指定是否在訓練任務失敗時將其重新啟動。

1. `namespace`：您可以指定要在其中提交訓練任務的 Kubernetes 命名空間。

1. `persistent_volume_claims`：您可以為訓練任務指定共用磁碟區，讓所有訓練程序存取磁碟區中的檔案。
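
Before you submit a job, it can help to check the `k8s.yaml` values against what Kubernetes accepts. The sketch below validates a configuration that is already loaded into a dictionary. The allowed values are the standard Kubernetes image pull and Pod restart policies. Whether the launcher accepts every one of them is an assumption, and the function name is hypothetical.

```python
VALID_PULL_POLICIES = {"Always", "IfNotPresent", "Never"}
VALID_RESTART_POLICIES = {"Always", "OnFailure", "Never"}


def validate_k8s_config(config):
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    errors = []
    if config.get("pullPolicy") not in VALID_PULL_POLICIES:
        errors.append(f"pullPolicy must be one of {sorted(VALID_PULL_POLICIES)}")
    if config.get("restartPolicy") not in VALID_RESTART_POLICIES:
        errors.append(f"restartPolicy must be one of {sorted(VALID_RESTART_POLICIES)}")
    if not config.get("namespace"):
        errors.append("namespace must be a non-empty string")
    return errors


# The default values shown in the k8s.yaml example above.
default_config = {
    "pullPolicy": "Always",
    "restartPolicy": "Never",
    "namespace": "default",
    "persistent_volume_claims": [None],
}
problems = validate_k8s_config(default_config)
print(problems)  # []
```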

# 執行 SageMaker 訓練任務
<a name="cluster-specific-configurations-run-sagemaker-training-job"></a>

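SageMaker HyperPod recipes support submitting SageMaker training jobs. Before you submit the training job, you must update the cluster configuration, `sm_jobs.yaml`, and install the corresponding environment.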

## 使用您的配方做為 SageMaker 訓練任務
<a name="cluster-specific-configurations-cluster-config-sm-job-yaml"></a>

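If you aren't hosting a cluster, you can use your recipe as a SageMaker training job. You must modify the SageMaker training job configuration file, `sm_jobs.yaml`, to run your recipe.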

```
sm_jobs_config:
  output_path: null 
  tensorboard_config:
    output_path: null 
    container_logs_path: null
  wait: True 
  inputs: 
    s3: 
      train: null
      val: null
    file_system:  
      directory_path: null
  additional_estimator_kwargs: 
    max_run: 1800
```

1. `output_path`：您可以指定將模型儲存至 Amazon S3 URL 的位置。

1. `tensorboard_config`：您可以指定 TensorBoard 相關組態，例如輸出路徑或 TensorBoard 日誌路徑。

1. `wait`：您可以指定在提交訓練任務時是否要等待任務完成。

1. `inputs`：您可以指定訓練和驗證資料的路徑。資料來源可以來自共用檔案系統，例如 Amazon FSx 或 Amazon S3 URL。

1. `additional_estimator_kwargs`：用於將訓練任務提交至 SageMaker 訓練任務平台的其他估算器引數。如需詳細資訊，請參閱[演算法估算器](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html)。
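
The configuration comments state that `inputs` should use either `s3` or `file_system`, not both. The following sketch enforces that rule on a configuration dictionary before it is passed on. The function name `select_inputs` is hypothetical, and the dictionary layout mirrors the YAML example above.

```python
def select_inputs(sm_jobs_config):
    """Return ('s3' | 'file_system', values) and reject ambiguous or empty inputs."""
    inputs = sm_jobs_config.get("inputs") or {}
    # Drop null entries so that placeholders such as `val: null` don't count as set.
    s3 = {k: v for k, v in (inputs.get("s3") or {}).items() if v is not None}
    fs = {k: v for k, v in (inputs.get("file_system") or {}).items() if v is not None}
    if s3 and fs:
        raise ValueError("Set either inputs.s3 or inputs.file_system, not both.")
    if not s3 and not fs:
        raise ValueError("No training inputs are configured.")
    return ("s3", s3) if s3 else ("file_system", fs)


config = {
    "inputs": {
        "s3": {"train": "s3://amzn-s3-demo-bucket/train", "val": None},
        "file_system": {"directory_path": None},
    }
}
source, values = select_inputs(config)
print(source, values)
```

The bucket name in the example is a placeholder; substitute your own S3 URI.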

# 考量事項
<a name="cluster-specific-configurations-special-considerations"></a>

當您使用 Amazon SageMaker HyperPod 配方時，有一些因素可能會影響模型訓練的程序。
+ `transformers` 版本對於 Llama 3.2 必須為 `4.45.2` 或更新版本。如果您使用的是 Slurm 或 K8s 工作流程，則會自動更新版本。
+ Mixtral 不支援 8 位元浮點精確度 (FP8)
+ Amazon EC2 p4 執行個體不支援 FP8

# 進階設定
<a name="cluster-specific-configurations-advanced-settings"></a>

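The SageMaker HyperPod recipe adapter is built on top of the NVIDIA NeMo and PyTorch Lightning frameworks. If you've already used these frameworks, integrating your custom models or features into the SageMaker HyperPod recipe adapter is a similar process. In addition to modifying the recipe adapter, you can change your own pre-training or fine-tuning script. For guidance on writing your custom training script, see the [examples](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/tree/main/examples).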

## 使用 SageMaker HyperPod 轉接器建立您自己的模型
<a name="cluster-specific-configurations-use-hyperpod-adapter-create-model"></a>

在配方轉接器內，您可以在下列位置自訂下列檔案：

1. `collections/data`：包含負責載入資料集的模組。目前，它僅支援來自 HuggingFace 的資料集。如果您有更進階的要求，程式碼結構可讓您在相同的資料夾內新增自訂資料模組。

1. `collections/model`：包括各種語言模型的定義。目前，它支援常見的大型語言模型，例如 Llama、Mixtral 和 Mistral。您可以靈活地在此資料夾內引入自己的模型定義。

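1. `collections/parts`: This folder contains strategies for training models in a distributed manner. One example is the Fully Sharded Data Parallel (FSDP) strategy, which shards a large language model across multiple accelerators. The strategies also support various forms of model parallelism. You can also introduce your own customized training strategies for model training.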

1. `utils`：包含旨在協助管理訓練任務的各種公用程式。它可以做為您自己工具的儲存庫。您可以將自己的工具用於故障診斷或基準測試等任務。您也可以在此資料夾內新增自己的個人化 PyTorch Lightning 回呼。您可以使用 PyTorch Lightning 回呼，將特定功能或操作無縫整合到訓練生命週期。

1. `conf`：包含用於驗證訓練任務中特定參數的組態結構描述定義。如果引入新的參數或組態，您可以將自訂的結構描述新增至此資料夾。您可以使用自訂的結構描述來定義驗證規則。您可以驗證資料類型、範圍或任何其他參數限制條件。您也可以定義自己的自訂結構描述來驗證參數。

# 附錄
<a name="appendix"></a>

使用下列資訊取得監控和分析訓練結果的相關資訊。

## 監控訓練結果
<a name="monitor-training-results"></a>

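Monitoring and analyzing training results is essential for developers to assess convergence and troubleshoot issues. SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. To address the challenges of profiling large distributed training jobs, these recipes also incorporate VizTracer. VizTracer is a low-overhead tool for tracing and visualizing Python code execution. For more information about VizTracer, see [VizTracer](https://viztracer.readthedocs.io/en/latest/installation.html).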

下列各節會引導您完成在 SageMaker HyperPod 配方中實作這些功能的程序。

### TensorBoard
<a name="tensorboard"></a>

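TensorBoard is a powerful tool for visualizing and analyzing the training process. To enable TensorBoard, modify your recipe by setting the following parameters: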

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

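With the TensorBoard logger enabled, the training logs are generated and stored in the experiment directory, which is defined in `exp_manager.exp_dir`. To access and analyze these logs locally, use the following procedure: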

**存取和分析日誌**

1. 將 Tensorboard 實驗資料夾從您的訓練環境下載至本機電腦。

1. 在您的本機電腦上開啟終端機或命令提示。

1. 導覽至包含所下載實驗資料夾的目錄。

1. 使用以下命令啟動 Tensorboard。

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and go to `http://localhost:<port>`, using the same port that you passed to the `tensorboard` command.

您現在可以在 Tensorboard 介面內查看訓練任務的狀態和視覺化。查看狀態和視覺化可協助您監控和分析訓練程序。監控和分析訓練程序可協助您洞悉模型的行為和效能。如需如何使用 Tensorboard 監控和分析訓練的詳細資訊，請參閱 [NVIDIA NeMo Framework 使用者指南](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/index.html)。

### VizTracer
<a name="viztracer"></a>

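To enable VizTracer, modify your recipe by setting the `model.viztracer.enabled` parameter to `true`. For example, you can update your Llama recipe to enable VizTracer by adding the following configuration: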

```
model:
  viztracer:
    enabled: true
```

After training completes, your VizTracer profile is in the experiment folder at `exp_dir/result.json`. To analyze your profile, download it and open it with the vizviewer tool:

```
vizviewer --port <port> result.json
```

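This command starts vizviewer on the port that you specify. You can view VizTracer by opening `http://localhost:<port>` in your browser. After you open VizTracer, you can begin analyzing the training. For more information about using VizTracer, see the VizTracer documentation.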
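
If you want a quick text summary before opening the viewer, you can read the profile directly. The sketch below assumes that `result.json` uses the Chrome Trace Event format that VizTracer writes, in which complete events have `"ph": "X"` and a `dur` field in microseconds. The sample data is synthetic and only stands in for a real profile.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a VizTracer JSON profile from disk."""
    with open(path) as f:
        return json.load(f)


def slowest_functions(trace, top_n=3):
    """Return the top_n (name, total duration in microseconds) pairs, slowest first."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":  # complete events carry a duration
            totals[event["name"]] += event.get("dur", 0.0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Synthetic stand-in for load_trace("result.json").
sample = {
    "traceEvents": [
        {"ph": "M", "name": "process_name"},
        {"ph": "X", "name": "forward", "dur": 1200},
        {"ph": "X", "name": "backward", "dur": 3000},
        {"ph": "X", "name": "forward", "dur": 1300},
        {"ph": "X", "name": "optimizer_step", "dur": 400},
    ]
}
summary = slowest_functions(sample)
print(summary)
```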

## SageMaker JumpStart 與 SageMaker HyperPod
<a name="sagemaker-jumpstart-vs-hyperpod"></a>

雖然 SageMaker JumpStart 提供微調功能，但 SageMaker HyperPod 配方提供下列項目：
+ 對訓練迴圈的其他精細控制
+ 您自己模型和資料的配方自訂
+ 模型平行化的支援

當您需要存取模型的超參數、多節點訓練，以及訓練迴圈的自訂選項時，請使用 SageMaker HyperPod 配方。

For more information about fine-tuning your models in SageMaker JumpStart, see [Fine-tune publicly available foundation models with the `JumpStartEstimator` class](jumpstart-foundation-models-use-python-sdk-estimator-class.md).