# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job on a GPU Kubernetes cluster:
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

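**Prerequisites**  
Before you start setting up your environment, make sure that you have the following:
+ A HyperPod GPU Kubernetes cluster that is set up properly.
+ A shared storage location. This can be an Amazon FSx file system or an NFS system that is accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, which you need if you use model weights from HuggingFace for pre-training or fine-tuning. For more information about access tokens, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).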
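
As a quick sanity check on the data prerequisite above, the following sketch counts the files in a directory whose names look like one of the supported formats. The file extensions (`.json`, `.json.gz`, `.arrow`) and the helper names `is_supported_format` and `count_supported_files` are assumptions made for illustration; this tutorial names the formats but not the extensions.

```shell
#!/bin/bash
# Hypothetical helper: succeed when a file name looks like JSON, JSONGZ, or ARROW data.
# The extension mapping below is an assumption, not something the tutorial specifies.
is_supported_format() {
  case "$1" in
    *.json|*.json.gz|*.arrow) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print how many regular files in a directory match.
count_supported_files() {
  local dir="$1" count=0 f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if is_supported_format "$f"; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

Running `count_supported_files` against your training data directory on the shared file system should print a number greater than zero if the directory holds data under these naming assumptions.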

## Set up your GPU Kubernetes environment
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up a virtual environment. Make sure that you're using Python 3.9 or later.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install the dependencies using one of the following methods:
  + (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```
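
Before moving on, it can help to confirm that the tools from the steps above are in place. The following is a minimal sketch and not part of the HyperPod tooling: `version_ge`, `missing_tools`, and `check_python` are hypothetical helper names, and the check only looks for commands on your `PATH`.

```shell
#!/bin/bash
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Compares dotted MAJOR.MINOR versions numerically (so 3.10 >= 3.9).
version_ge() {
  local have_major="${1%%.*}" want_major="${2%%.*}"
  local have_rest="${1#*.}" want_rest="${2#*.}"
  local have_minor="${have_rest%%.*}" want_minor="${want_rest%%.*}"
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
    return
  fi
  [ "$have_minor" -ge "$want_minor" ]
}

# Hypothetical helper: print each required command that is missing from PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Hypothetical helper: check that the active python3 is version 3.9 or later.
check_python() {
  local version
  version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if version_ge "$version" "3.9"; then
    echo "python3 $version: OK"
  else
    echo "python3 $version: need 3.9 or later"
    return 1
  fi
}
```

For example, `check_python && missing_tools kubectl eksctl helm hyperpod` prints the Python result followed by the name of any tool that still needs to be installed.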

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

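We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` recipe.
+ `your_training_container`: A Deep Learning Container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) If you need pre-trained weights from HuggingFace, you can provide your HuggingFace token by setting the following key-value pair: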

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
```
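
If you submit the same recipe repeatedly, you might prefer to assemble the `--override-parameters` JSON from shell variables instead of editing it by hand. The sketch below assumes a hypothetical helper named `build_overrides`; the keys are the ones used in the command above, and the values you pass in are your own. It does not escape quotes or backslashes, so it is only suitable for simple paths and names.

```shell
#!/bin/bash
# Hypothetical helper: print the override JSON for hyperpod start-job.
# Arguments: run name, experiment dir, container image URI, train dir, validation dir.
# Note: values are inserted verbatim; this sketch does no JSON escaping.
build_overrides() {
  printf '{"recipes.run.name": "%s", "recipes.exp_manager.exp_dir": "%s", "container": "%s", "recipes.model.data.train_dir": "%s", "recipes.model.data.val_dir": "%s", "cluster": "k8s", "cluster_type": "k8s"}\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

You can then pass the quoted command substitution `"$(build_overrides ...)"` as the value of `--override-parameters`.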

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.
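
The monitoring steps above can be sketched as a small helper that reads the `STATUS` column and prints the command to run next. This is an illustrative sketch: `pod_status` and `next_step` are hypothetical names, and `pod_status` assumes the default `kubectl get pod` column order (`NAME`, `READY`, `STATUS`, and so on) shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the STATUS column for one pod.
# Relies on the default column layout of `kubectl get pod` (STATUS is column 3).
pod_status() {
  kubectl get pod "$1" --no-headers | awk '{print $3}'
}

# Hypothetical helper: map a STATUS value to the follow-up command from this tutorial.
next_step() {
  local status="$1" pod="$2"
  case "$status" in
    Pending|ContainerCreating) echo "kubectl describe pod $pod" ;;
    Running)                   echo "kubectl logs $pod" ;;
    Completed)                 echo "done" ;;
    # Any other status (for example, an error state): inspect the pod.
    *)                         echo "kubectl describe pod $pod" ;;
  esac
}
```

Calling `next_step` with the output of `pod_status` and the pod name prints the suggested command. Falling back to `kubectl describe pod` for unrecognized statuses is a design choice of this sketch, not guidance from the tutorial.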

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. To use the recipes, update `k8s.yaml` and `config.yaml`, and then run the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. This mounts the Amazon FSx claim to the `/data` directory of each computing pod.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`:
  + `your_container`: A Deep Learning Container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) If you need pre-trained weights from HuggingFace, you can provide your HuggingFace token by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  # Users should set up their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info, including logs, checkpoints, etc.
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir="$TRAIN_DIR" \
      recipes.model.data.val_dir="$VAL_DIR"
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```
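
Because the launch script ships with placeholder values such as `<region>` and `<your_exp_dir>`, a quick check before launching can catch values that were never filled in. The following sketch uses hypothetical helper names (`has_placeholder`, `find_placeholders`) and treats any uppercase variable assignment whose value starts with a `<...>` token as unfilled; that heuristic is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: succeed when a value still contains an angle-bracket placeholder.
has_placeholder() {
  case "$1" in
    *"<"*">"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print NAME="<...>" assignment lines left unfilled in a script.
# Exits non-zero (grep's behavior) when no such line is found.
find_placeholders() {
  grep -E '^[[:space:]]*[A-Z_]+="<[^>]*>"' "$1"
}
```

For example, running `find_placeholders launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` lists the assignments that still need real values.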

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the `STATUS` shown by `kubectl get pods` is `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).