
Trainium SageMaker training jobs pre-training tutorial

This tutorial walks you through setting up and running a pre-training job with SageMaker training jobs on AWS Trainium instances. It covers two steps:

- Set up your environment
- Launch a training job

Before you begin, make sure that you meet the following prerequisites.

Prerequisites

Before you start setting up your environment, make sure that you have the following:

- An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
- A service quota for ml.trn1.32xlarge requested on Amazon SageMaker AI. To request a service quota increase for ml.trn1.32xlarge:
  1. Navigate to the AWS Service Quotas console.
  2. Choose AWS services.
  3. Select JupyterLab.
  4. Specify one instance for ml.trn1.32xlarge.
- An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess and AmazonEC2FullAccess managed policies. These policies give Amazon SageMaker AI the permissions it needs to run the examples.
- Data in one of the following formats:
  - JSON
  - JSONGZ (compressed JSON)
  - ARROW
- (Optional) A HuggingFace token. If you need pre-trained weights from HuggingFace, or if you are training a Llama 3.2 model, you must get the token before you start training. For more information about access tokens, see "User access tokens".
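
As one possible way to meet the IAM prerequisite above, the following minimal sketch uses boto3 to create a role that SageMaker can assume and attaches the two managed policies named in the list. The helper names (sagemaker_trust_policy, create_training_role) and any role name you pass in are illustrative assumptions, not part of this tutorial or of any AWS SDK. Calling create_training_role requires credentials that are allowed to create IAM roles.

```python
import json

# Managed policies named in the prerequisites.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
]


def sagemaker_trust_policy() -> dict:
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_training_role(role_name: str) -> str:
    """Create the role, attach both managed policies, and return the role ARN.

    role_name is a hypothetical name chosen by the caller.
    """
    import boto3  # imported here so the helpers above work without boto3

    iam = boto3.client("iam")
    response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
    )
    for policy_arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return response["Role"]["Arn"]
```

The AWS call is kept in its own function so that the policy document can be inspected or reviewed before anything is created in your account.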

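Set up your Trainium SageMaker training jobs environment

Before you run a SageMaker training job, use the aws configure command to set your AWS credentials and preferred Region. Alternatively, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see "SageMaker AI Python SDK".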
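
If you take the environment-variable route, a quick check before launching can catch a missing variable early. The sketch below is only an illustration: missing_credential_vars is a hypothetical helper. It treats AWS_SESSION_TOKEN as optional, on the assumption that it is only needed with temporary credentials.

```python
import os
from typing import List, Mapping, Optional

# Variables that must both be present when credentials come from the environment.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    print("Not set in the environment:", ", ".join(missing))
```

A non-empty result does not necessarily mean that you have no credentials. If you ran aws configure, the credentials are stored in a file instead, and this check does not look there.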

We strongly recommend launching SageMaker training jobs from a Jupyter notebook in SageMaker AI JupyterLab. For more information, see "SageMaker JupyterLab".

(Optional) If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip running the following commands. Make sure that you use Python 3.9 or later.


# set up a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate
# install dependencies after git clone.

git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt

Install the SageMaker AI Python SDK.
```
pip3 install --upgrade sagemaker
```
- If you are running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.
  - Append transformers==4.45.2 to the end of requirements.txt in source_dir only when you are using the SageMaker AI Python SDK.
  - If you launch with HyperPod recipes and use sm_jobs as the cluster type, you don't need to specify the transformers version.
- Container: The SageMaker AI Python SDK sets the Neuron container automatically.
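
For the SageMaker AI Python SDK case in the list above, the following is a minimal sketch of appending the pin to requirements.txt without adding it twice. The helper name ensure_transformers_pin is an assumption made for illustration, and source_dir is whichever directory you pass to the estimator.

```python
from pathlib import Path

TRANSFORMERS_PIN = "transformers==4.45.2"


def ensure_transformers_pin(source_dir: str) -> bool:
    """Append the pin to <source_dir>/requirements.txt.

    Returns True if the file was changed, False if the pin was already there.
    """
    req_file = Path(source_dir) / "requirements.txt"
    text = req_file.read_text() if req_file.exists() else ""
    if TRANSFORMERS_PIN in text.splitlines():
        return False
    # Make sure the pin starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    req_file.write_text(text + TRANSFORMERS_PIN + "\n")
    return True
```

The check is an exact line match, so an existing entry with a different version of transformers is left in place and should be reviewed by hand.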

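Launch the training job with a Jupyter Notebook

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches a Llama 3 recipe, selected with the training_recipe argument, as a SageMaker AI training job.

compiler_cache_url: The cache location used to store compiled artifacts, for example an Amazon S3 URL.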


import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 location for the job output; replace the placeholder with your own S3 URI.
output_path = "<s3_output_path>"

# Values that override the defaults in the recipe.
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>",
}

# Sync the TensorBoard logs written inside the container to S3.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"{output_path}/tensorboard",
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"],
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
)

# Replace "your-inputs" with the location of your training data.
estimator.fit(inputs={"train": "your-inputs"}, wait=True)

The preceding code creates a PyTorch estimator object with the training recipe and then starts training with the fit() method. Use the training_recipe parameter to specify the recipe that you want to use for training.
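
The example contains placeholder values such as "<compiler_cache_url>" that must be replaced before the job is submitted. As a hedged illustration, the sketch below walks a nested overrides dictionary and reports any value that still looks like an angle-bracket placeholder. The function unfilled_placeholders is hypothetical and not part of the SageMaker AI Python SDK.

```python
from typing import Any, List


def unfilled_placeholders(config: Any, prefix: str = "") -> List[str]:
    """Return dotted paths of string values that still look like <placeholder>."""
    found: List[str] = []
    if isinstance(config, dict):
        for key, value in config.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            found.extend(unfilled_placeholders(value, path))
    elif isinstance(config, str) and config.startswith("<") and config.endswith(">"):
        found.append(prefix)
    return found


# A shortened copy of the overrides used in the example above.
recipe_overrides = {
    "data": {"train_dir": "/opt/ml/input/data/train"},
    "compiler_cache_url": "<compiler_cache_url>",
}
print(unfilled_placeholders(recipe_overrides))  # ['compiler_cache_url']
```

Running a check like this just before constructing the estimator avoids submitting a job that would fail only after the instances have started.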

Launch the training job with the recipes launcher

Update ./recipes_collection/cluster/sm_jobs.yaml

compiler_cache_url: The URL used to store the artifacts. It can be an Amazon S3 URL.


sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    image_uri: <your_image_uri>
    enable_remote_debug: True
    py_version: py39
  recipe_overrides:
    model:
      exp_manager:
        exp_dir: <exp_dir>
      data:
        train_dir: /opt/ml/input/data/train
        val_dir: /opt/ml/input/data/val
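
The comment on inputs in the configuration above says to set either s3 or file_system, not both. The following sketch shows one way to check that rule on the inputs section after it has been loaded into a Python dictionary. The function names are hypothetical, and the check assumes that an unused section is either absent or contains only null values.

```python
def _has_value(node) -> bool:
    """True if a scalar is not None, or if any nested value in a dict is set."""
    if isinstance(node, dict):
        return any(_has_value(value) for value in node.values())
    return node is not None


def configured_input_source(inputs: dict) -> str:
    """Return 's3' or 'file_system'; raise if both or neither is configured."""
    configured = [key for key in ("s3", "file_system") if _has_value(inputs.get(key))]
    if len(configured) != 1:
        raise ValueError(
            f"Set exactly one of s3 or file_system; found: {configured or 'none'}"
        )
    return configured[0]


# Mirrors the inputs section above; the bucket name is a made-up example.
inputs = {"s3": {"train": "s3://example-bucket/train", "val": None}}
print(configured_input_source(inputs))  # s3
```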

Update ./recipes_collection/config.yaml


defaults:
  - _self_
  - cluster: sm_jobs
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.

instance_type: ml.trn1.32xlarge
base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
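
The comment on cluster_type notes that for bcm, k8s, or sm_jobs the value must match the cluster entry under defaults. The sketch below expresses that rule as a small check. It assumes that the file has already been parsed into a dictionary (for example with a YAML parser), and cluster_type_is_consistent is a hypothetical helper, not part of the recipes launcher.

```python
# Cluster types that, per the comment in config.yaml, must match the cluster default.
MUST_MATCH = {"bcm", "k8s", "sm_jobs"}


def cluster_type_is_consistent(config: dict) -> bool:
    """True when cluster_type agrees with the `cluster` entry in defaults."""
    cluster_type = config.get("cluster_type")
    if cluster_type not in MUST_MATCH:
        return True  # the comment states no matching rule for other values
    selected = None
    for entry in config.get("defaults", []):
        if isinstance(entry, dict) and "cluster" in entry:
            selected = entry["cluster"]
    return selected == cluster_type


# The configuration above, as a YAML parser would typically load it.
config = {
    "defaults": [
        "_self_",
        {"cluster": "sm_jobs"},
        {"recipes": "training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain"},
    ],
    "cluster_type": "sm_jobs",
    "instance_type": "ml.trn1.32xlarge",
}
print(cluster_type_is_consistent(config))  # True
```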

Launch the job with main.py


python3 main.py --config-path recipes_collection --config-name config

For more information about configuring SageMaker training jobs, see "SageMaker training jobs pre-training tutorial (GPU)".
