
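SageMaker training jobs pre-training tutorial (GPU)

This tutorial guides you through setting up and running a pre-training job using SageMaker training jobs with GPU instances.

Set up your environment
Launch a training job using SageMaker HyperPod recipes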

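Before you begin, make sure you meet the following prerequisites.

Prerequisites

Before you start setting up your environment, make sure you have: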

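- An Amazon FSx file system or an Amazon S3 bucket where you can load the data and write the training artifacts.
- A service quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to AWS services.
  2. Choose Amazon SageMaker AI.
  3. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
- An AWS Identity and Access Management (IAM) role with the following managed policies, which give SageMaker AI permissions to run the examples:
  - AmazonSageMakerFullAccess
  - AmazonEC2FullAccess
- Data in one of the following formats:
  - JSON
  - JSONGZ (compressed JSON)
  - ARROW
- (Optional) A HuggingFace token, which you must get if you're using model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see User access tokens.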
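
As a quick sanity check on the data prerequisite, the following sketch classifies a file name into one of the three formats listed above. The extension mapping (`.json`, `.json.gz`, `.jsongz`, `.arrow`) is an assumption made for illustration, since the prerequisites name the formats but not the file extensions, and `detect_format` is a hypothetical helper rather than part of any SageMaker library.

```python
from pathlib import Path
from typing import Optional

# ASSUMPTION: this extension-to-format mapping is illustrative only.
# The prerequisites list the formats (JSON, JSONGZ, ARROW), not extensions.
SUPPORTED_SUFFIXES = {
    ".json": "JSON",
    ".json.gz": "JSONGZ",
    ".jsongz": "JSONGZ",
    ".arrow": "ARROW",
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format name for a file, or None if it is not recognized."""
    name = Path(filename).name.lower()
    # Try the longest suffixes first so ".json.gz" is matched as a whole.
    for suffix in sorted(SUPPORTED_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return SUPPORTED_SUFFIXES[suffix]
    return None
```

Running a check like this over a local copy of the dataset before uploading can catch unsupported files early, before a training job is started.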

GPU SageMaker training jobs environment setup

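Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the aws configure command. As an alternative to the configure command, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see SageMaker AI Python SDK.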
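
If you go the environment-variable route, a small check like the following can confirm that the variables named above are present before you submit a job. This is a minimal sketch: `missing_credential_vars` is a hypothetical helper, and it treats AWS_SESSION_TOKEN as optional because that variable is only needed for temporary credentials.

```python
import os
from typing import List, Mapping, Optional

# Variables named in the text above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
OPTIONAL_VARS = ("AWS_SESSION_TOKEN",)  # only needed for temporary credentials


def missing_credential_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credential_vars()` with no arguments inspects the current process environment; an empty list means both required variables are set.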

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see SageMaker JupyterLab.

(Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or later.


# set up a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate
# install dependencies after git clone.

git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
# Set the aws region.

aws configure set region <your_region>
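
The setup step above asks for Python 3.9 or later. A minimal way to verify that from inside the interpreter you plan to use is sketched below; `meets_minimum` is a hypothetical helper name.

```python
import sys
from typing import Sequence

MIN_PYTHON = (3, 9)  # minimum version stated in the setup step


def meets_minimum(version: Sequence[int] = sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version[:2]) >= MIN_PYTHON
```

Calling `meets_minimum()` without arguments checks the running interpreter.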

Install the SageMaker AI Python SDK
```
pip3 install --upgrade sagemaker
```
Container: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.

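Note
If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.

Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK. For example, append it if you're using the SDK in a SageMaker AI JupyterLab notebook.

If you're using HyperPod recipes to launch with the cluster type sm_jobs, this is done automatically.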
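
The append step described in the note can be scripted. The sketch below adds the pin only when no transformers requirement is listed yet; `ensure_transformers_pin` is a hypothetical helper, and the path you pass is whatever requirements.txt lives in your own source_dir. If the file already lists transformers with a different version, the helper leaves it alone, so review that entry by hand.

```python
import re
from pathlib import Path

PIN = "transformers==4.45.2"  # version taken from the note above

# Matches a requirement whose package name is exactly "transformers"
# (not "sentence-transformers" or "transformers-foo").
_TRANSFORMERS_LINE = re.compile(r"^transformers\s*($|[=<>!~\[;@])", re.IGNORECASE)


def ensure_transformers_pin(requirements_path: str, pin: str = PIN) -> bool:
    """Append the pin if transformers is not listed. Return True if the file changed."""
    path = Path(requirements_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        requirement = line.split("#", 1)[0].strip()  # drop inline comments
        if _TRANSFORMERS_LINE.match(requirement):
            return False  # already listed; leave the existing entry untouched
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + pin + "\n")
    return True
```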

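Launch the training job using a Jupyter Notebook

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI training platform.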


import os
import sagemaker,boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket() 
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)

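The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the fit() method. Use the training_recipe parameter to specify the recipe you want to use for training.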
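
One detail worth noting in the example: it builds S3 URIs with os.path.join, which produces forward slashes on Linux and macOS but backslashes on Windows. If you might run the notebook code on Windows, a platform-independent join is safer. The following is a minimal sketch; `s3_join` is a hypothetical helper, not an SDK function, and the bucket name in the usage note is a placeholder.

```python
import posixpath


def s3_join(base_uri: str, *parts: str) -> str:
    """Join path segments onto an S3 URI using forward slashes on every platform."""
    if not base_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {base_uri!r}")
    # Strip slashes from each segment so a leading "/" cannot reset the join.
    return posixpath.join(base_uri, *(part.strip("/") for part in parts))
```

For example, `s3_join("s3://my-bucket", "output", "tensorboard")` returns `"s3://my-bucket/output/tensorboard"` regardless of the operating system.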

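Note

If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.

Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy an endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following example shows how to use an inference image for deployment: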


from sagemaker import image_uris
container=image_uris.retrieve(framework='pytorch',region='us-west-2',version='2.0',py_version='py310',image_scope='inference', instance_type='ml.p4d.24xlarge')
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.p4d.24xlarge',image_uri=container)

Note

Running the preceding code on a SageMaker notebook instance might need more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run out of space, create a new notebook instance with more storage.
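
To see how much space is left before you hit that limit, you can query the file system from the notebook. This is a minimal sketch using the standard library; the helper names are hypothetical, and the 5 GiB default threshold simply mirrors the figure in the note rather than a measured requirement.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = ".") -> float:
    """Return the free space, in GiB, on the file system that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path: str = ".", required_gib: float = 5.0) -> bool:
    """Return True if at least `required_gib` GiB are free (threshold is an assumption)."""
    return free_gib(path) >= required_gib
```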

Launch the training job with the recipes launcher

Update the ./recipes_collection/cluster/sm_jobs.yaml file to look like the following:


sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
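
The comment on the inputs key says to set either s3 or file_system, not both. The sketch below checks that rule on a dictionary shaped like the parsed inputs section. `validate_inputs` is a hypothetical helper, and the keys inside file_system in the usage note are placeholders, since the configuration above does not show them.

```python
from typing import Any, Mapping


def validate_inputs(inputs: Mapping[str, Any]) -> str:
    """Return the configured input source ("s3" or "file_system").

    Raises ValueError if both sources, or neither, are set.
    """
    configured = [key for key in ("s3", "file_system") if inputs.get(key)]
    if len(configured) != 1:
        raise ValueError(
            f"set exactly one of 's3' or 'file_system', found: {configured or 'none'}"
        )
    return configured[0]
```

For example, `validate_inputs({"s3": {"train": "s3://my-bucket/train", "val": None}})` returns `"s3"`, while passing both an s3 and a file_system entry raises a ValueError.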

Update ./recipes_collection/config.yaml to specify sm_jobs in both cluster and cluster_type.


defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
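
Because the cluster entry under defaults and the top-level cluster_type must agree, a quick consistency check can catch a mismatch before launch. The sketch below works on a dictionary shaped like the parsed config.yaml (how you load the YAML is up to you); both helper names are hypothetical, and only the sm_jobs case used in this tutorial is checked.

```python
from collections.abc import Mapping
from typing import Any, Optional


def selected_cluster(config: Mapping[str, Any]) -> Optional[str]:
    """Return the value of the `cluster` entry in the `defaults` list, if any."""
    for entry in config.get("defaults", []):
        # Entries are either plain strings ("_self_") or one-key mappings.
        if isinstance(entry, Mapping) and "cluster" in entry:
            return entry["cluster"]
    return None


def uses_sm_jobs(config: Mapping[str, Any]) -> bool:
    """Return True if both `cluster` and `cluster_type` are set to sm_jobs."""
    return selected_cluster(config) == "sm_jobs" and config.get("cluster_type") == "sm_jobs"
```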

Launch the job with the following command:


python3 main.py --config-path recipes_collection --config-name config

For more information about configuring SageMaker training jobs, see "Run a training job on SageMaker training jobs".
