# Step 2: Launch a Training Job Using the SageMaker Python SDK

The SageMaker Python SDK supports managed training of models with ML frameworks such as TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a [SageMaker TensorFlow estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator), a [SageMaker PyTorch estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator), or a [SageMaker generic Estimator](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/estimators.html#sagemaker.estimator.Estimator) that uses the modified training script and the model parallelism configuration.

**Topics**
+ [Using the SageMaker TensorFlow and PyTorch Estimators](#model-parallel-using-sagemaker-pysdk)
+ [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](#model-parallel-customize-container)
+ [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](#model-parallel-bring-your-own-container)

## Using the SageMaker TensorFlow and PyTorch Estimators

The TensorFlow and PyTorch estimator classes contain the `distribution` parameter, which you can use to specify configuration parameters for distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.

The following TensorFlow and PyTorch estimator templates show how to configure the `distribution` parameter to use the SageMaker model parallel library with MPI.

------
#### [ Using the SageMaker TensorFlow estimator ]

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled": True,              # Required
    "parameters": {
        "partitions": 2,          # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,          # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled": True,              # Required
    "processes_per_host": 8,      # Required
    # "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------
#### [ Using the SageMaker PyTorch estimator ]

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,                      # Required
    "parameters": {                       # Required
        "pipeline_parallel_degree": 2,    # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled": True,                      # Required
    "processes_per_host": 8,              # Required
    # "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------

To enable the library, you pass configuration dictionaries to the `"smdistributed"` and `"mpi"` keys through the `distribution` argument of the SageMaker estimator constructors.

**Configuration parameters for SageMaker model parallelism**
+ For the `"smdistributed"` key, pass a dictionary with the `"modelparallel"` key and the following inner dictionaries.

  **Note**
  Using `"modelparallel"` and `"dataparallel"` in one training job is not supported.
  + `"enabled"` – Required. To enable model parallelism, set `"enabled": True`.
  + `"parameters"` – Required. Specify a set of parameters for SageMaker model parallelism.
    + For a complete list of common parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters) in the *SageMaker Python SDK documentation*.

      For TensorFlow, see [TensorFlow-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#tensorflow-specific-parameters).

      For PyTorch, see [PyTorch-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).
    + `"pipeline_parallel_degree"` (in `smdistributed-modelparallel` ...

## Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library

To extend a pre-built SageMaker container and use its model parallelism library, you must use one of the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker
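model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC images with CUDA (`cuxyz`). For a complete list of DLC images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

**Tip**
We recommend that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker model parallelism library.

For example, your Dockerfile should contain a `FROM` statement similar to the following:

```
# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI container to determine the user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
```

Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the `entry_point` for your training script. This should be the same path identified with `ENV SAGEMAKER_SUBMIT_DIRECTORY` in your Dockerfile.

**Tip**
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and use the image URI (`image_uri`) to define a SageMaker estimator for training. For more information, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

After you finish hosting the Docker container and retrieving the image URI of the container, create a SageMaker estimator object as follows (the example uses the generic `Estimator` class). This example assumes that you have already defined `smp_options` and `mpi_options`.

```
import sagemaker
from sagemaker.estimator import Estimator

sagemaker_session = sagemaker.Session()

smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        # The "smdistributed" key takes a dictionary keyed by "modelparallel"
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library

To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.

**Note**
This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the model parallel library**

1. 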
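   Start with one of the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda).

   ```
   # Append the NVIDIA CUDA base image of your choice after FROM
   FROM
   ```

   **Tip**
   The official AWS Deep Learning Containers (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). We recommend that you look into the [official Dockerfiles of AWS Deep Learning Containers for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Container service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (`cuxyz`) folder, and choose the Dockerfile ending with `.gpu` or `.sagemaker.gpu`.

1. To set up a distributed training environment, you need to install software for communication and network devices, such as [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), [NVIDIA Collective Communications Library (NCCL)](https://developer.nvidia.com/nccl), and [Open MPI](https://www.open-mpi.org/). Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of the libraries.

   **Important**
   Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md) to properly set up a SageMaker training environment for distributed training.

   For more information about setting up EFA with NCCL and Open MPI, see [Get started with EFA and MPI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) and [Get started with EFA and NCCL](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html).

1. Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use cross-node Remote Direct Memory Access (RDMA).

   ```
   ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ```

1. Install the dependencies that the SageMaker model parallel library requires.

   1. 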
      Install the [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) library.

      ```
      ARG METIS=metis-5.1.0

      RUN rm /etc/apt/sources.list.d/* \
        && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
        && gunzip -f ${METIS}.tar.gz \
        && tar -xvf ${METIS}.tar \
        && cd ${METIS} \
        && apt-get update \
        && make config shared=1 \
        && make install \
        && cd .. \
        && rm -rf ${METIS}.tar* \
        && rm -rf ${METIS} \
        && rm -rf /var/lib/apt/lists/* \
        && apt-get clean
      ```

   1. Install the [RAPIDS Memory Manager library](https://github.com/rapidsai/rmm#rmm-rapids-memory-manager). This requires [CMake](https://cmake.org/) 3.14 or later.

      ```
      ARG RMM_VERSION=0.15.0

      RUN wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
        && tar -xvf v${RMM_VERSION}.tar.gz \
        && cd rmm-${RMM_VERSION} \
        && INSTALL_PREFIX=/usr/local ./build.sh librmm \
        && cd .. \
        && rm -rf v${RMM_VERSION}.tar* \
        && rm -rf rmm-${RMM_VERSION}
      ```

1. Install the SageMaker model parallel library.

   ```
   RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}
   ```

1. Install the SageMaker data parallel library.

   ```
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. Install the [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit). The toolkit contains the common functionality that is necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.

   ```
   RUN pip install sagemaker-training
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container and host it in Amazon ECR.

   **Tip**
   For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).
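
Once the container is hosted in Amazon ECR, the estimator needs two inputs described earlier on this page: the image URI and the `distribution` dictionary. The following is a minimal sketch of how those two values could be assembled and sanity-checked before they are passed to an estimator. It is not part of the SageMaker Python SDK: the helper names `build_image_uri` and `build_distribution`, the account ID `111122223333`, and the repository name `smd-mp-demo` are hypothetical placeholders, and the checks cover only the keys that the templates on this page mark as required.

```python
def build_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose an ECR image URI in the account.dkr.ecr.region.amazonaws.com/name:tag form."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_distribution(smp_options: dict, mpi_options: dict) -> dict:
    """Check the keys marked as required on this page and return the distribution dict."""
    for key in ("enabled", "parameters"):
        if key not in smp_options:
            raise ValueError(f"smp_options is missing the required key {key!r}")
    for key in ("enabled", "processes_per_host"):
        if key not in mpi_options:
            raise ValueError(f"mpi_options is missing the required key {key!r}")
    # The "smdistributed" key takes a dictionary keyed by "modelparallel".
    return {
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options,
    }


# Hypothetical example values, mirroring the PyTorch template on this page.
smp_options = {
    "enabled": True,
    "parameters": {"pipeline_parallel_degree": 2, "microbatches": 4, "ddp": True},
}
mpi_options = {"enabled": True, "processes_per_host": 8}

distribution = build_distribution(smp_options, mpi_options)
image_uri = build_image_uri("111122223333", "us-west-2", "smd-mp-demo", "latest")

print(image_uri)
print(distribution)
```

The resulting `image_uri` and `distribution` values can then be passed to the estimator constructor through the arguments of the same names. Failing early on a missing key is a design choice of this sketch; the SDK and the library perform their own validation when the training job is launched.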