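# Launching distributed training jobs with SMDDP using the SageMaker Python SDK

To run a distributed training job with the script you adapted in [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md), use the framework or generic estimators of the SageMaker Python SDK. Specify the prepared training script as the entry point script and pass the distributed training configuration. This page walks you through two ways of using the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/index.html).

+ If you want to quickly adopt distributed training jobs in SageMaker AI, configure a SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator class. The framework estimator picks up your training script and automatically matches the correct image URI of the [pre-built PyTorch or TensorFlow Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), based on the value specified for the `framework_version` parameter.
+ If you want to extend one of the pre-built containers or build a custom container to create your own ML environment with SageMaker AI, use the SageMaker AI generic `Estimator` class and specify the image URI of the custom Docker container hosted in Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) in the AWS Region in which you launch the training job. If you use Jupyter notebooks, you should have a SageMaker notebook instance or a SageMaker Studio Classic app running in the same AWS Region. For more information about storing your training data, see the [SageMaker Python SDK data inputs](https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-input) documentation.

**Tip**
We recommend that you use Amazon FSx for Lustre instead of Amazon S3 to improve training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.

**Tip**
To properly run distributed training on EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see *Step 1: Prepare an EFA-enabled security group* in the [Amazon EC2 User Guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security).

Choose one of the following topics to learn how to run a distributed training job with your training script. After you launch a training job, you can monitor system utilization and model performance using [Amazon SageMaker Debugger](train-debugger.md) or Amazon CloudWatch.

While you follow the instructions in the following topics to learn the technical details, we also recommend that you try the [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md) to get started.

**Topics**
+ [Use the PyTorch framework estimators in the SageMaker Python SDK](data-parallel-framework-estimator.md)
+ [Use the SageMaker AI generic estimator to extend pre-built DLC containers](data-parallel-use-python-skd-api.md)
+ [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md)

# Use the PyTorch framework estimators in the SageMaker Python SDK

You can launch distributed training by adding the `distribution` argument to the SageMaker AI framework estimators, [`PyTorch`](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [`TensorFlow`](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). For more details, choose one of the frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library from the following selections.

------
#### [ PyTorch ]

The following launcher options are available for launching PyTorch distributed training.
+ `pytorchddp`: This option runs `mpirun` and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "pytorchddp": { "enabled": True } }
  ```
+ `torch_distributed`: This option runs `torchrun` and sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "torch_distributed": { "enabled": True } }
  ```
+ `smdistributed`: This option also runs `mpirun`, but with `smddprun`, which sets up the environment variables needed for running PyTorch distributed training on SageMaker AI.

  ```
  { "smdistributed": { "dataparallel": { "enabled": True } } }
  ```

If you choose to replace NCCL `AllGather` with SMDDP `AllGather`, you can use all three options. Choose the one that fits your use case.

If you choose to replace NCCL `AllReduce` with SMDDP `AllReduce`, you should choose one of the `mpirun`-based options: `smdistributed` or `pytorchddp`. You can also add MPI options as follows.

```
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.

```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",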
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

**Note**
PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following `requirements.txt` file and save it in the source directory where you save the training script.

```
# requirements.txt
pytorch-lightning
lightning-bolts
```

For example, the tree-structured directory should look like the following.

```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
```

For more information about specifying the source directory that holds the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

**Considerations for activating SMDDP collective operations and using the right distributed training launcher options**
+ SMDDP `AllReduce` and SMDDP `AllGather` are currently not mutually compatible.
+ SMDDP `AllReduce` is activated by default when you use `smdistributed` or `pytorchddp`, which are `mpirun`-based launchers. In that case, NCCL `AllGather` is used.
+ SMDDP `AllGather` is activated by default when you use the `torch_distributed` launcher, and `AllReduce` falls back to NCCL.
+ SMDDP `AllGather` can also be activated when you use the `mpirun`-based launchers, by setting an additional environment variable as follows.

  ```
  export SMDATAPARALLEL_OPTIMIZE_SDP=true
  ```

------
#### [ TensorFlow ]

**Important**
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [TensorFlow (deprecated)](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks-tensorflow).

```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```

------

# Use the SageMaker AI generic estimator to extend pre-built DLC containers

You can customize or extend SageMaker AI pre-built containers to handle any additional functional requirements for your algorithm or model that the pre-built SageMaker AI Docker image doesn't support. For an example of how to extend a pre-built container, see [Extend a Pre-built Container](https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html).

To extend a pre-built container or adapt your own container to use the library, you must use one of the images listed in [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

**Note**
The SageMaker AI framework DLCs support EFA-enabled instance types starting with TensorFlow 2.4.1 and PyTorch 1.8.1. We recommend that you use DLC images that contain TensorFlow 2.4.1 or later and PyTorch 1.8.1 or later.

For example, if you use PyTorch, your Dockerfile should contain a `FROM` statement similar to the following:

```
# SageMaker AI PyTorch image
FROM 763104351884.dkr.ecr..amazonaws.com/pytorch-training:

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker AI, use the /code subdirectory to store your user code.
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM train.py
```

You can further customize your own Docker container to work with SageMaker AI by using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) and the binary file of the SageMaker AI distributed data parallel library. To learn more, see the instructions in the following section.

# Create your own Docker container with the SageMaker AI distributed data parallel library

To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed parallel libraries in your Dockerfile. This section shows how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.

**Note**
This custom Docker option, with the SageMaker AI data parallel library as a binary, is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the data parallel library**

1. Start with a Docker image from [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda). Use the cuDNN developer versions, which contain the CUDA runtime and development tools (headers and libraries), to build from the [PyTorch source code](https://github.com/pytorch/pytorch#from-source).

   ```
   FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
   ```

   **Tip**
   The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). If you want to use the pre-built DLC images as references while following the rest of the instructions, see the [AWS Deep Learning Containers for PyTorch Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch).

1. Add the following arguments to specify the versions of PyTorch and other packages. Also, indicate the Amazon S3 bucket paths to the SageMaker AI data parallel library and to other software that uses AWS resources, such as the Amazon S3 plug-in.

   To use versions of the third-party libraries other than the ones provided in the following code example, we recommend that you look at the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find versions that are tested, compatible, and suitable for your application.

   To find URLs for the `SMDATAPARALLEL_BINARY` argument, see the lookup tables in [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

   ```
   ARG PYTORCH_VERSION=1.10.2
   ARG PYTHON_SHORT_VERSION=3.8
   ARG EFA_VERSION=1.14.1
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
   ARG CONDA_PREFIX="/opt/conda"
   ARG BRANCH_OFI=1.1.3-aws
   ```

1. Set the following environment variables to properly build the SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.

   ```
   # Set ENV variables required to build PyTorch
   ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
   ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
   ENV NCCL_VERSION=2.10.3

   # Add OpenMPI to the path.
   ENV PATH /opt/amazon/openmpi/bin:$PATH

   # Add Conda to path
   ENV PATH $CONDA_PREFIX/bin:$PATH

   # Set this environment variable for SageMaker AI to launch SMDDP correctly.
   ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

   # Add environment variable for processes to be able to call fork()
   ENV RDMAV_FORK_SAFE=1

   # Indicate the container type
   ENV DLC_CONTAINER_TYPE=training

   # Add EFA and SMDDP to LD library path
   ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
   ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
   ```

1. Install or update `curl`, `wget`, and `git` to download and build packages in the subsequent steps.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y --no-install-recommends \
           curl \
           wget \
           git \
       && rm -rf /var/lib/apt/lists/*
   ```

1. Install [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) software for Amazon EC2 network communication.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update
   RUN mkdir /tmp/efa \
       && cd /tmp/efa \
       && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
       && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
       && cd aws-efa-installer \
       && ./efa_installer.sh -y --skip-kmod -g \
       && rm -rf /tmp/efa
   ```

1. Install [Conda](https://docs.conda.io/en/latest/) to handle package management.

   ```
   RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
       chmod +x ~/miniconda.sh && \
       ~/miniconda.sh -b -p $CONDA_PREFIX && \
       rm ~/miniconda.sh && \
       $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
       $CONDA_PREFIX/bin/conda clean -ya
   ```

1. Get, build, and install PyTorch and its dependencies. We build PyTorch [from the source code](https://github.com/pytorch/pytorch#from-source) because we need control over the NCCL version to guarantee compatibility with the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl).

   1. Following the steps in the [PyTorch official dockerfile](https://github.com/pytorch/pytorch/blob/master/Dockerfile), install the build dependencies and set up [ccache](https://ccache.dev/) to speed up recompilation.

      ```
      RUN DEBIAN_FRONTEND=noninteractive \
          apt-get install -y --no-install-recommends \
              build-essential \
              ca-certificates \
              ccache \
              cmake \
              git \
              libjpeg-dev \
              libpng-dev \
          && rm -rf /var/lib/apt/lists/*

      # Setup ccache
      RUN /usr/sbin/update-ccache-symlinks
      RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
      ```

   1. Install [PyTorch's common and Linux dependencies](https://github.com/pytorch/pytorch#install-dependencies).

      ```
      # Common dependencies for PyTorch
      RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

      # Linux specific dependency for PyTorch
      RUN conda install -c pytorch magma-cuda113
      ```

   1. Clone the [PyTorch GitHub repository](https://github.com/pytorch/pytorch).

      ```
      RUN --mount=type=cache,target=/opt/ccache \
          cd / \
          && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
      ```

   1. Install and build a specific [NCCL](https://developer.nvidia.com/nccl) version. To do this, replace the content in PyTorch's default NCCL folder (`/pytorch/third_party/nccl`) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

      ```
      RUN cd /pytorch/third_party/nccl \
          && rm -rf nccl \
          && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
          && cd nccl \
          && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
          && make pkg.txz.build \
          && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
      ```

   1. Build and install PyTorch. This process usually takes slightly more than 1 hour to complete. PyTorch is built using the NCCL version downloaded in the previous step.

      ```
      RUN cd /pytorch \
          && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
          python setup.py install \
          && rm -rf /pytorch
      ```

1. Build and install the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl). This enables [libfabric](https://github.com/ofiwg/libfabric) support for the SageMaker AI data parallel library.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update \
       && apt-get install -y --no-install-recommends \
           autoconf \
           automake \
           libtool
   RUN mkdir /tmp/efa-ofi-nccl \
       && cd /tmp/efa-ofi-nccl \
       && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
       && cd aws-ofi-nccl \
       && ./autogen.sh \
       && ./configure --with-libfabric=/opt/amazon/efa \
           --with-mpi=/opt/amazon/openmpi \
           --with-cuda=/usr/local/cuda \
           --with-nccl=$CONDA_PREFIX \
       && make \
       && make install \
       && rm -rf /tmp/efa-ofi-nccl
   ```

1. Build and install [TorchVision](https://github.com/pytorch/vision.git).

   ```
   RUN pip install --no-cache-dir -U \
       packaging \
       mpi4py==3.0.3
   RUN cd /tmp \
       && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
       && cd vision \
       && BUILD_VERSION="0.9.1+cu111" python setup.py install \
       && cd /tmp \
       && rm -rf vision
   ```

1. Install and configure OpenSSH. MPI requires OpenSSH to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.

   ```
   RUN apt-get update \
       && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
       && apt-get install -y --no-install-recommends openssh-client openssh-server \
       && mkdir -p /var/run/sshd \
       && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
       && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
       && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
       && rm -rf /var/lib/apt/lists/*

   # Configure OpenSSH so that nodes can communicate with each other
   RUN mkdir -p /var/run/sshd && \
       sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
   RUN rm -rf /root/.ssh/ && \
       mkdir -p /root/.ssh/ && \
       ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
       cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
       && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
   ```

1. Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

   ```
   RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
   RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
   ```

1. Install the [libboost](https://www.boost.org/) library. This package is needed for networking the asynchronous IO functionality of the SageMaker AI data parallel library.

   ```
   WORKDIR /
   RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
       && tar -xzf boost_1_73_0.tar.gz \
       && cd boost_1_73_0 \
       && ./bootstrap.sh \
       && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
       && cd .. \
       && rm -rf boost_1_73_0.tar.gz \
       && rm -rf boost_1_73_0 \
       && cd ${CONDA_PREFIX}/include/boost
   ```

1. Install the following SageMaker AI tools for PyTorch training.

   ```
   WORKDIR /root
   RUN pip install --no-cache-dir -U \
       smclarify \
       "sagemaker>=2,<3" \
       sagemaker-experiments==0.* \
       sagemaker-pytorch-training
   ```

1. Finally, install the SageMaker AI data parallel binary and the remaining dependencies.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y --no-install-recommends \
           jq \
           libhwloc-dev \
           libnuma1 \
           libnuma-dev \
           libssl1.1 \
           libtool \
           hwloc \
       && rm -rf /var/lib/apt/lists/*

   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.

The following example code shows a complete Dockerfile that combines all of the previous code blocks.

```
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH

# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1

# Indicate the container type
ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        curl \
        wget \
        git \
    && rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}

# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends \
        autoconf \
        automake \
        libtool
RUN mkdir /tmp/efa-ofi-nccl \
    && cd /tmp/efa-ofi-nccl \
    && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa \
        --with-mpi=/opt/amazon/openmpi \
        --with-cuda=/usr/local/cuda \
        --with-nccl=$CONDA_PREFIX \
    && make \
    && make install \
    && rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision
RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
    && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
    && cd vision \
    && BUILD_VERSION="0.9.1+cu111" python setup.py install \
    && cd /tmp \
    && rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    && apt-get install -y --no-install-recommends openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*

# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training

# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        jq \
        libhwloc-dev \
        libnuma1 \
        libnuma-dev \
        libssl1.1 \
        libtool \
        hwloc \
    && rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
```

**Tip**
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

**Tip**
If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).
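
As a recap of the launcher guidance in the PyTorch framework estimator section, the following is a minimal sketch of how the choice of SMDDP collective narrows the launcher options and how the matching `distribution` dictionary could be assembled. The helper names `launchers_for` and `build_distribution` are hypothetical and are not part of the SageMaker Python SDK. The sketch also assumes that `custom_mpi_options` is meaningful only for the `mpirun`-based launchers, because this page shows it only for `pytorchddp` and `smdistributed`.

```python
# Hypothetical helpers for illustration only; not part of the SageMaker Python SDK.

# Launchers that run mpirun, per the launcher descriptions on this page.
MPIRUN_LAUNCHERS = ("smdistributed", "pytorchddp")
# torch_distributed runs torchrun instead of mpirun.
ALL_LAUNCHERS = MPIRUN_LAUNCHERS + ("torch_distributed",)


def launchers_for(collective):
    """Return the launcher options usable with the given SMDDP collective."""
    if collective == "AllGather":
        return ALL_LAUNCHERS      # all three options work for SMDDP AllGather
    if collective == "AllReduce":
        return MPIRUN_LAUNCHERS   # SMDDP AllReduce needs an mpirun-based launcher
    raise ValueError(f"unknown collective: {collective!r}")


def build_distribution(launcher, custom_mpi_options=None):
    """Build the dictionary to pass as the estimator's `distribution` argument."""
    if launcher not in ALL_LAUNCHERS:
        raise ValueError(f"unknown launcher: {launcher!r}")
    options = {"enabled": True}
    if custom_mpi_options is not None:
        # Assumption: MPI options apply only to the mpirun-based launchers.
        if launcher not in MPIRUN_LAUNCHERS:
            raise ValueError("custom_mpi_options requires an mpirun-based launcher")
        options["custom_mpi_options"] = custom_mpi_options
    if launcher == "smdistributed":
        # smdistributed nests its options under the "dataparallel" key.
        return {"smdistributed": {"dataparallel": options}}
    return {launcher: options}
```

The result of a call such as `build_distribution("pytorchddp")` has the same shape as the dictionaries shown for the `distribution` parameter, so under these assumptions it could be passed directly as `distribution=build_distribution("pytorchddp")` when constructing the estimator.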