# Create your own Docker container with the SageMaker AI distributed data parallel library

To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed parallel libraries in your Dockerfile. This section shows how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.

**Note** This custom Docker option, which uses the SageMaker AI data parallel library as a binary, is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the data parallel library**

1. Start with a Docker image from [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda). Use the cuDNN developer version, which contains the CUDA runtime and development tools (headers and libraries), so that you can build from the [PyTorch source code](https://github.com/pytorch/pytorch#from-source).

   ```
   FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
   ```

   **Tip** The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). If you want to use the prebuilt DLC images as references while following the rest of the instructions, see [AWS Deep Learning Containers for PyTorch Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch).

1. Add the following arguments to specify the versions of PyTorch and other packages. Also specify the Amazon S3 bucket path to the SageMaker AI data parallel library and to other software for using AWS resources, such as the Amazon S3 plug-in. To use versions of third-party libraries other than the ones in the following code example, we recommend that you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find versions that are tested, compatible, and suitable for your application. To find URLs for the `SMDATAPARALLEL_BINARY` argument, see the lookup tables at [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

   ```
   ARG PYTORCH_VERSION=1.10.2
   ARG PYTHON_SHORT_VERSION=3.8
   ARG EFA_VERSION=1.14.1
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
   ARG CONDA_PREFIX="/opt/conda"
   ARG BRANCH_OFI=1.1.3-aws
   ```

1. Set the following environment variables to properly build the SageMaker training components and run the data parallel library. The subsequent steps use these variables.

   ```
   # Set ENV variables required to build PyTorch
   ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
   ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
   ENV NCCL_VERSION=2.10.3

   # Add OpenMPI to the path.
   ENV PATH=/opt/amazon/openmpi/bin:$PATH

   # Add Conda to path
   ENV PATH=$CONDA_PREFIX/bin:$PATH

   # Set this environment variable for SageMaker AI to launch SMDDP correctly.
   ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

   # Add environment variable for processes to be able to call fork()
   ENV RDMAV_FORK_SAFE=1

   # Indicate the container type
   ENV DLC_CONTAINER_TYPE=training

   # Add EFA and SMDDP to LD library path
   ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
   ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
   ```

1. Install or update `curl`, `wget`, and `git` to download and build packages in the subsequent steps.

   ```
   # The --mount option requires Docker BuildKit.
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y --no-install-recommends \
           curl \
           wget \
           git \
       && rm -rf /var/lib/apt/lists/*
   ```

1. Install the [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) software for Amazon EC2 network communication.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update
   RUN mkdir /tmp/efa \
       && cd /tmp/efa \
       && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
       && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
       && cd aws-efa-installer \
       && ./efa_installer.sh -y --skip-kmod -g \
       && rm -rf /tmp/efa
   ```

1. Install [Conda](https://docs.conda.io/en/latest/) to handle package management.

   ```
   RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
       chmod +x ~/miniconda.sh && \
       ~/miniconda.sh -b -p $CONDA_PREFIX && \
       rm ~/miniconda.sh && \
       $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
       $CONDA_PREFIX/bin/conda clean -ya
   ```

1. Get, build, and install PyTorch and its dependencies. We build [PyTorch from the source code](https://github.com/pytorch/pytorch#from-source) because we need control over the NCCL version to guarantee compatibility with the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl).

   1. Following the steps in the [PyTorch official Dockerfile](https://github.com/pytorch/pytorch/blob/master/Dockerfile), install the build dependencies and set up [ccache](https://ccache.dev/) to speed up recompilation.

      ```
      RUN DEBIAN_FRONTEND=noninteractive \
          apt-get install -y --no-install-recommends \
              build-essential \
              ca-certificates \
              ccache \
              cmake \
              git \
              libjpeg-dev \
              libpng-dev \
          && rm -rf /var/lib/apt/lists/*

      # Setup ccache
      RUN /usr/sbin/update-ccache-symlinks
      RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
      ```

   1. Install [PyTorch's common and Linux dependencies](https://github.com/pytorch/pytorch#install-dependencies).

      ```
      # Common dependencies for PyTorch
      RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

      # Linux specific dependency for PyTorch
      RUN conda install -c pytorch magma-cuda113
      ```

   1. Clone the [PyTorch GitHub repository](https://github.com/pytorch/pytorch).

      ```
      RUN --mount=type=cache,target=/opt/ccache \
          cd / \
          && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
      ```

   1. Install and build a specific [NCCL](https://developer.nvidia.com/nccl) version. To do this, replace the content of PyTorch's default NCCL folder (`/pytorch/third_party/nccl`) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

      ```
      RUN cd /pytorch/third_party/nccl \
          && rm -rf nccl \
          && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
          && cd nccl \
          && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
          && make pkg.txz.build \
          && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
      ```

   1. Build and install PyTorch. This process usually takes slightly more than 1 hour to complete. The build uses the NCCL version downloaded in the previous step.

      ```
      RUN cd /pytorch \
          && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
          python setup.py install \
          && rm -rf /pytorch
      ```

1. Build and install the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl). This enables [libfabric](https://github.com/ofiwg/libfabric) support for the SageMaker AI data parallel library.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update \
       && apt-get install -y --no-install-recommends \
           autoconf \
           automake \
           libtool
   RUN mkdir /tmp/efa-ofi-nccl \
       && cd /tmp/efa-ofi-nccl \
       && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
       && cd aws-ofi-nccl \
       && ./autogen.sh \
       && ./configure --with-libfabric=/opt/amazon/efa \
           --with-mpi=/opt/amazon/openmpi \
           --with-cuda=/usr/local/cuda \
           --with-nccl=$CONDA_PREFIX \
       && make \
       && make install \
       && rm -rf /tmp/efa-ofi-nccl
   ```

1. Build and install [TorchVision](https://github.com/pytorch/vision.git).

   ```
   RUN pip install --no-cache-dir -U \
       packaging \
       mpi4py==3.0.3
   RUN cd /tmp \
       && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
       && cd vision \
       && BUILD_VERSION="0.9.1+cu111" python setup.py install \
       && cd /tmp \
       && rm -rf vision
   ```

1. Install and configure OpenSSH. MPI requires OpenSSH to communicate between containers. Configure OpenSSH to connect to containers without asking for confirmation.

   ```
   RUN apt-get update \
       && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
       && apt-get install -y --no-install-recommends openssh-client openssh-server \
       && mkdir -p /var/run/sshd \
       && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
       && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
       && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
       && rm -rf /var/lib/apt/lists/*

   # Configure OpenSSH so that nodes can communicate with each other
   RUN mkdir -p /var/run/sshd && \
       sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
   RUN rm -rf /root/.ssh/ && \
       mkdir -p /root/.ssh/ && \
       ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
       cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
       && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
   ```

1. Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

   ```
   RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
   RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
   ```

1. Install the [libboost](https://www.boost.org/) library. The SageMaker AI data parallel library needs this package for its networking asynchronous IO functionality.

   ```
   WORKDIR /
   RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
       && tar -xzf boost_1_73_0.tar.gz \
       && cd boost_1_73_0 \
       && ./bootstrap.sh \
       && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
       && cd .. \
       && rm -rf boost_1_73_0.tar.gz \
       && rm -rf boost_1_73_0 \
       && cd ${CONDA_PREFIX}/include/boost
   ```

1. Install the following SageMaker AI tools for PyTorch training.

   ```
   WORKDIR /root
   RUN pip install --no-cache-dir -U \
       smclarify \
       "sagemaker>=2,<3" \
       sagemaker-experiments==0.* \
       sagemaker-pytorch-training
   ```

1. Finally, install the SageMaker AI data parallel binary and the remaining dependencies.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y --no-install-recommends \
           jq \
           libhwloc-dev \
           libnuma1 \
           libnuma-dev \
           libssl1.1 \
           libtool \
           hwloc \
       && rm -rf /var/lib/apt/lists/*

   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. After you finish creating the Dockerfile, see [Adapting your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.

The following example code shows the complete Dockerfile that results from combining all the previous code blocks.

```
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.
ENV PATH=/opt/amazon/openmpi/bin:$PATH

# Add Conda to path
ENV PATH=$CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1

# Indicate the container type
ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies
# (the --mount option requires Docker BuildKit)
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        curl \
        wget \
        git \
    && rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}

# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends \
        autoconf \
        automake \
        libtool
RUN mkdir /tmp/efa-ofi-nccl \
    && cd /tmp/efa-ofi-nccl \
    && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa \
        --with-mpi=/opt/amazon/openmpi \
        --with-cuda=/usr/local/cuda \
        --with-nccl=$CONDA_PREFIX \
    && make \
    && make install \
    && rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision
RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
    && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
    && cd vision \
    && BUILD_VERSION="0.9.1+cu111" python setup.py install \
    && cd /tmp \
    && rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    && apt-get install -y --no-install-recommends openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*

# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training

# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        jq \
        libhwloc-dev \
        libnuma1 \
        libnuma-dev \
        libssl1.1 \
        libtool \
        hwloc \
    && rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
```

**Tip** For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use your own training algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

**Tip** If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see [Create your own Docker container with the SageMaker distributed model parallel library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).
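
If you change the versions in the `ARG` block, it is easy to leave a wheel URL pointing at a build for a different Python version. The following Python sketch is one possible way to catch that before starting the hour-long image build. It expands `${...}` references between `ARG` defaults and compares the CPython tag in each wheel file name (for example `cp38`) with `PYTHON_SHORT_VERSION`. The helper names `parse_args`, `wheel_python_tags`, and `check_python_tags` are hypothetical and are not part of any SageMaker AI or Docker tool. The sketch assumes that wheel file names carry `cpXY` tags, and it does not check that the URLs exist or that the library versions are compatible with each other.

```python
import re

# ARG defaults copied from the Dockerfile in this guide.
DOCKERFILE_ARGS = """\
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
"""


def parse_args(text):
    """Collect `ARG NAME=value` defaults, expanding ${NAME} from earlier ARGs."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*ARG\s+(\w+)=(.*)$", line)
        if not match:
            continue
        name, raw = match.group(1), match.group(2).strip().strip('"')
        # Unknown references are left untouched rather than guessed.
        values[name] = re.sub(
            r"\$\{(\w+)\}",
            lambda m: values.get(m.group(1), m.group(0)),
            raw,
        )
    return values


def wheel_python_tags(url):
    """Return the CPython tags (assumed form: cpXY) found in a wheel file name."""
    file_name = url.rsplit("/", 1)[-1]
    return set(re.findall(r"-(cp\d+)(?=-)", file_name))


def check_python_tags(values):
    """Return the names of wheel ARGs whose tag disagrees with PYTHON_SHORT_VERSION."""
    expected = "cp" + values["PYTHON_SHORT_VERSION"].replace(".", "")
    mismatched = []
    for name, value in values.items():
        if value.endswith(".whl") and wheel_python_tags(value) != {expected}:
            mismatched.append(name)
    return mismatched


args = parse_args(DOCKERFILE_ARGS)
print(args["SMDATAPARALLEL_BINARY"])  # URL with ${PYTORCH_VERSION} expanded
print(check_python_tags(args))        # names of mismatched wheel ARGs, if any
```

To use the sketch on your own file, replace `DOCKERFILE_ARGS` with the text of your Dockerfile. A non-empty result from `check_python_tags` suggests that a wheel URL needs to be looked up again in the supported frameworks tables.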