Use the PyTorch framework estimators in the SageMaker Python SDK

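To launch distributed training, add the distribution argument to one of the SageMaker AI framework estimators, PyTorch or TensorFlow. For details, choose one of the following frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library.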

PyTorch

The following launcher options are available for launching PyTorch distributed training.

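pytorchddp: This option runs mpirun and sets the environment variables required to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.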
```
{ "pytorchddp": { "enabled": True } }
```
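torch_distributed: This option runs torchrun and sets the environment variables required to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.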
```
{ "torch_distributed": { "enabled": True } }
```
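smdistributed: This option also runs mpirun, but uses smddprun to set the environment variables required to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.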
```
{ "smdistributed": { "dataparallel": { "enabled": True } } }
```
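The three dictionaries above differ only in their key and nesting, so a small helper can make the choice explicit. The following is a minimal sketch; the function name build_distribution is hypothetical and is not part of the SageMaker Python SDK. It only reproduces the dictionaries shown above.

```python
def build_distribution(launcher: str) -> dict:
    """Return the distribution dictionary for one of the three launcher options.

    Hypothetical helper for illustration; it is not part of the SageMaker SDK.
    """
    if launcher == "pytorchddp":
        return {"pytorchddp": {"enabled": True}}
    if launcher == "torch_distributed":
        return {"torch_distributed": {"enabled": True}}
    if launcher == "smdistributed":
        # smdistributed nests its settings one level deeper, under "dataparallel".
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    raise ValueError(f"Unknown launcher option: {launcher!r}")


if __name__ == "__main__":
    for name in ("pytorchddp", "torch_distributed", "smdistributed"):
        print(name, "->", build_distribution(name))
```

The returned value is what you would pass as the distribution argument of the estimator.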

If you chose to replace NCCL AllGather with SMDDP AllGather, you can use any of the three options. Choose the one that fits your use case.

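If you chose to replace NCCL AllReduce with SMDDP AllReduce, choose one of the mpirun-based options: smdistributed or pytorchddp. You can also add extra MPI options through the custom_mpi_options key, as follows.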


```
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```
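Because custom_mpi_options sits at a different nesting level for each mpirun-based launcher, it can be convenient to build the option string and attach it programmatically. The following is a minimal sketch; add_mpi_options is a hypothetical helper, not an SDK function. It rejects torch_distributed because this page shows custom_mpi_options only for the mpirun-based launchers.

```python
import copy


def add_mpi_options(distribution: dict, flags: list, env: dict) -> dict:
    """Return a copy of `distribution` with a custom_mpi_options string attached.

    Hypothetical helper for illustration. `flags` are raw mpirun flags such as
    "-verbose"; each `env` entry becomes "-x NAME=VALUE", which asks mpirun to
    export that environment variable to the launched processes.
    """
    result = copy.deepcopy(distribution)  # leave the caller's dictionary untouched
    parts = list(flags) + [f"-x {name}={value}" for name, value in env.items()]
    options = " ".join(parts)

    if "pytorchddp" in result:
        result["pytorchddp"]["custom_mpi_options"] = options
    elif "smdistributed" in result:
        result["smdistributed"]["dataparallel"]["custom_mpi_options"] = options
    else:
        raise ValueError("custom_mpi_options is only shown for mpirun-based launchers")
    return result


if __name__ == "__main__":
    base = {"smdistributed": {"dataparallel": {"enabled": True}}}
    print(add_mpi_options(base, ["-verbose"], {"NCCL_DEBUG": "VERSION"}))
```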

The following code example shows the basic structure of a PyTorch estimator with the distributed training options.


```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    # Training job names accept letters, digits, and hyphens only (no underscores)
    base_job_name="training-job-name-prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP (keep exactly one of the following lines)
    distribution={ "pytorchddp": { "enabled": True } },  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } },  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } },  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

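Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following requirements.txt file and save it in the source directory that holds your training script.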


```
# requirements.txt
pytorch-lightning
lightning-bolts
```

For example, the directory tree should look like the following.


```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── subdirectory-to-your-code
    ├── adapted-training-script.py
    └── requirements.txt
```

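For more information about specifying the source directory that holds the requirements.txt file and your training script, and about submitting the job, see Using third-party libraries in the Amazon SageMaker AI Python SDK documentation.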
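As an illustration of the layout described in the note, the following minimal sketch writes a requirements.txt file into a source directory. The helper name write_requirements is hypothetical. The directory name reuses the source_dir placeholder from the estimator example, and the sketch writes into a temporary directory so that it doesn't touch a real project.

```python
import tempfile
from pathlib import Path


def write_requirements(source_dir: Path, packages: list) -> Path:
    """Create `source_dir` if needed and write one package name per line.

    Hypothetical helper for illustration only.
    """
    source_dir.mkdir(parents=True, exist_ok=True)
    requirements = source_dir / "requirements.txt"
    requirements.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return requirements


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_requirements(
            Path(tmp) / "subdirectory-to-your-code",
            ["pytorch-lightning", "lightning-bolts"],
        )
        print(path.read_text(encoding="utf-8"))
```

The training script itself (adapted-training-script.py in the example) would sit next to this file in the same directory.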

Considerations for activating SMDDP collective operations and choosing the right distributed training launcher option

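SMDDP AllReduce and SMDDP AllGather are currently not compatible with each other.
With smdistributed or pytorchddp (the mpirun-based launchers), SMDDP AllReduce is activated by default and NCCL AllGather is used.
With the torch_distributed launcher, SMDDP AllGather is activated by default and AllReduce falls back to NCCL.
SMDDP AllGather can also be activated with the mpirun-based launchers by setting an additional environment variable, as follows.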
```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```
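The considerations above can be summarized as a small lookup. The following is a minimal sketch, not SMDDP code; the function name smddp_collectives is hypothetical. It rests on one assumption drawn from the first consideration: because the two SMDDP collectives are not compatible with each other, activating SMDDP AllGather on an mpirun-based launcher is taken to mean that AllReduce falls back to NCCL.

```python
MPIRUN_LAUNCHERS = ("smdistributed", "pytorchddp")


def smddp_collectives(launcher: str, optimize_sdp: bool = False) -> dict:
    """Report which backend serves AllReduce and AllGather for a launcher.

    `optimize_sdp` stands for SMDATAPARALLEL_OPTIMIZE_SDP=true being set.
    Hypothetical helper that only encodes the considerations listed above.
    """
    if launcher == "torch_distributed":
        # torchrun: SMDDP AllGather by default, AllReduce falls back to NCCL.
        return {"AllReduce": "NCCL", "AllGather": "SMDDP"}
    if launcher in MPIRUN_LAUNCHERS:
        if optimize_sdp:
            # Assumption: the two SMDDP collectives are not used together,
            # so AllReduce is served by NCCL in this configuration.
            return {"AllReduce": "NCCL", "AllGather": "SMDDP"}
        # mpirun default: SMDDP AllReduce with NCCL AllGather.
        return {"AllReduce": "SMDDP", "AllGather": "NCCL"}
    raise ValueError(f"Unknown launcher option: {launcher!r}")


if __name__ == "__main__":
    print(smddp_collectives("pytorchddp"))
    print(smddp_collectives("pytorchddp", optimize_sdp=True))
    print(smddp_collectives("torch_distributed"))
```

This page doesn't say where the variable should be exported. One possibility is the environment argument that SageMaker estimators accept for container environment variables, but treat that as an assumption to verify rather than a documented SMDDP workflow.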

TensorFlow

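Important

The SMDDP library has discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow versions later than v2.11.0. To find earlier TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).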


```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    # Training job names accept letters, digits, and hyphens only (no underscores)
    base_job_name="training-job-name-prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } },
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```
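Because SMDDP is not available in TensorFlow DLCs later than v2.11.0, it can help to check the framework_version value before submitting a job. The following is a minimal sketch; tensorflow_dlc_may_have_smddp is a hypothetical helper. It checks only the upper bound stated above, assumes a purely numeric version string, and doesn't confirm that a DLC exists for a given version.

```python
def tensorflow_dlc_may_have_smddp(framework_version: str) -> bool:
    """Return True if `framework_version` is not later than 2.11.0.

    Hypothetical helper: it encodes only the upper bound stated in the note
    above and does not check the oldest supported TensorFlow version.
    """
    parts = [int(piece) for piece in framework_version.split(".")]
    parts += [0] * (3 - len(parts))  # treat "2.11" as "2.11.0"
    return tuple(parts[:3]) <= (2, 11, 0)


if __name__ == "__main__":
    for version in ("2.9.2", "2.11.0", "2.12.0"):
        print(version, tensorflow_dlc_may_have_smddp(version))
```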
