Run distributed training on a heterogeneous cluster in Amazon SageMaker AI

Through the distribution argument of the SageMaker AI ModelTrainer class, you can assign a specific instance group to run distributed training. For example, assume that you have the following two instance groups and want to run multi-GPU training on one of them.


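from sagemaker.instance_group import InstanceGroup

# CPU group: one ml.c5.18xlarge instance
instance_group_1 = InstanceGroup("instance_group_1", "ml.c5.18xlarge", 1)
# GPU group: two ml.p3dn.24xlarge instances, the target for distributed training
instance_group_2 = InstanceGroup("instance_group_2", "ml.p3dn.24xlarge", 2)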

You can set the distributed training configuration for one of the instance groups. For example, the following code examples show how to assign instance_group_2, which has two ml.p3dn.24xlarge instances, to the distributed training configuration.

Note

Currently, you can specify only one instance group of a heterogeneous cluster in the distribution configuration. In the SageMaker AI Python SDK v3, the Torchrun distributed configuration does not accept an instance group parameter and applies to all instances in the training job.

With MPI
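
As a minimal sketch, the following shows the shape that an MPI distribution configuration could take when it targets a single instance group. It assumes the dictionary form used by the SageMaker Python SDK framework estimators (an "mpi" entry plus an "instance_groups" entry); the ModelTrainer class may expect a different structure. InstanceGroupSpec and mpi_distribution are hypothetical names introduced for illustration so that the snippet runs without the SDK, and processes_per_host=8 is an assumed value chosen to match the eight GPUs of an ml.p3dn.24xlarge instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroupSpec:
    """Hypothetical stand-in for sagemaker.instance_group.InstanceGroup."""
    instance_group_name: str
    instance_type: str
    instance_count: int


def mpi_distribution(group, processes_per_host):
    """Build an MPI distribution dict that targets exactly one instance group."""
    if processes_per_host < 1:
        raise ValueError("processes_per_host must be at least 1")
    return {
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
        # Only one instance group of the cluster may be listed here.
        "instance_groups": [group],
    }


instance_group_2 = InstanceGroupSpec("instance_group_2", "ml.p3dn.24xlarge", 2)

# Assumption: 8 processes per host, one per GPU on ml.p3dn.24xlarge.
distribution = mpi_distribution(instance_group_2, processes_per_host=8)
print(distribution["instance_groups"][0].instance_group_name)
```

With the real SDK, replace InstanceGroupSpec with the InstanceGroup objects defined earlier and pass the resulting configuration to the training job.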

With the SageMaker AI data parallel library

Note

When using the SageMaker AI data parallel library, make sure that the instance group consists of instance types that the library supports.

For more information about the SageMaker AI data parallel library, see SageMaker AI Data Parallel Training.
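
The following sketch illustrates one way to build a data parallel distribution configuration for a single instance group and to check the group's instance type first. It assumes the dictionary form used by the SageMaker Python SDK framework estimators (an "smdistributed" entry with "dataparallel" enabled); the ModelTrainer class may expect a different structure. InstanceGroupSpec, data_parallel_distribution, and SUPPORTED_TYPES are hypothetical names, and the contents of SUPPORTED_TYPES are an illustrative placeholder, not the library's actual list of supported instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroupSpec:
    """Hypothetical stand-in for sagemaker.instance_group.InstanceGroup."""
    instance_group_name: str
    instance_type: str
    instance_count: int


# Illustrative placeholder only: consult the library documentation
# for the instance types that are actually supported.
SUPPORTED_TYPES = {"ml.p3dn.24xlarge"}


def data_parallel_distribution(group, supported_types):
    """Build a data parallel distribution dict for one instance group."""
    if group.instance_type not in supported_types:
        raise ValueError(
            f"{group.instance_type} is not in the supplied list of supported types"
        )
    return {
        "smdistributed": {"dataparallel": {"enabled": True}},
        # Only one instance group of the cluster may be listed here.
        "instance_groups": [group],
    }


instance_group_1 = InstanceGroupSpec("instance_group_1", "ml.c5.18xlarge", 1)
instance_group_2 = InstanceGroupSpec("instance_group_2", "ml.p3dn.24xlarge", 2)

distribution = data_parallel_distribution(instance_group_2, SUPPORTED_TYPES)

# The CPU group is rejected by the check.
try:
    data_parallel_distribution(instance_group_1, SUPPORTED_TYPES)
    cpu_group_rejected = False
except ValueError:
    cpu_group_rejected = True
```

Validating the instance type before submitting the job surfaces a misconfigured group locally instead of after the training job has started.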

With the SageMaker AI model parallel library

For more information about the SageMaker AI model parallel library, see SageMaker AI Model Parallel Training.
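
As a rough sketch, a model parallel distribution configuration for a single instance group could look like the following. It assumes the dictionary form used by the SageMaker Python SDK framework estimators (an "smdistributed" entry with "modelparallel" enabled and a "parameters" dictionary) and assumes that the library is launched through MPI; the ModelTrainer class may expect a different structure. InstanceGroupSpec and model_parallel_distribution are hypothetical names, and the parameter values shown are illustrative, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroupSpec:
    """Hypothetical stand-in for sagemaker.instance_group.InstanceGroup."""
    instance_group_name: str
    instance_type: str
    instance_count: int


def model_parallel_distribution(group, parameters, processes_per_host):
    """Build a model parallel distribution dict for one instance group."""
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                # Copy so later changes by the caller do not leak into the config.
                "parameters": dict(parameters),
            }
        },
        # Assumption: the model parallel library is launched through MPI.
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
        # Only one instance group of the cluster may be listed here.
        "instance_groups": [group],
    }


instance_group_2 = InstanceGroupSpec("instance_group_2", "ml.p3dn.24xlarge", 2)

# Illustrative values only; choose parameters that fit your model.
smp_parameters = {"partitions": 2, "microbatches": 4}

distribution = model_parallel_distribution(
    instance_group_2, smp_parameters, processes_per_host=8
)
print(sorted(distribution))
```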
