Use the SMDDP library in your PyTorch Lightning training script
If you want to bring your PyTorch Lightning training script and run a distributed data parallel training job in SageMaker AI, you can do so with minimal changes to your training script: import the smdistributed.dataparallel library's PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables that are preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to "smddp". To learn more, walk through the following instructions, which break down the steps with code examples.
Note
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.
- Import the lightning (PyTorch Lightning) library and the smdistributed.dataparallel.torch modules.

  ```python
  import lightning as pl
  import smdistributed.dataparallel.torch.torch_smddp
  ```
- Instantiate the LightningEnvironment and map its world size and global rank to the environment variables preset by the SageMaker training toolkit.

  ```python
  import os

  from lightning.fabric.plugins.environments.lightning import LightningEnvironment

  env = LightningEnvironment()
  env.world_size = lambda: int(os.environ["WORLD_SIZE"])
  env.global_rank = lambda: int(os.environ["RANK"])
  ```
- For PyTorch DDP – Create an object of the DDPStrategy class with "smddp" for process_group_backend and "gpu" for accelerator, and pass that to the Trainer class.

  ```python
  import lightning as pl
  from lightning.pytorch.strategies import DDPStrategy

  ddp = DDPStrategy(
      cluster_environment=env,
      process_group_backend="smddp",
      accelerator="gpu"
  )
  trainer = pl.Trainer(
      max_epochs=200,
      strategy=ddp,
      devices=num_gpus,
      num_nodes=num_nodes
  )
  ```

  For PyTorch FSDP – Create an object of the FSDPStrategy class (with a wrapping policy of your choice) with "smddp" for process_group_backend, and pass that to the Trainer class.

  ```python
  import lightning as pl
  from functools import partial

  from lightning.pytorch.strategies import FSDPStrategy
  from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

  policy = partial(
      size_based_auto_wrap_policy,
      min_num_params=10000
  )
  fsdp = FSDPStrategy(
      auto_wrap_policy=policy,
      process_group_backend="smddp",
      cluster_environment=env
  )
  trainer = pl.Trainer(
      max_epochs=200,
      strategy=fsdp,
      devices=num_gpus,
      num_nodes=num_nodes
  )
  ```

  The num_gpus and num_nodes values in these snippets are not defined by the steps above; the sketch after this list shows one way to derive them.
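Putting the three steps together, the following is a minimal sketch of an adapted training script, not an official example: the NoopModel module and the random dataset are hypothetical placeholders, and deriving num_gpus and num_nodes from the SM_NUM_GPUS and SM_HOSTS environment variables (preset by the SageMaker training toolkit) is one reasonable way to size the Trainer.

```python
# train_lightning_smddp.py - a minimal sketch, not an official example.
import json
import os

import lightning as pl
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
import torch
from lightning.fabric.plugins.environments.lightning import LightningEnvironment
from lightning.pytorch.strategies import DDPStrategy
from torch.utils.data import DataLoader, TensorDataset


class NoopModel(pl.LightningModule):
    """Hypothetical toy model used only to make the sketch runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    # Assumption: SM_NUM_GPUS and SM_HOSTS are preset by the SageMaker training toolkit.
    num_gpus = int(os.environ["SM_NUM_GPUS"])
    num_nodes = len(json.loads(os.environ["SM_HOSTS"]))

    # Step 2: route Lightning's cluster discovery to the preset variables.
    env = LightningEnvironment()
    env.world_size = lambda: int(os.environ["WORLD_SIZE"])
    env.global_rank = lambda: int(os.environ["RANK"])

    # Step 3: activate the SMDDP library via the "smddp" process group backend.
    ddp = DDPStrategy(
        cluster_environment=env,
        process_group_backend="smddp",
        accelerator="gpu",
    )
    trainer = pl.Trainer(
        max_epochs=2,
        strategy=ddp,
        devices=num_gpus,
        num_nodes=num_nodes,
    )

    # Hypothetical random dataset; Lightning adds a DistributedSampler automatically.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    trainer.fit(NoopModel(), DataLoader(dataset, batch_size=64))


if __name__ == "__main__":
    main()
```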
After you have completed adapting your training script, proceed to Launching distributed training jobs with SMDDP using the SageMaker Python SDK.
Note
When you construct a SageMaker AI PyTorch estimator and submit a training job request in Launching distributed training jobs with SMDDP using the SageMaker Python SDK, you need to provide a requirements.txt file to install pytorch-lightning and lightning-bolts in the SageMaker AI PyTorch training container.
```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries.
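As a point of reference, the following is a minimal sketch of such a job submission with the SageMaker Python SDK. The entry point name, source directory, and instance settings are assumptions for illustration; the distribution argument is what turns on the SMDDP data parallel backend.

```python
# launch.py - a minimal sketch, assuming src/ contains your adapted training
# script and the requirements.txt shown above.
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_lightning_smddp.py",  # hypothetical script name
    source_dir="src",                        # assumed directory holding requirements.txt
    role=sagemaker.get_execution_role(),     # assumes an execution role is available
    framework_version="2.0.1",               # choose a version supported by SMDDP
    py_version="py310",
    instance_count=2,                        # maps to num_nodes
    instance_type="ml.p4d.24xlarge",         # an SMDDP-supported GPU instance type
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```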