

# AWS Deep Learning AMI GPU PyTorch 2.4 (Ubuntu 22.04)
<a name="aws-deep-learning-ami-gpu-pytorch-2.4-ubuntu-22-04"></a>

For help getting started, see [Getting started with DLAMI](getting-started.md).

#### AMI Name Format
<a name="name-gpu-pytorch-2.4-ubuntu-22-04"></a>
+ Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.${PATCH_VERSION} (Ubuntu 22.04) ${YYYY-MM-DD}
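
As an illustration of the format above, the following sketch splits a concrete AMI name into its parts using POSIX shell parameter expansion. The sample name is copied from the 2025-02-17 release entry on this page, and the variable names are illustrative only. Note that the release entries on this page write the date as eight digits without separators.

```shell
# Sample AMI name taken from the 2025-02-17 release entry on this page.
ami_name='Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20250216'

# Build date: everything after the last space.
build_date=${ami_name##* }

# Framework version: the text between "PyTorch " and " (".
rest=${ami_name#*PyTorch }
framework_version=${rest%% (*}

# Patch version: the component after the last dot.
patch_version=${framework_version##*.}

echo "framework=$framework_version patch=$patch_version build_date=$build_date"
```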

#### Supported EC2 instances
<a name="instances-gpu-pytorch-2.4-ubuntu-22-04"></a>
+ Please refer to [Important changes to DLAMI](important-changes.md).
+ Deep Learning with OSS Nvidia Driver supports G4dn, G5, G6, Gr6, P4, P4de, P5, P5e, P5en.

#### The AMI includes the following:
<a name="contents-gpu-pytorch-2.4-ubuntu-22-04"></a>
+ **Supported AWS Service**: EC2
+ **Operating System**: Ubuntu 22.04
+ **Compute Architecture**: x86
+ **Python**: /opt/conda/envs/pytorch/bin/python
+ **NVIDIA Driver**:
  + OSS Nvidia driver: 550.144.03
+ **NVIDIA CUDA 12.4 stack**:
  + CUDA, NCCL and cuDNN installation path: /usr/local/cuda-12.4/
  + **Default CUDA:** 12.4
    + Path /usr/local/cuda points to /usr/local/cuda-12.4/
    + Updated the following environment variables:
      + LD_LIBRARY_PATH to have /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
      + PATH to have /usr/local/cuda/bin/:/usr/local/cuda/include/
  + Compiled system NCCL Version present at /usr/local/cuda/: 2.21.5
  + PyTorch Compiled NCCL Version from PyTorch conda environment: 2.20.5
+ **NCCL Tests Location:**
  + all_reduce, all_gather and reduce_scatter: /usr/local/cuda-xx.x/efa/test-cuda-xx.x/
  + To run NCCL tests, LD_LIBRARY_PATH is already updated with the needed paths.
    + Common paths are already added to LD_LIBRARY_PATH:
      + `/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib`
  + LD_LIBRARY_PATH is updated with CUDA version paths:
    + /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
+ **EFA Installer**: 1.34.0
+ **Nvidia GDRCopy:** 2.4.1
+ **Nvidia Transformer Engine:** v1.11.0
+ **AWS OFI NCCL plugin**: installed as part of the `EFA Installer-aws`
  + **Installation path:** `/opt/aws-ofi-nccl/`. Path `/opt/aws-ofi-nccl/lib` is added to LD_LIBRARY_PATH.
  + **Tests path** for ring, message_transfer: `/opt/aws-ofi-nccl/tests`
  + Note: The PyTorch conda environment also includes a dynamically linked AWS OFI NCCL plugin as the conda package `aws-ofi-nccl-dlc`, and PyTorch uses that package instead of the system AWS OFI NCCL plugin.
+ **AWS CLI v2** as `aws2` and **AWS CLI v1** as `aws`
+ **EBS volume type**: gp3
+ **Python version:** 3.11
+  **Query AMI-ID with SSM Parameter (example Region is us-east-1):** 
  +  **OSS Nvidia Driver:** 

    ```
    aws ssm get-parameter --region us-east-1 \
            --name /aws/service/deeplearning/ami/x86_64/oss-nvidia-driver-gpu-pytorch-2.4-ubuntu-22.04/latest/ami-id \
            --query "Parameter.Value" \
            --output text
    ```
+  **Query AMI-ID with AWSCLI (example Region is us-east-1):** 
  +  **OSS Nvidia Driver:** 

    ```
    aws ec2 describe-images --region us-east-1 \
        --owners amazon \
        --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.? (Ubuntu 22.04) ????????' 'Name=state,Values=available' \
        --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
        --output text
    ```
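
The environment settings listed above can be spot-checked from a shell on a running instance. The following is a minimal sketch, not an official validation tool: `path_list_contains` is a hypothetical helper, and the list of expected directories is a subset copied from this page.

```shell
# Return success if the colon-separated list $1 contains the entry $2.
path_list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# A subset of the directories this page lists for LD_LIBRARY_PATH.
expected_dirs="/usr/local/cuda/lib /usr/local/cuda/lib64 /opt/amazon/efa/lib /opt/amazon/openmpi/lib /opt/aws-ofi-nccl/lib"

missing=0
for dir in $expected_dirs; do
    if path_list_contains "${LD_LIBRARY_PATH:-}" "$dir"; then
        echo "ok       $dir"
    else
        echo "missing  $dir"
        missing=$((missing + 1))
    fi
done
echo "$missing expected entries missing from LD_LIBRARY_PATH"

# Show where the default CUDA symlink points, if it exists.
if [ -L /usr/local/cuda ]; then
    echo "/usr/local/cuda -> $(readlink /usr/local/cuda)"
fi
```

The helper wraps both the list and the entry in colons before matching, so `/usr/local/cuda/lib` is not mistaken for `/usr/local/cuda/lib64`.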

#### Notices
<a name="notices-gpu-pytorch-2.4-ubuntu-22-04"></a>

**P5/P5e instances**
+ DeviceIndex is unique to each NetworkCard and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, so the only valid values for DeviceIndex are 0 and 1. The following example EC2 P5 instance launch command uses the AWS CLI with NetworkCardIndex values 0 through 31, DeviceIndex 0 for the first interface, and DeviceIndex 1 for the remaining 31 interfaces.

```
aws ec2 run-instances --region $REGION \
    --instance-type $INSTANCETYPE \
    --image-id $AMI --key-name $KEYNAME \
    --iam-instance-profile "Name=dlami-builder" \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
    --network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      ...
      "NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
```
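
Because the example above elides most of the 32 interface specifications, it can help to generate them in a loop. The sketch below assumes a POSIX shell and uses placeholder values for the security group and subnet (`sg-EXAMPLE` and `subnet-EXAMPLE` are hypothetical). It only builds and prints the specifications and does not call AWS.

```shell
# Placeholder values (hypothetical); substitute your own security group and subnet.
SG='sg-EXAMPLE'
SUBNET='subnet-EXAMPLE'
CARD_COUNT=32   # network cards 0 through 31, as in the example above

# Accumulate one specification per network card in the positional parameters.
set --
card=0
while [ "$card" -lt "$CARD_COUNT" ]; do
    # DeviceIndex is 0 for the first interface and 1 for all others.
    if [ "$card" -eq 0 ]; then
        device_index=0
    else
        device_index=1
    fi
    set -- "$@" "NetworkCardIndex=$card,DeviceIndex=$device_index,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
    card=$((card + 1))
done

interface_count=$#
first_interface=$1

# Print one specification per line for review.
printf '%s\n' "$@"
```

To use the result, pass `"$@"` as the values of `--network-interfaces` in the `aws ec2 run-instances` command shown earlier.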

#### Release Date: 2025-02-17
<a name="2025-02-17-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name:** Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20250216

##### Updated
<a name="w2aac25c13b7c11c13b5"></a>
+ Updated NVIDIA Container Toolkit from version 1.17.3 to version 1.17.4
  + For more information, see the release notes: [https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.4](https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.4)
  + In Container Toolkit version 1.17.4, the mounting of CUDA compat libraries is now disabled. To ensure compatibility with multiple CUDA versions in container workflows, update your LD_LIBRARY_PATH to include your CUDA compatibility libraries, as shown in the [If you use a CUDA compatibility layer](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-gpu-drivers.html#collapsible-cuda-compat) tutorial.
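
As a rough sketch of that LD_LIBRARY_PATH update, the following shell function prepends a compat directory when it exists and is not already listed. The default location `/usr/local/cuda/compat` is an assumption about the container image, and `prepend_to_ld_library_path` and `CUDA_COMPAT_DIR` are hypothetical names. The linked tutorial describes a fuller approach and should take precedence where it differs.

```shell
# Prepend a directory to LD_LIBRARY_PATH if it exists and is not already listed.
# Returns non-zero when the directory does not exist.
prepend_to_ld_library_path() {
    dir=$1
    [ -d "$dir" ] || return 1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Assumed location of the compat libraries inside the container image.
CUDA_COMPAT_DIR=${CUDA_COMPAT_DIR:-/usr/local/cuda/compat}

if prepend_to_ld_library_path "$CUDA_COMPAT_DIR"; then
    echo "$CUDA_COMPAT_DIR is on LD_LIBRARY_PATH"
else
    echo "No compat directory at $CUDA_COMPAT_DIR; LD_LIBRARY_PATH unchanged"
fi
```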

#### Release Date: 2025-01-21
<a name="2025-01-21-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name:** Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20250119

##### Updated
<a name="w2aac25c13b7c11c15b5"></a>
+ Upgraded Nvidia driver from version 550.127.05 to 550.144.03 to address CVEs present in the [NVIDIA GPU Display Driver Security Bulletin for January 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5614).

#### Release Date: 2024-11-18
<a name="2024-11-18-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name:** Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241116

##### Fixed
<a name="w2aac25c13b7c11c17b5"></a>
+ Due to a change in the Ubuntu kernel that addresses a defect in the Kernel Address Space Layout Randomization (KASLR) functionality, G4Dn/G5 instances are unable to properly initialize CUDA on the OSS Nvidia driver. To mitigate this issue, this DLAMI includes functionality that dynamically loads the proprietary driver for G4Dn and G5 instances. Please allow a brief initialization period for this loading so that your instances work properly.
  + To check the status and health of this service, use the following command:

```
sudo systemctl is-active dynamic_driver_load.service
# Expected output: active
```
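
If a startup script needs to wait for the driver to load, the check above can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: `wait_until_active` is a hypothetical helper, and the attempt count and delay in the commented usage line are arbitrary values, not documented recommendations.

```shell
# Run a status command repeatedly until it prints "active".
# Arguments: maximum attempts, delay in seconds between attempts, then the command.
wait_until_active() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Capture the command's output; ignore its exit status.
        status=$("$@" 2>/dev/null) || true
        if [ "$status" = "active" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example call on the instance (the values 30 and 2 are arbitrary):
# wait_until_active 30 2 sudo systemctl is-active dynamic_driver_load.service
```

The function returns 0 as soon as the command prints `active` and returns 1 if every attempt fails, so a calling script can decide whether to continue or stop.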

#### Release Date: 2024-10-16
<a name="2024-10-16-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name**: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241016

##### Added
<a name="w2aac25c13b7c11c19b5"></a>
+ Added Nvidia TransformerEngine v1.11.0 for accelerating Transformer models. For more details, see [https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html).

#### Release Date: 2024-09-30
<a name="2024-09-30-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name**: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20240929

##### Updated
<a name="w2aac25c13b7c11c21b5"></a>
+ Upgraded Nvidia Container Toolkit from version 1.16.1 to 1.16.2, addressing the security vulnerability [CVE-2024-0133](https://nvd.nist.gov/vuln/detail/CVE-2024-0133).

#### Release Date: 2024-09-26
<a name="2024-09-26-gpu-pytorch-2.4-ubuntu-22-04"></a>

**AMI name**: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20240925

##### Added
<a name="w2aac25c13b7c11c23b5"></a>
+ Initial release of the Deep Learning AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) series, including a conda environment `pytorch` complemented with NVIDIA Driver R550, CUDA=12.4.1, cuDNN=8.9.7, PyTorch NCCL=2.20.5, and EFA=1.34.0.