Use EKS-optimized accelerated AMIs for GPU instances
Amazon EKS supports EKS-optimized Amazon Linux and Bottlerocket AMIs for GPU instances. The EKS-optimized accelerated AMIs simplify running AI and ML workloads in EKS clusters by providing pre-built, validated operating system images for the accelerated Kubernetes stack. In addition to the core Kubernetes components that are included in the standard EKS-optimized AMIs, the EKS-optimized accelerated AMIs include the kernel modules and drivers required to run the NVIDIA GPU G and P EC2 instances, and the AWS Inferentia and Trainium EC2 instances.
The table below shows the supported GPU instance types for each EKS-optimized accelerated AMI variant. See the EKS-optimized AL2023 releases
| EKS AMI variant | EC2 instance types |
|---|---|
| AL2023 x86_64 NVIDIA | p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn |
| AL2023 ARM NVIDIA | p6e-gb200, g5g |
| AL2023 x86_64 Neuron | inf1, inf2, trn1, trn2 |
| Bottlerocket x86_64 aws-k8s-nvidia | p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn |
| Bottlerocket aarch64/arm64 aws-k8s-nvidia | g5g |
| Bottlerocket x86_64 aws-k8s | inf1, inf2, trn1, trn2 |
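When scripting node group creation, the table above can be transcribed into a small lookup. The following is a minimal sketch, not an official API: the names `SUPPORTED` and `ami_variants_for` are hypothetical, and the data is copied from the table as it appears on this page, so it needs updating whenever instance type support changes.

```python
# Instance families copied from the table above; names are illustrative only.
NVIDIA_X86 = ["p6-b200", "p5", "p5e", "p5en", "p4d", "p4de", "p3", "p3dn",
              "gr6", "g6", "g6e", "g6f", "gr6f", "g5", "g4dn"]
NEURON = ["inf1", "inf2", "trn1", "trn2"]

SUPPORTED = {
    "AL2023 x86_64 NVIDIA": NVIDIA_X86,
    "AL2023 ARM NVIDIA": ["p6e-gb200", "g5g"],
    "AL2023 x86_64 Neuron": NEURON,
    "Bottlerocket x86_64 aws-k8s-nvidia": NVIDIA_X86,
    "Bottlerocket aarch64/arm64 aws-k8s-nvidia": ["g5g"],
    "Bottlerocket x86_64 aws-k8s": NEURON,
}


def ami_variants_for(instance_type: str) -> list[str]:
    """Return the AMI variants whose table row lists the instance type's family."""
    # "g5g.xlarge" -> "g5g"; a bare family name such as "p5" is accepted too.
    family = instance_type.split(".", 1)[0].lower()
    return [variant for variant, families in SUPPORTED.items() if family in families]
```

For example, `ami_variants_for("trn1.32xlarge")` returns the two Neuron-capable variants, and an instance type that is not in the table returns an empty list.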
EKS-optimized NVIDIA AMIs
By using the EKS-optimized NVIDIA AMIs, you agree to NVIDIA’s Cloud End User License Agreement (EULA).
To find the latest EKS-optimized NVIDIA AMIs, see Retrieve recommended Amazon Linux AMI IDs and Retrieve recommended Bottlerocket AMI IDs.
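As a rough illustration of a programmatic lookup, the sketch below builds an AWS Systems Manager parameter name for the recommended AL2023 NVIDIA AMI and reads it with boto3. The parameter path layout is an assumption and should be checked against the "Retrieve recommended Amazon Linux AMI IDs" page. The function names are hypothetical, and the Kubernetes version is a placeholder that you supply.

```python
def al2023_nvidia_ami_parameter(kubernetes_version: str, arch: str = "x86_64") -> str:
    """Build the SSM parameter name for the recommended AMI ID (assumed path layout)."""
    if arch not in ("x86_64", "arm64"):
        raise ValueError(f"unsupported architecture: {arch}")
    return (
        f"/aws/service/eks/optimized-ami/{kubernetes_version}"
        f"/amazon-linux-2023/{arch}/nvidia/recommended/image_id"
    )


def fetch_ami_id(parameter_name: str, region: str) -> str:
    """Read the parameter value with boto3; requires AWS credentials and network access."""
    import boto3  # imported lazily so the name builder works without boto3 installed

    ssm = boto3.client("ssm", region_name=region)
    return ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
```

Keeping the name builder separate from the network call makes the path easy to inspect and unit test without touching AWS.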
When using Amazon Elastic Fabric Adapter (EFA) with the EKS-optimized AL2023 or Bottlerocket NVIDIA AMIs, you must install the EFA device plugin separately. For more information, see Run machine learning training on Amazon EKS with Elastic Fabric Adapter.
EKS AL2023 NVIDIA AMIs
When using the NVIDIA GPU operator
In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA AMIs include the following components.
- NVIDIA driver
- NVIDIA CUDA user mode driver
- NVIDIA container toolkit
- NVIDIA fabric manager
- NVIDIA persistenced
- NVIDIA IMEX driver
- NVIDIA NVLink Subnet Manager
- EFA minimal (kernel module and rdma-core)
For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the NVIDIA documentation. The CUDA version shown by nvidia-smi is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.
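To make the compatibility requirement concrete, here is a minimal sketch that reads the CUDA version from the nvidia-smi banner and compares it with the CUDA runtime version of a container image. The comparison rule (the runtime must not be newer than the host's version) is a deliberately conservative assumption, not NVIDIA's full compatibility policy, which the NVIDIA documentation defines. The function names and the sample banner line are illustrative.

```python
import re


def host_cuda_version(nvidia_smi_output: str) -> tuple[int, int]:
    """Extract the (major, minor) CUDA version from nvidia-smi banner text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def runtime_is_supported(host: tuple[int, int], container_runtime: tuple[int, int]) -> bool:
    """Conservative rule (an assumption): the runtime must not be newer than the host version."""
    return container_runtime <= host


# Illustrative banner line; real output comes from running nvidia-smi on the node.
SAMPLE = "| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |"
```

With the sample line, `host_cuda_version(SAMPLE)` yields `(12, 4)`, so a container built against CUDA 12.2 passes the check and one built against CUDA 13.0 does not.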
To track the status of the EKS-optimized NVIDIA AMIs' upgrade to NVIDIA driver version 580, see GitHub issue #2470.
See the EKS AL2023 NVIDIA AMI installation script. To list the installed packages and their versions on a running instance, use the dnf list installed command.
When building custom AMIs with the EKS-optimized AMIs as the base, running an operating system upgrade (for example, dnf upgrade) or upgrading any of the Kubernetes or GPU packages included in the EKS-optimized AMIs is neither recommended nor supported, because it risks breaking component compatibility. If you do upgrade the operating system or these packages, test thoroughly in a development or staging environment before deploying to production.
When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for installation
EKS Bottlerocket NVIDIA AMIs
When using the NVIDIA GPU operator
In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. The minimal dependencies for EFA (kernel module and rdma-core) are installed in all Bottlerocket variants.
- NVIDIA Kubernetes device plugin
- NVIDIA driver
- NVIDIA CUDA user mode driver
- NVIDIA container toolkit
- NVIDIA fabric manager
- NVIDIA persistenced
- NVIDIA IMEX driver
- NVIDIA NVLink Subnet Manager
- NVIDIA MIG manager
For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the NVIDIA documentation. The CUDA version shown by nvidia-smi is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.
See the Bottlerocket Version Information in the Bottlerocket documentation
EKS-optimized Neuron AMIs
For details on how to run training and inference workloads using Neuron with Amazon EKS, see the following references:
- Containers - Kubernetes - Getting Started in the AWS Neuron Documentation
- Training example in AWS Neuron EKS Samples on GitHub
To find the latest EKS-optimized Neuron AMIs, see Retrieve recommended Amazon Linux AMI IDs and Retrieve recommended Bottlerocket AMI IDs.
When using Amazon Elastic Fabric Adapter (EFA) with the EKS-optimized AL2023 or Bottlerocket Neuron AMIs, you must install the EFA device plugin separately. For more information, see Run machine learning training on Amazon EKS with Elastic Fabric Adapter.
EKS AL2023 Neuron AMIs
The EKS-optimized AL2023 Neuron AMIs do not include the Neuron Kubernetes device plugin or the Neuron Kubernetes scheduler extension.
In addition to the standard EKS AMI components, the EKS-optimized AL2023 Neuron AMIs include the following components.
- Neuron driver (aws-neuronx-dkms)
- Neuron tools (aws-neuronx-tools)
- EFA minimal (kernel module and rdma-core)
See the EKS AL2023 Neuron AMI installation script. To list the installed packages and their versions on a running instance, use the dnf list installed command.
EKS Bottlerocket Neuron AMIs
The standard Bottlerocket variants (aws-k8s) include the Neuron dependencies that are automatically detected and loaded when running on AWS Inferentia or Trainium EC2 instances.
The EKS-optimized Bottlerocket AMIs do not include the Neuron Kubernetes device plugin or the Neuron Kubernetes scheduler extension.
In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket Neuron AMIs include the following components.
- Neuron driver (aws-neuronx-dkms)
- EFA minimal (kernel module and rdma-core)
When using the EKS-optimized Bottlerocket AMIs with Neuron instances, the following must be configured in the Bottlerocket user-data. This setting allows the container to take ownership of the mounted Neuron device based on the runAsUser and runAsGroup values provided in the workload specification. For more information on Neuron support in Bottlerocket, see the Quickstart on EKS readme
[settings]
[settings.kubernetes]
device-ownership-from-security-context = true
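If you generate launch templates from code, the setting above can be rendered and encoded as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the fragment contains only the setting quoted on this page (a real node may need further Bottlerocket settings), and the user and group IDs in the security context are placeholder values.

```python
import base64


def neuron_user_data() -> str:
    """Render the Bottlerocket user-data fragment quoted above as TOML text."""
    return (
        "[settings]\n"
        "[settings.kubernetes]\n"
        "device-ownership-from-security-context = true\n"
    )


def encode_user_data(toml_text: str) -> str:
    """Base64-encode user data, the form that EC2 launch templates expect."""
    return base64.b64encode(toml_text.encode("utf-8")).decode("ascii")


# Placeholder IDs: with the setting enabled, the container takes ownership of the
# mounted Neuron device according to these values from the workload specification.
POD_SECURITY_CONTEXT = {"runAsUser": 1000, "runAsGroup": 1000}
```

`encode_user_data(neuron_user_data())` produces the string to pass as launch template user data, and `POD_SECURITY_CONTEXT` shows the shape of the matching `securityContext` fields in the workload specification.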
See the Bottlerocket kernel kit changelog