

# Use P6e-GB200 UltraServers with Amazon EKS
<a name="ml-eks-nvidia-ultraserver"></a>

This topic describes how to configure and use Amazon EKS with P6e-GB200 UltraServers. The `p6e-gb200.36xlarge` instance type, which has 4 NVIDIA Blackwell GPUs, is available only as part of P6e-GB200 UltraServers. There are two types of P6e-GB200 UltraServers: the `u-p6e-gb200x36` UltraServer has 9 `p6e-gb200.36xlarge` instances, and the `u-p6e-gb200x72` UltraServer has 18 `p6e-gb200.36xlarge` instances.
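The UltraServer type names reflect the total GPU count. As a quick check of that arithmetic, the following sketch multiplies the instance counts above by the 4 GPUs per instance. The function name `ultraserver_gpus` is illustrative only and is not part of any AWS tooling.

```shell
# Total GPUs in an UltraServer = instances per UltraServer x GPUs per instance.
# GPUS_PER_INSTANCE reflects the 4 Blackwell GPUs of p6e-gb200.36xlarge.
GPUS_PER_INSTANCE=4

ultraserver_gpus() {
  # $1: number of p6e-gb200.36xlarge instances in the UltraServer
  echo $(( $1 * GPUS_PER_INSTANCE ))
}

echo "u-p6e-gb200x36: $(ultraserver_gpus 9) GPUs"
echo "u-p6e-gb200x72: $(ultraserver_gpus 18) GPUs"
```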

To learn more, see the [Amazon EC2 P6e-GB200 UltraServers webpage](https://aws.amazon.com/ec2/instance-types/p6/).

## Considerations
<a name="nvidia-ultraserver-considerations"></a>
+ Amazon EKS supports P6e-GB200 UltraServers for Kubernetes version 1.33 and later. This Kubernetes version provides support for [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (DRA), which is enabled by default in EKS and in the [AL2023 EKS-optimized accelerated AMIs](https://docs.aws.amazon.com/eks/latest/userguide/ml-eks-optimized-ami.html). DRA is required to use P6e-GB200 UltraServers with EKS. Because DRA is not supported in Karpenter or EKS Auto Mode, it is recommended to use EKS self-managed node groups or EKS managed node groups with P6e-GB200 UltraServers.
+ P6e-GB200 UltraServers are made available through [EC2 Capacity Blocks for ML](https://aws.amazon.com/ec2/capacityblocks/). See [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for information on how to launch EKS nodes with Capacity Blocks.
+ When using EKS managed node groups with Capacity Blocks, you must use custom launch templates. When upgrading EKS managed node groups with P6e-GB200 UltraServers, you must set the desired size of the node group to `0` before upgrading.
+ It is recommended to use the AL2023 ARM NVIDIA variant of the EKS-optimized accelerated AMIs. This AMI includes the required node components and configuration to work with P6e-GB200 UltraServers. If you decide to build your own AMI, you are responsible for installing and validating the compatibility of the node and system software, including drivers. For more information, see [Use EKS-optimized accelerated AMIs for GPU instances](ml-eks-optimized-ami.md).
+ It is recommended to use EKS-optimized AMI release `v20251103` or later, which includes NVIDIA driver version 580. This NVIDIA driver version enables Coherent Driver-based Memory Management (CDMM) to address potential memory over-reporting. When CDMM is enabled, NVIDIA Multi-Instance GPU (MIG) and vGPU are not supported. For more information on CDMM, see [NVIDIA Coherent Driver-based Memory Management (CDMM)](https://nvdam.widen.net/s/gpqp6wmz7s/cuda-whitepaper--cdmm-pdf).
+ When using the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI. The EKS-optimized AL2023 NVIDIA AMIs do not include the NVIDIA Kubernetes device plugin or the NVIDIA DRA driver, and these must be installed separately.
+ Each `p6e-gb200.36xlarge` instance can be configured with up to 17 network cards and can leverage EFA for communication between UltraServers. Workload network traffic can cross UltraServers, but for highest performance it is recommended to schedule workloads in the same UltraServer leveraging IMEX for intra-UltraServer GPU communication. For more information, see [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e).
+ Each `p6e-gb200.36xlarge` instance has 3 x 7.5 TB of [instance store storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html). By default, the EKS-optimized AMI does not format and mount the instance stores. The node’s ephemeral storage can be shared among pods that request ephemeral storage and container images that are downloaded to the node. If you use the AL2023 EKS-optimized AMI, you can configure this as part of the node bootstrap in the user data by setting the instance local storage policy in [NodeConfig](https://docs.aws.amazon.com/eks/latest/eksctl/node-bootstrapping.html#configuring-the-bootstrapping-process) to RAID0. Setting the policy to RAID0 stripes the instance stores and configures the container runtime and kubelet to use this ephemeral storage.
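As a minimal sketch of the RAID0 instance store setting described in the last consideration, the following writes a `NodeConfig` fragment to a file. The file name `nodeconfig-raid0.yaml` is arbitrary. The fragment is assumed to be merged into the rest of your node user data, and the cluster connection fields that self-managed nodes also need are omitted here.

```shell
# Write a NodeConfig fragment that sets the instance local storage strategy to RAID0.
# Assumption: this fragment is combined with the rest of your user data;
# cluster details (name, endpoint, CA, CIDR) are intentionally left out.
cat > nodeconfig-raid0.yaml <<'EOF'
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  instance:
    localStorage:
      strategy: RAID0
EOF

# Print the fragment for review before adding it to a launch template.
cat nodeconfig-raid0.yaml
```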

## Components
<a name="nvidia-ultraserver-components"></a>

The following components are recommended for running workloads on EKS with the P6e-GB200 UltraServers. You can optionally use the [NVIDIA GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) to install the NVIDIA node components. When using the NVIDIA GPU operator with the EKS-optimized AL2023 NVIDIA AMI, you must disable the operator installation of the driver and toolkit, as these are already included in the AMI.



- **EKS-optimized accelerated AMI**
  - Kernel 6.12
  - NVIDIA GPU driver
  - NVIDIA CUDA user mode driver
  - NVIDIA container toolkit
  - NVIDIA fabric manager
  - NVIDIA IMEX driver
  - NVIDIA NVLink Subnet Manager
  - EFA driver

- **Components running on node**
  - VPC CNI
  - EFA DRA driver or EFA device plugin
  - NVIDIA Kubernetes device plugin
  - NVIDIA DRA driver
  - NVIDIA Node Feature Discovery (NFD)
  - NVIDIA GPU Feature Discovery (GFD)



The node components listed above perform the following functions:
+  **VPC CNI**: Allocates VPC IPs as the primary network interface for pods running on EKS
+  **EFA DRA driver or EFA device plugin**: Allocates EFA devices as secondary networks for pods running on EKS. Responsible for network traffic across P6e-GB200 UltraServers. For multi-node workloads, GPU-to-GPU traffic within an UltraServer can flow over multi-node NVLink. The EFA DRA driver is recommended for Kubernetes 1.34 and later and provides topology-aware allocation and device sharing. The EFA device plugin is supported for all Kubernetes versions. For more information, see [Manage EFA devices on Amazon EKS](device-management-efa.md).
+  **NVIDIA Kubernetes device plugin**: Allocates GPUs as devices for pods running on EKS. It is recommended to use the NVIDIA Kubernetes device plugin until the NVIDIA DRA driver GPU allocation functionality graduates from experimental. See the [NVIDIA DRA driver releases](https://github.com/kubernetes-sigs/nvidia-dra-driver-gpu/releases) for updated information.
+  **NVIDIA DRA driver**: Enables ComputeDomain custom resources that facilitate creation of IMEX domains that follow workloads running on P6e-GB200 UltraServers.
  + The ComputeDomain resource describes an Internode Memory Exchange (IMEX) domain. When workloads with a ResourceClaim for a ComputeDomain are deployed to the cluster, the NVIDIA DRA driver automatically creates an IMEX DaemonSet that runs on matching nodes and establishes the IMEX channel(s) between the nodes before the workload is started. To learn more about IMEX, see [overview of NVIDIA IMEX for multi-node NVLink systems](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html).
  + The NVIDIA DRA driver uses a clique ID label (`nvidia.com/gpu.clique`) applied by NVIDIA GFD that relays the knowledge of the network topology and NVLink domain.
  + It is a best practice to create a ComputeDomain per workload job.
+  **NVIDIA Node Feature Discovery (NFD)**: Required dependency for GFD to apply node labels based on discovered node-level attributes.
+  **NVIDIA GPU Feature Discovery (GFD)**: Applies an NVIDIA standard topology label called `nvidia.com/gpu.clique` to the nodes. Nodes within the same `nvidia.com/gpu.clique` have multi-node NVLink reachability, and you can use pod affinity in your application to schedule pods to the same NVLink domain.
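To see how nodes group into NVLink domains, you can count nodes per `nvidia.com/gpu.clique` label value once GFD has labeled them. The following is a sketch only; the function name `nodes_per_clique` is hypothetical, and it assumes `kubectl` is configured for your cluster.

```shell
# Count nodes per NVLink domain, using the nvidia.com/gpu.clique label applied by GFD.
# Only nodes that carry the label are listed (label-existence selector).
nodes_per_clique() {
  kubectl get nodes -l nvidia.com/gpu.clique --no-headers \
    -o custom-columns='CLIQUE:.metadata.labels.nvidia\.com/gpu\.clique' |
    sort | uniq -c
}
```

Running `nodes_per_clique` prints one line per clique ID, prefixed by the number of nodes that share it. Nodes that share a clique ID are candidates for the same ComputeDomain.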

## Procedure
<a name="nvidia-ultraserver-procedure"></a>

The following section assumes you have an EKS cluster running Kubernetes version 1.33 or above with one or more node groups with P6e-GB200 UltraServers running the AL2023 ARM NVIDIA EKS-optimized accelerated AMI. See the links in [Manage compute resources for AI/ML workloads on Amazon EKS](ml-compute-management.md) for the prerequisite steps for EKS self-managed nodes and managed node groups.

The following procedure uses the components below.


| Name | Version | Description | 
| --- | --- | --- | 
| NVIDIA GPU Operator | 25.3.4\+ | For lifecycle management of required plugins such as NVIDIA Kubernetes device plugin and NFD/GFD. | 
| NVIDIA DRA Drivers | 25.8.0\+ | For ComputeDomain CRDs and IMEX domain management. | 
| EFA DRA driver (DRANET) | Latest | For cross-UltraServer communication with topology-aware allocation. Recommended for Kubernetes 1.34\+. | 
| EFA Device Plugin | 0.5.14\+ | For cross-UltraServer communication. Supported for all Kubernetes versions. | 

## Install NVIDIA GPU operator
<a name="nvidia-ultraserver-gpu-operator"></a>

The NVIDIA GPU operator simplifies the management of components required to use GPUs in Kubernetes clusters. Because the NVIDIA GPU driver and container toolkit are already installed as part of the EKS-optimized accelerated AMI, you must set `driver.enabled` and `toolkit.enabled` to `false` in the Helm values configuration.

1. Create a Helm values file named `gpu-operator-values.yaml` with the following configuration.

   ```
   devicePlugin:
     enabled: true
   nfd:
     enabled: true
   gfd:
     enabled: true
   driver:
     enabled: false
   toolkit:
     enabled: false
   migManager:
     enabled: false
   ```

1. Install the NVIDIA GPU operator for your cluster using the `gpu-operator-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --version v25.3.4 \
    --values gpu-operator-values.yaml
   ```
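After the operator components start, each P6e-GB200 node should advertise its 4 GPUs as allocatable `nvidia.com/gpu` resources. The following sketch lists GPU nodes whose allocatable count differs from 4. The function names are hypothetical, and the check assumes the `nvidia.com/gpu.present` label has already been applied to the nodes.

```shell
# List GPU nodes with their allocatable nvidia.com/gpu count.
gpu_allocatable() {
  kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}

# Print the names of nodes that do not report exactly 4 allocatable GPUs.
# A node without the resource shows "<none>" in the GPU column and is also reported.
nodes_missing_gpus() {
  gpu_allocatable | awk -v want=4 '$2 != want { print $1 }'
}
```

An empty result from `nodes_missing_gpus` suggests that the device plugin has registered all GPUs. It can take a few minutes after installation for the count to appear.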

## Install NVIDIA DRA driver
<a name="nvidia-ultraserver-dra-driver"></a>

As of NVIDIA GPU operator version `v25.3.4`, the NVIDIA DRA driver must be installed separately. It is recommended to track the NVIDIA GPU operator [release notes](https://github.com/NVIDIA/gpu-operator/releases) as this may change in a future release.

1. Create a Helm values file named `dra-values.yaml` with the following configuration. Note the `nodeAffinity` and `tolerations` settings, which configure the DRA driver to deploy only on nodes with an NVIDIA GPU.

   ```
   resources:
     gpus:
       enabled: false # set to false to disable experimental gpu support
     computeDomains:
       enabled: true
   
   controller:
     nodeSelector: null
     affinity: null
     tolerations: []
   
   kubeletPlugin:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: "nvidia.com/gpu.present"
               operator: In
               values:
               - "true"
     tolerations:
       - key: "nvidia.com/gpu"
         operator: Exists
         effect: NoSchedule
   ```

1. Install the NVIDIA DRA driver for your cluster using the `dra-values.yaml` file you created in the previous step.

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

   ```
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
     --version="25.8.0" \
     --namespace nvidia-dra-driver-gpu \
     --create-namespace \
     -f dra-values.yaml
   ```

1. After installation, the DRA driver creates `DeviceClass` resources that enable Kubernetes to understand and allocate `ComputeDomain` resources, making the IMEX management possible for distributed GPU workloads on P6e-GB200 UltraServers.

   Confirm the DRA resources are available with the following commands.

   ```
   kubectl api-resources | grep resource.k8s.io
   ```

   ```
   deviceclasses           resource.k8s.io/v1  false        DeviceClass
   resourceclaims          resource.k8s.io/v1  true         ResourceClaim
   resourceclaimtemplates  resource.k8s.io/v1  true         ResourceClaimTemplate
   resourceslices          resource.k8s.io/v1  false        ResourceSlice
   ```

   ```
   kubectl get deviceclasses
   ```

   ```
   NAME
   compute-domain-daemon.nvidia.com
   compute-domain-default-channel.nvidia.com
   ```
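If you script this verification, a small helper can report which of the two `DeviceClass` objects shown above are still missing. This is a sketch with a hypothetical function name, and it assumes `kubectl` is configured for your cluster.

```shell
# Print the name of each expected ComputeDomain DeviceClass that does not exist yet.
missing_device_classes() {
  for dc in compute-domain-daemon.nvidia.com compute-domain-default-channel.nvidia.com; do
    kubectl get deviceclass "$dc" >/dev/null 2>&1 || echo "$dc"
  done
}
```

An empty result from `missing_device_classes` means both device classes are present.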

## Install EFA for cross-UltraServer communication
<a name="nvidia-ultraserver-efa"></a>

To use EFA communication between UltraServers, install the EFA DRA driver (DRANET) or the EFA device plugin. P6e-GB200 instances can be configured with up to [17 network cards](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e). The primary NCI (index 0) must be of type `interface` and supports up to 100 Gbps of ENA bandwidth. Configure your EFA and ENA interfaces according to your requirements during node provisioning. For more details on EFA configuration, see [EFA configuration for P6e-GB200 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e).

**Important**  
Do not install the EFA DRA driver and the EFA device plugin on the same node. The two mechanisms cannot coexist on the same node.
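One way to guard against installing both mechanisms is to count how many of the two DaemonSets exist before you install either option. The sketch below assumes the DaemonSet names that the verification steps on this page show (`aws-dranet` and `aws-efa-k8s-device-plugin-daemonset` in `kube-system`); the function name is hypothetical.

```shell
# Count how many EFA allocation mechanisms are installed in the cluster.
# Assumption: DaemonSet names match the ones used in this page's verification steps.
efa_mechanisms_installed() {
  count=0
  for ds in aws-dranet aws-efa-k8s-device-plugin-daemonset; do
    if kubectl get daemonset -n kube-system "$ds" >/dev/null 2>&1; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

A result of `2` means both DaemonSets exist in the cluster. Whether they overlap on the same node depends on their node selectors and tolerations, so treat this as a prompt to inspect the scheduling rather than as a definitive conflict check.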

### Option 1: Install the EFA DRA driver (DRANET)
<a name="nvidia-ultraserver-efa-dra"></a>

1. Create a Helm values file named `efa-values.yaml` with the following configuration.

   ```
   tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Add the EKS Helm chart repository and install the EFA DRA driver.

   ```
   helm repo add eks https://aws.github.io/eks-charts
   helm repo update
   ```

   ```
   helm install aws-dranet eks/aws-dranet --namespace kube-system -f efa-values.yaml
   ```

1. Verify that the DRANET DaemonSet is running.

   ```
   kubectl get daemonset -n kube-system aws-dranet
   ```

   ```
   NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   aws-dranet    2         2         2       2            2           <none>          60s
   ```

1. Verify that the `DeviceClass` and `ResourceSlice` objects are available.

   ```
   kubectl get deviceclass efa.networking.k8s.aws
   ```

   ```
   NAME                    AGE
   efa.networking.k8s.aws  60s
   ```

   ```
   kubectl get resourceslices -l resource.k8s.io/driver=dra.net
   ```

For more information on using the EFA DRA driver, including topology-aware allocation and device sharing, see [Manage EFA devices on Amazon EKS](device-management-efa.md).

### Option 2: Install the EFA device plugin
<a name="nvidia-ultraserver-efa-plugin"></a>

1. Create a Helm values file named `efa-values.yaml` with the following configuration.

   ```
   tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Add the EKS Helm chart repository and install the EFA device plugin.

   ```
   helm repo add eks https://aws.github.io/eks-charts
   helm repo update
   ```

   ```
   helm install efa eks/aws-efa-k8s-device-plugin -n kube-system -f efa-values.yaml
   ```

1. Verify the EFA device plugin DaemonSet is running.

   ```
   kubectl get daemonset -n kube-system aws-efa-k8s-device-plugin-daemonset
   ```

   ```
   NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
   aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          60s
   ```

1. Verify that your nodes have allocatable EFA devices. As an example, if you configured your instances with 1 efa-only interface in each [NCI group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p6e), you should see 4 allocatable EFA devices per node.

   ```
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
   ```

   ```
   NAME                                           EFA
   ip-192-168-11-225.us-west-2.compute.internal   4
   ip-192-168-24-96.us-west-2.compute.internal    4
   ```
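With the device plugin, a pod consumes EFA devices by requesting the `vpc.amazonaws.com/efa` extended resource, the same resource name queried in the previous command. The following sketch writes an example manifest that requests all 4 GPUs and 4 EFA devices on a node. The pod name, file name, container image, and command are placeholders, not values from this guide.

```shell
# Write an example Pod manifest that requests GPUs and EFA devices.
# Assumption: the node exposes 4 EFA devices, as in the example output above.
cat > efa-request-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: efa-request-example        # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: public.ecr.aws/amazonlinux/amazonlinux:2023   # placeholder image
    command: ["sleep", "3600"]
    resources:
      limits:
        nvidia.com/gpu: 4          # all 4 GPUs on the instance
        vpc.amazonaws.com/efa: 4   # the 4 allocatable EFA devices
EOF
```

You can then create the pod with `kubectl apply -f efa-request-example.yaml`. A real workload would replace the image and command with your training or inference container.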

## Validate IMEX over Multi-Node NVLink
<a name="nvidia-ultraserver-imex-nvlink"></a>

For a multi-node NVLink NCCL test and other micro-benchmarks, review the [awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests) GitHub repository. The following steps show how to run a multi-node NVLink test with nvbandwidth.

1. To run a multi-node bandwidth test across two nodes in the NVL72 domain, first install the MPI operator:

   ```
   kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.7.0/mpi-operator.yaml
   ```

1. Create a file named `nvbandwidth-test-job.yaml` that defines the test manifest. Note the `nvidia.com/gpu.clique` pod affinity, which schedules the workers in the same NVLink domain with multi-node NVLink reachability. The sample below runs a multi-node device-to-device CE Read memcpy test using `cuMemcpyAsync` and prints the results in the logs.

   As of NVIDIA DRA Driver version `v25.8.0`, ComputeDomains are elastic and `.spec.numNodes` can be set to `0` in the ComputeDomain definition. Review the latest [NVIDIA DRA Driver release notes](https://github.com/kubernetes-sigs/nvidia-dra-driver-gpu/releases) for updates.

   There can be only one ComputeDomain (IMEX channel) per node. Do not change the `allocationMode` to `All` for the ComputeDomain resource, as it can prevent the ComputeDomain and Pods accessing that ComputeDomain from being allocated and scheduled correctly. For more information, see [NVIDIA DRA driver issue \#353](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/issues/353).

   ```
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: nvbandwidth-test-compute-domain
   spec:
     numNodes: 0 # This can be set to 0 from NVIDIA DRA Driver version v25.8.0+
     channel:
       resourceClaimTemplate:
         name: nvbandwidth-test-compute-domain-channel

   ---
   apiVersion: kubeflow.org/v2beta1
   kind: MPIJob
   metadata:
     name: nvbandwidth-test
   spec:
     slotsPerWorker: 4 # 4 GPUs per worker node
     launcherCreationPolicy: WaitForWorkersReady
     runPolicy:
       cleanPodPolicy: Running
     sshAuthMountPath: /home/mpiuser/.ssh
     mpiReplicaSpecs:
       Launcher:
         replicas: 1
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-launcher
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-launcher
               securityContext:
                 runAsUser: 1000
               command:
               - mpirun
               args:
               - --bind-to
               - core
               - --map-by
               - ppr:4:node
               - -np
               - "8"
               - --report-bindings
               - -q
               - nvbandwidth
               - -t
               - multinode_device_to_device_memcpy_read_ce
       Worker:
         replicas: 2 # 2 worker nodes
         template:
           metadata:
             labels:
               nvbandwidth-test-replica: mpi-worker
           spec:
             affinity:
               nodeAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                   nodeSelectorTerms:
                   - matchExpressions:
                     # Only schedule on NVIDIA GB200/GB300 nodes
                     - key: node.kubernetes.io/instance-type
                       operator: In
                       values:
                       - p6e-gb200.36xlarge
                       - p6e-gb300.36xlarge
               podAffinity:
                 requiredDuringSchedulingIgnoredDuringExecution:
                 - labelSelector:
                     matchExpressions:
                     - key: nvbandwidth-test-replica
                       operator: In
                       values:
                       - mpi-worker
                   topologyKey: nvidia.com/gpu.clique
             containers:
             - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
               name: mpi-worker
               securityContext:
                 runAsUser: 1000
               env:
               command:
               - /usr/sbin/sshd
               args:
               - -De
               - -f
               - /home/mpiuser/.sshd_config
               resources:
                 limits:
                   nvidia.com/gpu: 4  # Request 4 GPUs per worker
                 claims:
                 - name: compute-domain-channel # Link to IMEX channel
             resourceClaims:
             - name: compute-domain-channel
               resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
   ```

1. Create the ComputeDomain and start the job with the following command.

   ```
   kubectl apply -f nvbandwidth-test-job.yaml
   ```

1. After the ComputeDomain is created, confirm that the workload’s ComputeDomain has two nodes.

   ```
   kubectl get computedomains.resource.nvidia.com -o yaml
   ```

   ```
   status:
     nodes:
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     - cliqueID: <ClusterUUID>.<Clique ID>
       ipAddress: <node-ip>
       name: <node-hostname>
     status: Ready
   ```

1. Review the results of the job with the following command.

   ```
   kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
   ```

   A successful test shows bandwidth statistics in GB/s for the multi-node memcpy test. An example of a successful test output is shown below.

   ```
   ...
   nvbandwidth Version: ...
   Built from Git version: ...

   MPI version: ...
   CUDA Runtime Version: ...
   CUDA Driver Version: ...
   Driver Version: ...

   Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (...)
   Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (...)
   Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (...)
   Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (...)
   Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (...)
   Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (...)
   Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (...)
   Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (...)

   Running multinode_device_to_device_memcpy_read_ce.
   memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
              0         1         2         3         4         5         6         7
    0       N/A    821.45    822.18    821.73    822.05    821.38    822.61    821.89
    1    822.34       N/A    821.67    822.12    821.94    820.87    821.53    822.08
    2    821.76    822.29       N/A    821.58    822.43    821.15    821.82    822.31
    3    822.19    821.84    822.05       N/A    821.67    821.23    820.95    822.47
    4    821.63    822.38    821.49    822.17       N/A    821.06    821.78    822.22
    5    822.08    821.52    821.89    822.35    821.27       N/A    821.64    822.13
    6    821.94    822.15    821.68    822.04    821.39    820.92       N/A    822.56
    7    822.27    821.73    822.11    821.86    822.38    821.04    821.49       N/A

   SUM multinode_device_to_device_memcpy_read_ce ...

   NOTE: The reported results may not reflect the full capabilities of the platform.
   Performance can vary with software drivers, hardware clocks, and system topology.
   ```

1. When the test is complete, delete it with the following command.

   ```
   kubectl delete -f nvbandwidth-test-job.yaml
   ```
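If you automate this validation, two small helpers can stand in for the manual checks in the steps above: one polls the ComputeDomain until its `status.status` field reports `Ready`, and one summarizes the bandwidth matrix from the launcher logs. Both are sketches under stated assumptions. The function names, retry count, and delay are hypothetical, and the parser assumes the matrix layout shown in the example output.

```shell
# Poll a ComputeDomain until .status.status is "Ready".
# $1: ComputeDomain name, $2: max attempts (default 30), $3: delay in seconds (default 10)
wait_for_compute_domain() {
  name="$1"; attempts="${2:-30}"; delay="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(kubectl get computedomains.resource.nvidia.com "$name" \
      -o jsonpath='{.status.status}' 2>/dev/null)"
    if [ "$status" = "Ready" ]; then
      echo "ComputeDomain $name is Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ComputeDomain $name not Ready after $attempts attempts" >&2
  return 1
}

# Read nvbandwidth output on stdin and print the number of measured cells
# with their minimum and maximum bandwidth. Rows start with an integer GPU index;
# only decimal cells are counted, so the header row and N/A cells are skipped.
summarize_bandwidth() {
  awk '
    $1 ~ /^[0-9]+$/ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[0-9]+\.[0-9]+$/) {
          v = $i + 0
          if (n == 0 || v < min) min = v
          if (n == 0 || v > max) max = v
          n++
        }
      }
    }
    END {
      if (n == 0) exit 1
      printf "cells=%d min=%.2f max=%.2f\n", n, min, max
    }
  '
}
```

For example, run `wait_for_compute_domain nvbandwidth-test-compute-domain` after applying the manifest, then pipe the launcher logs into `summarize_bandwidth`. A large gap between the minimum and maximum can indicate that some GPU pairs are not communicating over NVLink.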