

# Best Practices for Running AI/ML Workloads
<a name="aiml"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

Implementing best practices when running AI/ML workloads on EKS helps ensure that those workloads are performant, cost-effective, resilient, and properly resourced. The best practices are divided into the following general sections: Compute, Networking, Storage, Observability, and Performance.

## Feedback
<a name="_feedback"></a>

This guide is being released on GitHub to collect direct feedback and suggestions from the broader EKS/Kubernetes community. If you have a best practice that you feel we ought to include, please file an issue or submit a PR in the GitHub repository. We intend to update the guide periodically as new features are added to the service and as new best practices emerge.

# Compute and Autoscaling
<a name="aiml-compute"></a>

## GPU Resource Optimization and Cost Management
<a name="_gpu_resource_optimization_and_cost_management"></a>

### Schedule workloads with GPU requirements using Well-Known labels
<a name="_schedule_workloads_with_gpu_requirements_using_well_known_labels"></a>

For AI/ML workloads that are sensitive to specific GPU characteristics (e.g., GPU model, GPU memory), we recommend specifying GPU requirements using the [well-known scheduling labels](https://kubernetes.io/docs/reference/labels-annotations-taints/) supported by [Karpenter](https://karpenter.sh/v1.0/concepts/scheduling/#labels) and [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). Failing to define these requirements can result in pods being scheduled on instances with inadequate GPU resources, causing failures or degraded performance. We recommend using [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) or [node affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity) to specify which nodes a pod should run on, and setting compute [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) (CPU, memory, GPUs, etc.) in the pod’s resources section.

 **Example** 

For example, using a GPU-name node selector with Karpenter:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: ml-workload
    image: <image>
    resources:
      limits:
        nvidia.com/gpu: 1  # Request one NVIDIA GPU
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: "l40s"  # Run on nodes with NVIDIA L40S GPUs
```
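If your workload can run on more than one GPU model, node affinity lets you express a set of acceptable values instead of a single one. A minimal sketch, with illustrative GPU names:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: karpenter.k8s.aws/instance-gpu-name
            operator: In
            values: ["l40s", "a10g"]  # Accept nodes with either GPU model
  containers:
  - name: ml-workload
    image: <image>
    resources:
      limits:
        nvidia.com/gpu: 1  # Request one NVIDIA GPU
```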

### Use Kubernetes Device Plugin for exposing GPUs
<a name="_use_kubernetes_device_plugin_for_exposing_gpus"></a>

To expose GPUs on nodes, the NVIDIA GPU driver must be installed on the node’s operating system, and the container runtime must be configured to allow the Kubernetes scheduler to assign pods to nodes with available GPUs. The setup process for the NVIDIA Kubernetes Device Plugin depends on the EKS Accelerated AMI you are using:
+  **[Bottlerocket Accelerated AMI](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami-bottlerocket.html)**: This AMI includes the NVIDIA GPU driver, and the [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) is pre-installed and ready to use, enabling GPU support out of the box. No additional configuration is required to expose GPUs to the Kubernetes scheduler.
+  **[AL2023 Accelerated AMI](https://aws.amazon.com/blogs/containers/amazon-eks-optimized-amazon-linux-2023-accelerated-amis-now-available/)**: This AMI includes the NVIDIA GPU driver, but the [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) is **not** pre-installed. You must install and configure the device plugin separately, typically via a DaemonSet. Note that if you use eksctl to create your cluster and specify a GPU instance type (e.g., `g5.xlarge`) in your ClusterConfig, `eksctl` will automatically select the appropriate AMI and install the NVIDIA Kubernetes Device Plugin. To learn more, see [GPU support](https://eksctl.io/usage/gpu-support/) in the eksctl documentation.

If you decide to use the EKS Accelerated AMIs with the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) to manage components such as the NVIDIA Kubernetes Device Plugin, be sure to disable management of the NVIDIA GPU driver and NVIDIA Container Toolkit, as described in the [Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers-and-nvidia-container-toolkit) NVIDIA documentation.
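When installing the GPU Operator via Helm, this can be done by disabling the corresponding components in the chart values. A sketch based on the chart's `driver.enabled` and `toolkit.enabled` values; verify these against the chart version you use:

```
# values.yaml for the NVIDIA GPU Operator Helm chart
driver:
  enabled: false   # NVIDIA GPU driver is pre-installed by the Accelerated AMI
toolkit:
  enabled: false   # NVIDIA Container Toolkit is pre-installed by the Accelerated AMI
```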

To verify that the NVIDIA Device Plugin is active and GPUs are correctly exposed, run:

```
kubectl describe node | grep nvidia.com/gpu
```

This command checks whether the `nvidia.com/gpu` resource appears in the node’s capacity and allocatable resources. For example, a node with one GPU should show `nvidia.com/gpu: 1`. See the [Kubernetes GPU Scheduling Guide](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) for more information.

### Use many different EC2 instance types
<a name="_use_many_different_ec2_instance_types"></a>

Using as many different EC2 instance types as possible is an important best practice for scalability on Amazon EKS, as outlined in the [Kubernetes Data Plane](scale-data-plane.md) section. This recommendation also applies to instances with accelerated hardware (e.g., GPUs). If you create a cluster that uses only one instance type and try to scale the number of nodes beyond the available capacity for that type in the Region, you may receive an insufficient capacity error (ICE), indicating that no instances are available. It’s important to understand the unique characteristics of your AI/ML workloads before diversifying arbitrarily. Review the available instance types using the [EC2 Instance Type Explorer](https://aws.amazon.com/ec2/instance-explorer/) tool to generate a list of instance types that match your specific compute requirements, and avoid arbitrarily limiting the types of instances that can be used in your cluster.

Accelerated compute instances are offered in different purchase models to fit short-term, medium-term, and steady-state workloads. For short-term, flexible, and fault-tolerant workloads, where you’d like to avoid making a reservation, look into Spot Instances. Capacity Blocks, On-Demand instances, and Savings Plans allow you to provision accelerated compute instances for medium- and long-term workloads. To increase the chances of successfully accessing the required capacity in your preferred purchase option, use a diverse list of instance types and Availability Zones. Alternatively, if you encounter ICEs for a specific purchase model, retry using a different model.

 **Example** The following example shows how to configure a Karpenter NodePool to provision G- and P-series instances of generations greater than 3 (e.g., g4 and p4 onwards). To learn more, see the [EKS Scalability best practices](scalability.md) section.

```
- key: karpenter.k8s.aws/instance-category
  operator: In
  values: ["g", "p"] # Diversifies across G-series and P-series
- key: karpenter.k8s.aws/instance-generation
  operator: Gt
  values: ["3"] # Selects instance generations greater than 3
```

For details on using Spot instances for GPUs, see "Consider using Amazon EC2 Spot Instances for GPUs with Karpenter" below.

### Consider using Amazon EC2 Spot Instances for GPUs with Karpenter
<a name="_consider_using_amazon_ec2_spot_instances_for_gpus_with_karpenter"></a>

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices. Spot Instances can be interrupted with a two-minute notice when EC2 needs the capacity back. For more information, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the Amazon EC2 User Guide. Spot can be a great choice for fault-tolerant, stateless, and flexible (in time and instance type) workloads. To learn more about when to use Spot Instances, see [EC2 Spot Instances Best Practices](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html). You can also use Spot Instances for AI/ML workloads that meet these criteria.

 **Use cases** 

Spot-friendly workloads can be big data, containerized workloads, CI/CD, stateless web servers, high performance computing (HPC), and rendering workloads. Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes (e.g., workloads with parallel processes that depend heavily on each other for computation, requiring constant inter-node communication, such as MPI-based high-performance computing applications like computational fluid dynamics or distributed databases with complex interdependencies). Here are the specific use cases we recommend (in no particular order):
+  **Real-time online inference**: Use Spot instances for cost-optimized scaling of your real-time inference workloads, as long as they are Spot-friendly. In other words, the inference time is less than two minutes, the application is fault-tolerant to interruptions, and it can run on different instance types. Ensure high availability through instance diversity (e.g., across multiple instance types and Availability Zones) or reservations, while implementing application-level fault tolerance to handle potential Spot interruptions.
+  **Hyper-parameter tuning**: Use Spot instances to run exploratory tuning jobs opportunistically, as interruptions can be tolerated without significant loss, especially for short-duration experiments.
+  **Data augmentation**: Use Spot instances to perform data preprocessing and augmentation tasks that can restart from checkpoints if interrupted, making them ideal for Spot’s variable availability.
+  **Fine-tuning models**: Use Spot instances for fine-tuning with robust checkpointing mechanisms to resume from the last saved state, minimizing the impact of instance interruptions.
+  **Batch inference**: Use Spot instances to process large batches of offline inference requests in a non-real-time manner, where jobs can be paused and resumed, offering the best alignment with Spot’s cost savings and handling potential interruptions through retries or diversification.
+  **Opportunistic training subsets**: Use Spot instances for marginal or experimental training workloads (e.g., smaller models under 10 million parameters), where interruptions are acceptable and efficiency optimizations like diversification across instance types or regions can be applied—though not recommended for production-scale training due to potential disruptions.

 **Considerations** 

To use Spot Instances for accelerated workloads on Amazon EKS, there are a number of key considerations (in no particular order):
+  **Use Karpenter to manage Spot instances with advanced consolidation enabled**. By specifying `karpenter.sh/capacity-type` as `"spot"` in your Karpenter NodePool, Karpenter will provision Spot instances by default without any additional configuration. However, to enable advanced Spot-to-Spot consolidation, which replaces underutilized Spot nodes with lower-priced Spot alternatives, you need to enable the SpotToSpotConsolidation [feature gate](https://karpenter.sh/docs/reference/settings/) by setting `--feature-gates SpotToSpotConsolidation=true` in the Karpenter controller arguments or via the `FEATURE_GATES` environment variable. Karpenter uses the [price-capacity-optimized](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet-allocation-strategy.html) allocation strategy to provision EC2 instances. Based on the NodePool requirements and pod constraints, Karpenter bin-packs unschedulable pods and sends a diverse set of instance types to the [Amazon EC2 Fleet API](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-fleet-request-type.html). You can use the [EC2 Instance Type Explorer](https://aws.amazon.com/ec2/instance-explorer/) tool to generate a list of instance types that match your specific compute requirements.
+  **Ensure workloads are stateless, fault-tolerant, and flexible**. Workloads must be stateless, fault-tolerant, and flexible in terms of instance/GPU size. This allows seamless resumption after Spot interruptions, and instance flexibility enables you to stay on Spot longer. Enable [Spot interruption handling](https://karpenter.sh/docs/concepts/disruption/#interruption) in Karpenter by configuring the `settings.interruptionQueue` Helm value with the name of the Amazon SQS queue that receives Spot interruption events. For example, when installing via Helm, use `--set "settings.interruptionQueue=${CLUSTER_NAME}"`. For an example, see the [Getting Started with Karpenter](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/) guide. When Karpenter notices a Spot interruption event, it automatically cordons, taints, drains, and terminates the node(s) ahead of the interruption to maximize the termination grace period of the pods. At the same time, Karpenter immediately starts a new node so it can be ready as soon as possible.
+  **Avoid overly constraining instance type selection**. Avoid constraining instance types as much as possible: with fewer constraints, there is a higher chance of acquiring Spot capacity at large scale, with a lower frequency of Spot Instance interruptions, at a lower cost. For example, avoid limiting to specific types (e.g., `g5.xlarge`); instead, specify a diverse set of instance categories and generations using keys like `karpenter.k8s.aws/instance-category` and `karpenter.k8s.aws/instance-generation`. Karpenter enables easier diversification of On-Demand and Spot capacity across multiple instance types and Availability Zones (AZs). Moreover, if your AI/ML workload requires a specific or limited number of accelerators but is flexible between Regions, you can use Spot Placement Score to dynamically identify the optimal Region to deploy your workload before launch.
+  **Broaden NodePool requirements to include a larger number of similar EC2 instance families**. Every Spot Instance pool consists of unused EC2 capacity for a specific instance type in a specific Availability Zone (AZ). When Karpenter tries to provision a new node, it selects an instance type that matches the NodePool’s requirements. If no compatible instance type has Spot capacity in any AZ, provisioning fails. To avoid this, allow broader G-series instances (generation 4 or higher) from NVIDIA across sizes and AZs, while considering hardware needs like GPU memory or ray tracing. Because instances can be of different types, make sure that your workload is able to run on each type and that the performance you get meets your needs.
+  **Leverage all Availability Zones in a Region**. Available capacity varies by Availability Zone (AZ); a specific instance type might be unavailable in one AZ but plentiful in another. Each unique combination of an instance type and an AZ constitutes a separate Spot capacity pool. By requesting capacity across all AZs in a Region within your Karpenter NodePool requirements, you are effectively searching more pools at once. This maximizes the number of Spot capacity pools and therefore increases the probability of acquiring Spot capacity. To achieve this, in your NodePool configuration either omit the `topology.kubernetes.io/zone` key entirely to allow Karpenter to select from all available AZs in the Region, or explicitly list AZs using `operator: In` with the desired values (e.g., `us-west-2a`).
+  **Consider using Spot Placement Score (SPS) to get visibility into the likelihood of successfully accessing the required capacity using Spot instances**. [Spot Placement Score (SPS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html) provides a score to help you assess how likely a Spot request is to succeed. You first specify your compute requirements for your Spot Instances, and Amazon EC2 returns the top 10 Regions or Availability Zones (AZs) where your Spot request is likely to succeed, scored on a scale from 1 to 10. A score of 10 indicates that your Spot request is highly likely (but not guaranteed) to succeed; a score of 1 indicates that it is unlikely to succeed. The same score might be returned for different Regions or AZs. Because Spot capacity fluctuates continuously, SPS helps you identify which combination of instance types, AZs, and Regions works best for your workload constraints (i.e., flexibility, performance, size, etc.). If your AI/ML workload requires a specific or limited number of accelerators but is flexible between Regions, you can use SPS to dynamically identify the optimal Region to deploy your workload before launch. To help you automatically assess the likelihood of acquiring Spot capacity, we provide guidance for building an SPS tracker dashboard; see [Guidance for Building a Spot Placement Score Tracker Dashboard on AWS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html). This solution monitors SPS scores over time using a YAML configuration for diversified setups (e.g., instance requirements including GPUs), stores metrics in CloudWatch, and provides dashboards to compare configurations. Define dashboards per workload to evaluate vCPU, memory, and GPU needs, ensuring optimal setups for EKS clusters, including the consideration of other AWS Regions. To learn more, see [How Spot placement score works](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-sps-works.html).
+  **Gracefully handle Spot interruptions and test**. For pods with a termination grace period longer than two minutes, the node may be reclaimed before those pods finish terminating, which could impact workload availability. Consider the two-minute Spot interruption notice when designing your applications: implement checkpointing in long-running applications (e.g., saving progress to persistent storage like Amazon S3) to resume after interruptions, extend `terminationGracePeriodSeconds` (the default is 30 seconds) in pod specifications to allow more time for graceful shutdown, and handle interruptions using preStop lifecycle hooks and/or SIGTERM signals within your application for graceful shutdown activities like cleanup, state saving, and connection closure. For real-time workloads, where scaling time is important and the application takes longer than two minutes to be ready to serve traffic, consider optimizing container start-up and ML model loading times by reviewing the [Storage](aiml-storage.md) and [Application Scaling and Performance](aiml-performance.md) best practices. To test how your workload behaves on a replacement node, use [AWS Fault Injection Service](https://aws.amazon.com/fis/) (FIS) to simulate Spot interruptions.
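Several of these interruption-handling settings can be combined in a single pod spec. A minimal sketch; the grace period, image placeholder, and checkpoint script path are illustrative assumptions:

```
apiVersion: v1
kind: Pod
metadata:
  name: spot-friendly-inference
spec:
  terminationGracePeriodSeconds: 120  # Allow up to two minutes for shutdown (default is 30s)
  containers:
  - name: ml-workload
    image: <image>
    lifecycle:
      preStop:
        exec:
          # Illustrative hook: save state and clean up before SIGTERM is sent
          command: ["/bin/sh", "-c", "/app/save-checkpoint.sh"]
    resources:
      limits:
        nvidia.com/gpu: 1
```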

In addition to these core Spot best practices, take the following factors into account when managing GPU workloads on Amazon EKS. Unlike CPU-based workloads, GPU workloads are particularly sensitive to hardware details such as GPU capabilities and available GPU memory, and they may be constrained to fewer instance types than CPU workloads. As a first step, assess whether your workload is instance flexible. If you don’t know how many instance types your workload can use, test them individually to ensure compatibility and functionality. Determine how much you can diversify, confirming that diversification keeps the workload functional and understanding any performance impacts (e.g., on throughput or completion time). As part of diversifying your workloads, consider the following:
+  **Review CUDA and framework compatibility**. Your GPU workloads might be optimized for specific hardware, GPU types (e.g., V100 in p3 vs. A100 in p4), or written for specific CUDA versions for libraries like TensorFlow, so be sure to review compatibility for your workloads. This compatibility is crucial to prevent runtime errors, crashes, failures in GPU acceleration (e.g., mismatched CUDA versions with frameworks like PyTorch or TensorFlow can prevent execution), or the ability to leverage hardware features like FP16/INT8 precision.
+  **GPU Memory**. Be sure to evaluate your model’s memory requirements, profile its memory usage during runtime using tools like the [DCGM Exporter](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html), and set the minimum GPU memory required via well-known labels like `karpenter.k8s.aws/instance-gpu-memory`. GPU VRAM varies across instance types (e.g., NVIDIA T4 has 16GB, A10G has 24GB, V100 has 16-32GB), and ML models (e.g., large language models) can exceed available memory, causing out-of-memory (OOM) errors or crashes. For Spot Instances on EKS, this may limit diversification: you cannot include lower-VRAM instance types if your model does not fit, which reduces the number of accessible capacity pools and increases interruption risk. This is especially relevant for single-GPU, single-node inference (e.g., multiple pods scheduled on the same node to utilize its GPU resources), where only instance types with sufficient VRAM can be included in your Spot configuration.
+  **Floating-point precision and performance**. Not all NVIDIA GPU architectures support the same floating-point precisions (e.g., FP16/INT8). Evaluate the core types (CUDA/Tensor/RT) and floating-point precision required for your workloads. A lower-priced, less performant GPU is not necessarily more cost-effective, so evaluate performance in terms of work completed within a specific time frame to understand the impact of diversification.

 **Scenario: Diversification for real time inference workloads** 

For a real-time online inference workload on Spot Instances, you can configure a Karpenter NodePool to diversify across compatible GPU instance families and generations. This approach ensures high availability by drawing from multiple Spot pools, while maintaining performance through constraints on GPU capabilities, memory, and architecture. It supports falling back to alternatives when instance capacity is constrained, minimizing interruptions and optimizing for inference latency. The example NodePool below selects G- and P-series instances of generations greater than 3 with more than 20GB of GPU memory.

 **Example** 

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-spot
spec:
  template:
    metadata:
      labels:
        role: gpu-spot-worker
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"] # Use Spot Instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"] # Diversifies across G-series and P-series
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"] # Selects instance generations greater than 3
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"] # Specifies AMD64 architecture, compatible with NVIDIA GPUs
        - key: karpenter.k8s.aws/instance-gpu-memory
          operator: Gt
          values: ["20480"] # Ensures more than 20GB (20480 MiB) total GPU memory
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        name: gpu-inference-ec2
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      expireAfter: 720h
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m # Enables consolidation of underutilized nodes after 5 minutes
```

### Implement Checkpointing for Long Running Training Jobs
<a name="_implement_checkpointing_for_long_running_training_jobs"></a>

Checkpointing is a fault-tolerance technique that involves periodically saving the state of a process, allowing it to resume from the last saved point in case of interruptions. In machine learning, it is commonly associated with training, where long-running jobs can save model weights and optimizer states to resume training after failures, such as hardware issues or Spot Instance interruptions.

Checkpoints are snapshots of the model state taken during training, and they can be configured through the callback functions of ML frameworks. You can use a saved checkpoint to restart a training job from the last saved state, preserving progress when the training job or instance is unexpectedly interrupted. In addition to implementing a node resiliency system, we recommend implementing checkpointing to mitigate the impact of interruptions, including those caused by hardware failures or Amazon EC2 Spot Instance interruptions.

Without checkpointing, interruptions can result in wasted compute time and lost progress, which is costly for long-running training jobs. Checkpointing allows jobs to save their state periodically (e.g., model weights and optimizer states) and resume from the last checkpoint (last processed batch) after an interruption. To implement checkpointing, design your application to process data in large batches and save intermediate results to persistent storage, such as an Amazon S3 bucket via the [Mountpoint for Amazon S3 CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html) while the training job progresses.
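As a sketch of that setup, an S3 bucket can be statically provisioned as a volume using the Mountpoint for Amazon S3 CSI Driver, following its static provisioning pattern. The bucket name is a placeholder; see the driver documentation for the full set of mount options:

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-checkpoints-pv
spec:
  capacity:
    storage: 1200Gi # Ignored by the S3 CSI driver, but required by Kubernetes
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-delete # Allow overwriting/cleaning up old checkpoints
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-checkpoints-volume # Must be unique per volume
    volumeAttributes:
      bucketName: my-checkpoint-bucket # Placeholder: your S3 bucket
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-checkpoints-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "" # Required for static provisioning
  resources:
    requests:
      storage: 1200Gi
  volumeName: s3-checkpoints-pv
```

A training pod can then mount this PVC and write checkpoints to the mounted path, persisting them to S3 as the job progresses.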

 **Use cases** 

Checkpointing is particularly beneficial in specific scenarios to balance fault tolerance with performance overhead. Consider using checkpointing in the following cases:
+  **Job duration exceeds a few hours**: For long-running training jobs (e.g., >1-2 hours for small models, or days/weeks for large foundation models with billions of parameters), where progress loss from interruptions is costly. Shorter jobs may not justify the I/O overhead.
+  **For Spot instances or hardware failures**: In environments prone to interruptions, such as EC2 Spot (2-minute notice) or hardware failures (e.g., GPU memory errors), checkpointing enables quick resumption, making Spot viable for cost savings in fault-tolerant workloads.
+  **Distributed training at scale**: For setups with hundreds or thousands of accelerators (e.g., >100 GPUs), where the mean time between failures decreases as the cluster scales. Use checkpointing with model/data parallelism to handle concurrent checkpoint access and avoid complete restarts.
+  **Large-scale models with high resource demands**: In petabyte-scale LLM training, where failures are inevitable due to cluster size; tiered approaches (fast local every 5-30 minutes for transients, durable hourly for major failures) optimize recovery time vs. efficiency.

### Use ML Capacity Blocks for capacity assurance of P and Trainium instances
<a name="_use_ml_capacity_blocks_for_capacity_assurance_of_p_and_trainium_instances"></a>

 [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) allow you to reserve highly sought-after GPU instances, specifically P instances (e.g., p6-b200, p5, p5e, p5en, p4d, p4de) and Trainium instances (e.g., trn1, trn2), to start either almost immediately or on a future date to support your short duration machine learning (ML) workloads. These reservations are ideal for ensuring capacity for compute-intensive tasks like model training and fine-tuning. EC2 Capacity Blocks pricing consists of a reservation fee and an operating system fee. To learn more about pricing, see [EC2 Capacity Blocks for ML pricing](https://aws.amazon.com/ec2/capacityblocks/pricing/).

To reserve GPUs for AI/ML workloads on Amazon EKS with predictable capacity assurance, we recommend leveraging ML Capacity Blocks for short-term reservations, or [On-Demand Capacity Reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) (ODCRs) for general-purpose capacity assurance.
+ ODCRs allow you to reserve EC2 instance capacity (e.g., GPU instances like g5 or p5) in a specific Availability Zone for a duration, ensuring availability, even during high demand. ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. In EKS, ODCRs are supported by node types like [Karpenter](https://karpenter.sh/) and [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). To prioritize ODCRs in Karpenter, configure the NodeClass to use the `capacityReservationSelectorTerms` field. See the [Karpenter NodePools Documentation](https://karpenter.sh/docs/concepts/nodeclasses/#speccapacityreservationselectorterms).
+ Capacity Blocks are a specialized reservation mechanism for GPU (e.g., p5, p4d) or Trainium (trn1, trn2) instances, designed for short-term ML workloads like model training, fine-tuning, or experimentation. You reserve capacity for a defined period (typically 24 hours to 182 days) starting on a future date, paying only for the reserved time. They are pre-paid, require pre-planning for capacity needs, and do not support autoscaling, but they are colocated in EC2 UltraClusters for low-latency networking. To learn more, refer to [Find and purchase Capacity Blocks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-purchase.html), or get started by setting up managed node groups with Capacity Blocks using the instructions in [Create a managed node group with Capacity Blocks for ML](https://docs.aws.amazon.com/eks/latest/userguide/capacity-blocks-mng.html).

Reserve capacity via the AWS Management Console and configure your nodes to use ML capacity blocks. Plan reservations based on workload schedules and test in a staging cluster. Refer to the [Capacity Blocks Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) for more information.

### Consider On-Demand, Amazon EC2 Spot or On-Demand Capacity Reservations (ODCRs) for G Amazon EC2 instances
<a name="_consider_on_demand_amazon_ec2_spot_or_on_demand_capacity_reservations_odcrs_for_g_amazon_ec2_instances"></a>

For G-series Amazon EC2 instances, consider the different purchase options: On-Demand, Amazon EC2 Spot Instances, and On-Demand Capacity Reservations. [ODCRs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allow you to reserve EC2 instance capacity in a specific Availability Zone for a certain duration, ensuring availability even during high demand. Unlike ML Capacity Blocks, which are only available for P and Trainium instances, ODCRs can be used for a wider range of instance types, including G instances, making them suitable for workloads that require different GPU capabilities, such as inference or graphics. When using Amazon EC2 Spot Instances, diversifying across instance types, sizes, and Availability Zones is key to staying on Spot longer.

ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. ODCRs can be created for immediate use or scheduled for a future date, providing flexibility in capacity planning. In Amazon EKS, ODCRs are supported by provisioning options such as [Karpenter](https://karpenter.sh/) and [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). To prioritize ODCRs in Karpenter, configure the EC2NodeClass to use the `capacityReservationSelectorTerms` field. See the [Karpenter NodePools Documentation](https://karpenter.sh/docs/concepts/nodepools/). For more information on creating ODCRs, including CLI commands, refer to the [On-Demand Capacity Reservation Getting Started](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations-getting-started.html).
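As a hedged sketch, targeting an ODCR from Karpenter might look like the following EC2NodeClass. The reservation ID, role name, and discovery tags are placeholders, and the `capacityReservationSelectorTerms` field assumes a Karpenter version with capacity reservation support:

```
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-odcr
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "KarpenterNodeRole-my-cluster"        # placeholder node IAM role
  capacityReservationSelectorTerms:
    - id: "cr-0123456789abcdef0"              # placeholder ODCR ID
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"  # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
```

A NodePool referencing this EC2NodeClass can then allow the reserved capacity type (in versions that support it) so Karpenter prefers the reservation before falling back to other capacity.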

### Consider other accelerated instance types and sizes
<a name="_consider_other_accelerated_instance_types_and_sizes"></a>

Selecting the appropriate accelerated instance and size is essential for optimizing both performance and cost in your ML workloads on Amazon EKS. For example, different GPU instance families have different performance and capabilities such as GPU memory. To help you choose the most price-performant option, review the available GPU instances in the [EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/) page under **Accelerated Computing**. Evaluate multiple instance types and sizes to find the best fit for your specific workload requirements. Consider factors such as the number of GPUs, memory, and network performance. By carefully selecting the right GPU instance type and size, you can achieve better resource utilization and cost efficiency in your EKS clusters.

If an EKS node uses a GPU instance, it will have the `nvidia-device-plugin-daemonset` pod in the `kube-system` namespace by default. To get a quick sense of whether you are fully utilizing the GPU(s) on your instance, you can use [nvidia-smi](https://docs.nvidia.com/deploy/nvidia-smi/index.html) as shown here:

```
kubectl exec nvidia-device-plugin-daemonset-xxxxx \
  -n kube-system -- nvidia-smi \
  --query-gpu=index,power.draw,power.limit,temperature.gpu,utilization.gpu,utilization.memory,memory.free,memory.used \
  --format=csv -l 5
```
+ If `utilization.memory` is close to 100%, your workload is likely memory bound. The GPU memory is fully utilized, which may indicate that further performance optimization is worth investigating.
+ If `utilization.gpu` is close to 100%, that does not necessarily mean the GPU is fully utilized. A better metric to look at is the ratio of `power.draw` to `power.limit`. If this ratio is 100% or more, your workload is fully utilizing the compute capacity of the GPU.
+ The `-l 5` flag tells `nvidia-smi` to output the metrics every 5 seconds. On a single-GPU instance type, the `index` query field is not needed.

To learn more, see [GPU instances](https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html) in AWS documentation.

### Optimize GPU Resource Allocation with Time-Slicing, MIG, and Fractional GPU Allocation
<a name="_optimize_gpu_resource_allocation_with_time_slicing_mig_and_fractional_gpu_allocation"></a>

Static resource limits in Kubernetes (e.g., CPU, memory, GPU counts) can lead to over-provisioning or underutilization, particularly for dynamic AI/ML workloads like inference. For low-volume or spiky workloads, allowing multiple workloads to share a single GPU's compute resources can improve efficiency and reduce waste. GPU sharing can be achieved through different options:
+  **Leverage Node Selectors / Node affinity to influence scheduling**: Ensure the nodes provisioned and pods are scheduled on the appropriate GPUs for the workload (e.g., `karpenter.k8s.aws/instance-gpu-name: "a100"`)
+  **Time-Slicing**: Schedules workloads to share a GPU’s compute resources over time, allowing concurrent execution without physical partitioning. This is ideal for workloads with variable compute demands, but may lack memory isolation.
+  **Multi-Instance GPU (MIG)**: MIG allows a single NVIDIA GPU to be partitioned into multiple, isolated instances and is supported on NVIDIA Ampere (e.g., A100), NVIDIA Hopper (e.g., H100), and NVIDIA Blackwell GPUs. Each MIG instance receives dedicated compute and memory resources, enabling resource sharing in multi-tenant environments or for workloads that require resource guarantees. This lets you optimize GPU resource utilization, for example by serving multiple models with different batch sizes on isolated partitions.
+  **Fractional GPU Allocation**: Uses software-based scheduling to allocate portions of a GPU’s compute or memory to workloads, offering flexibility for dynamic workloads. The [NVIDIA KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler), part of the Run:ai platform, enables this by allowing pods to request fractional GPU resources.

To enable these features in EKS, you can deploy the NVIDIA Device Plugin, which exposes GPUs as schedulable resources and supports time-slicing and MIG. To learn more, see [Time-Slicing GPUs in Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) and [GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances](https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/).

 **Example** 

For example, to enable time-slicing with the NVIDIA Device Plugin:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 pods to share each GPU
```

 **Example** 

For example, to use KAI Scheduler for fractional GPU allocation, deploy it alongside the NVIDIA GPU Operator and specify fractional GPU resources in the pod spec:

```
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod-example
  annotations:
    gpu-fraction: "0.5"  # Annotation for 50% GPU
  labels:
    runai/queue: "default"  # Required queue assignment
spec:
  containers:
  - name: ml-workload
    image: nvcr.io/nvidia/pytorch:25.04-py3
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  schedulerName: kai-scheduler
```

## Node Resiliency and Training Job Management
<a name="_node_resiliency_and_training_job_management"></a>

### Implement Node Health Checks with Automated Recovery
<a name="_implement_node_health_checks_with_automated_recovery"></a>

For distributed training jobs on Amazon EKS that require frequent inter-node communication, such as multi-GPU model training across multiple nodes, hardware issues like GPU or EFA failures can cause disruptions to training jobs. These disruptions can lead to loss of training progress and increased costs, particularly for long-running AI/ML workloads that rely on stable hardware.

To help add resilience against hardware failures, such as GPU failures in EKS clusters running GPU workloads, we recommend leveraging either the **EKS Node Monitoring Agent** with Auto Repair or **Amazon SageMaker HyperPod**. While the EKS Node Monitoring Agent with Auto Repair provides features like node health monitoring and auto-repair using standard Kubernetes mechanisms, SageMaker HyperPod offers targeted resilience and additional features specifically designed for large-scale ML training, such as deep health checks and automatic job resumption.
+ The [EKS Node Monitoring Agent](https://docs.aws.amazon.com/eks/latest/userguide/node-health.html) with Node Auto Repair continuously monitors node health by reading logs and applying NodeConditions, including standard conditions like `Ready` and conditions specific to accelerated hardware to identify issues like GPU or networking failures. When a node is deemed unhealthy, Node Auto Repair cordons it and replaces it with a new node. The rescheduling of pods and restarting of jobs rely on standard Kubernetes mechanisms and the job’s restart policy.
+ The [SageMaker HyperPod](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US) deep health checks and health-monitoring agent continuously monitor the health status of GPU and Trainium-based instances. HyperPod is tailored for AI/ML workloads, using labels (e.g., node-health-status) to manage node health. When a node is deemed unhealthy, HyperPod triggers automatic replacement of the faulty hardware, such as GPUs. It detects networking-related failures for EFA through its basic health checks by default and supports auto-resume for interrupted training jobs, allowing jobs to continue from the last checkpoint, minimizing disruptions for large-scale ML tasks.

For both EKS Node Monitoring Agent with Auto Repair and SageMaker HyperPod clusters using EFA, to monitor EFA-specific metrics such as Remote Direct Memory Access (RDMA) errors and packet drops, make sure the [AWS EFA](https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html) driver is installed. In addition, we recommend deploying the [CloudWatch Observability Add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html) or using tools like DCGM Exporter with Prometheus and Grafana to monitor EFA, GPU, and, for SageMaker HyperPod, specific metrics related to its features.
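As one hedged example (the add-on and field names assume a recent eksctl release), enabling the node monitoring agent and node auto repair with eksctl might look like this:

```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster
  region: us-west-2
addons:
  - name: eks-node-monitoring-agent   # surfaces accelerated-hardware NodeConditions
managedNodeGroups:
  - name: gpu-training-nodes
    instanceType: p4d.24xlarge
    desiredCapacity: 2
    nodeRepairConfig:
      enabled: true                   # let EKS replace nodes the agent marks unhealthy
```

With auto repair enabled, rescheduling of the affected pods still follows standard Kubernetes behavior, so pair this with an appropriate job restart policy and checkpointing.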

### Disable Karpenter Consolidation for interruption sensitive Workloads
<a name="_disable_karpenter_consolidation_for_interruption_sensitive_workloads"></a>

For workloads sensitive to interruptions, such as large-scale AI/ML data processing, prediction, or training jobs, we recommend tuning [Karpenter consolidation policies](https://karpenter.sh/v1.0/concepts/disruption/#consolidation) to prevent disruptions during job execution. Karpenter’s consolidation feature automatically optimizes cluster costs by terminating underutilized nodes or replacing them with lower-priced alternatives. However, even when a workload fully utilizes a GPU, Karpenter may consolidate nodes if it identifies a lower-priced, right-sized instance type that meets the pod’s requirements, leading to job interruptions.

The `WhenEmptyOrUnderutilized` consolidation policy may terminate nodes prematurely, leading to longer execution times. For example, interruptions may delay job resumption due to pod rescheduling and data reloading, which could be costly for long-running batch inference jobs. To mitigate this, you can set the `consolidationPolicy` to `WhenEmpty` and configure a `consolidateAfter` duration, such as 1 hour, to retain nodes during workload spikes. For example:

```
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 60m
```

This approach improves pod startup latency for spiky batch inference workloads and other interruption-sensitive jobs, such as real-time online inference data processing or model training, where the cost of interruption outweighs compute cost savings. Karpenter [NodePool Disruption Budgets](https://karpenter.sh/docs/concepts/disruption/#nodepool-disruption-budgets) is another feature for managing Karpenter disruptions. With budgets, you can make sure that no more than a certain number of nodes will be disrupted in the chosen NodePool at a point in time. You can also use disruption budgets to prevent all nodes from being disrupted at a certain time (e.g. peak hours). To learn more, see [Karpenter Consolidation](https://karpenter.sh/docs/concepts/disruption/#consolidation) documentation.

### Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs
<a name="_use_ttlsecondsafterfinished_to_auto_clean_up_kubernetes_jobs"></a>

We recommend setting `ttlSecondsAfterFinished` for Kubernetes jobs in Amazon EKS to automatically delete completed job objects. Lingering job objects consume cluster resources, such as API server memory, and complicate monitoring by cluttering dashboards (e.g., Grafana, Amazon CloudWatch). For example, setting a TTL of 1 hour ensures jobs are removed shortly after completion, keeping your cluster tidy. For more details, refer to [Automatic Cleanup for Finished Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/).
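For example, a Job that is deleted one hour after it finishes (the container image and command are placeholders):

```
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-job
spec:
  ttlSecondsAfterFinished: 3600  # delete the Job object 1 hour after it completes or fails
  template:
    spec:
      containers:
      - name: train
        image: public.ecr.aws/docker/library/python:3.11   # placeholder training image
        command: ["python", "-c", "print('training complete')"]
      restartPolicy: Never
```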

### Configure Low-Priority Job Preemption for Higher-Priority Jobs/workloads
<a name="_configure_low_priority_job_preemption_for_higher_priority_jobsworkloads"></a>

For mixed-priority AI/ML workloads on Amazon EKS, you may configure low-priority job preemption to ensure higher-priority tasks (e.g., real-time inference) receive resources promptly. Without preemption, low-priority workloads such as batch processes (e.g., batch inference, data processing), non-batch services (e.g., background tasks, cron jobs), or CPU/memory-intensive jobs (e.g., web services) can delay critical pods by occupying nodes. Preemption allows Kubernetes to evict low-priority pods when high-priority pods need resources, ensuring efficient resource allocation on nodes with GPUs, CPUs, or memory. We recommend using Kubernetes `PriorityClass` to assign priorities and `PodDisruptionBudget` to control eviction behavior.

```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Priority class for preemptible batch workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference-pod
spec:
  priorityClassName: low-priority
  containers:
  - name: batch-worker
    image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
    command: ["sleep", "3600"]
```
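To control voluntary evictions of the low-priority workload, a `PodDisruptionBudget` can be added alongside the `PriorityClass`; a minimal sketch with a placeholder label selector (note that preemption honors PDBs on a best-effort basis only):

```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-inference-pdb
spec:
  minAvailable: 1            # keep at least one batch pod running during voluntary disruptions
  selector:
    matchLabels:
      app: batch-inference   # placeholder label for the low-priority workload
```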

See the [Kubernetes Priority and Preemption Documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) for more information.

## Application Scaling and Performance
<a name="_application_scaling_and_performance"></a>

### Tailor Compute Capacity for ML workloads with Karpenter or Static Nodes
<a name="_tailor_compute_capacity_for_ml_workloads_with_karpenter_or_static_nodes"></a>

To ensure cost-efficient and responsive compute capacity for machine learning (ML) workflows on Amazon EKS, we recommend tailoring your node provisioning strategy to your workload’s characteristics and cost commitments. Below are two approaches to consider: just-in-time scaling with [Karpenter](https://karpenter.sh/docs/) and static node groups for reserved capacity.
+  **Just-in-time data plane scalers like Karpenter**: For dynamic ML workflows with variable compute demands (e.g., GPU-based inference followed by CPU-based plotting), we recommend using just-in-time data plane scalers like Karpenter.
+  **Use static node groups for predictable workloads**: For predictable, steady-state ML workloads or when using Reserved instances, [EKS managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html) can help ensure reserved capacity is fully provisioned and utilized, maximizing savings. This approach is ideal for specific instance types committed via RIs or ODCRs.

 **Example** 

This is an example of a diverse Karpenter [NodePool](https://karpenter.sh/docs/concepts/nodepools/) that enables launching of `g` Amazon EC2 instances where instance generation is greater than three.

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "1000"
    memory: "4000Gi"
    nvidia.com/gpu: "10"  # Limit the total number of GPUs to 10 for the NodePool
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60m
    expireAfter: 720h
```

 **Example** 

Example using static node groups for a training workload:

```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster
  region: us-west-2
managedNodeGroups:
  - name: gpu-node-group
    instanceType: p4d.24xlarge
    minSize: 2
    maxSize: 2
    desiredCapacity: 2
    taints:
      - key: nvidia.com/gpu
        effect: NoSchedule
```

### Use taints and tolerations to prevent non-accelerated workloads from being scheduled on accelerated instances
<a name="_use_taints_and_tolerations_to_prevent_non_accelerated_workloads_from_being_scheduled_on_accelerated_instances"></a>

Scheduling non-accelerated workloads on GPU instances is not compute-efficient. We recommend using taints and tolerations to ensure that pods from non-accelerated workloads are not scheduled on inappropriate nodes. See the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) for more information.
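As a sketch, GPU nodes tainted with `nvidia.com/gpu` (as in the NodePool examples above) are only schedulable by pods that carry a matching toleration; the image below is a placeholder:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder CUDA image
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

Pods without this toleration are repelled by the taint, so CPU-only workloads land on non-accelerated nodes.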

### Scale Based on Model Performance
<a name="_scale_based_on_model_performance"></a>

For inference workloads, we recommend using Kubernetes Event-Driven Autoscaling (KEDA) to scale based on model performance metrics like inference requests or token throughput, with appropriate cooldown periods. Static scaling policies may over- or under-provision resources, impacting cost and latency. Learn more in the [KEDA Documentation](https://keda.sh/).
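A hedged sketch of a KEDA `ScaledObject` scaling an inference Deployment on a Prometheus-scraped request-rate metric (the Deployment name, Prometheus address, and metric query are placeholders for your environment):

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-inference      # placeholder Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300          # wait 5 minutes before scaling back down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # placeholder Prometheus endpoint
      query: sum(rate(inference_requests_total[2m]))     # placeholder request-rate metric
      threshold: "100"
```

The `cooldownPeriod` prevents thrashing on spiky traffic; tune the threshold against your model's per-replica throughput.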

## Dynamic resource allocation for advanced GPU management
<a name="aiml-dra"></a>

 [Dynamic resource allocation (DRA)](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation) represents a fundamental advancement in Kubernetes GPU resource management. DRA moves beyond traditional device plugin limitations to enable sophisticated GPU sharing, topology awareness, and cross-node resource coordination. Available in Amazon EKS [version 1.33](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html#kubernetes-1-33), DRA addresses critical challenges in AI/ML workloads by providing the following:
+ Fine-grained GPU allocation
+ Advanced sharing mechanisms, such as Multi-Process service (MPS) and Multi-Instance GPU (MIG)
+ Support for next-generation hardware architectures, including NVIDIA GB200 UltraServers

Traditional GPU allocation treats GPUs as opaque integer resources, creating significant under-utilization (often 30-40% in production clusters). This occurs because workloads receive exclusive access to entire GPUs even when requiring only fractional resources. DRA transforms this model by introducing structured, declarative allocation that provides the Kubernetes scheduler with complete visibility into hardware characteristics and workload requirements. This enables intelligent placement decisions and efficient resource sharing.

### Advantages of using DRA instead of NVIDIA device plugin
<a name="_advantages_of_using_dra_instead_of_nvidia_device_plugin"></a>

The NVIDIA device plugin (starting from version `0.12.0`) supports GPU sharing mechanisms including time-slicing, MPS, and MIG. However, architectural limitations exist that DRA addresses.

 **NVIDIA device plugin limitations** 
+  **Static configuration:** GPU sharing configurations (time-slicing replicas and MPS settings) require pre-configuration cluster-wide through `ConfigMaps`. This makes providing different sharing strategies for different workloads difficult.
+  **Limited granular selection:** While the device plugin exposes GPU characteristics through node labels, workloads cannot dynamically request specific GPU configurations (memory size and compute capabilities) as part of the scheduling decision.
+  **No cross-node resource coordination:** Cannot manage distributed GPU resources across multiple nodes or express complex topology requirements like NVLink domains for systems like NVIDIA GB200.
+  **Scheduler constraints:** The Kubernetes scheduler treats GPU resources as opaque integers, limiting its ability to make topology-aware decisions or handle complex resource dependencies.
+  **Configuration complexity:** Setting up different sharing strategies requires multiple `ConfigMaps` and careful node labeling, creating operational complexity.

 **Solutions with DRA** 
+  **Dynamic resource selection:** DRA allows workloads to specify detailed requirements (GPU memory, driver versions, and specific attributes) at request time through `resourceclaims`. This enables more flexible resource matching.
+  **Topology awareness:** Through structured parameters and device selectors, DRA handles complex requirements like cross-node GPU communication and memory-coherent interconnects.
+  **Cross-node resource management:** `computeDomains` enable coordination of distributed GPU resources across multiple nodes, critical for systems like GB200 with IMEX channels.
+  **Workload-specific configuration:** Each `ResourceClaim` specifies different sharing strategies and configurations, allowing fine-grained control per workload rather than cluster-wide settings.
+  **Enhanced scheduler integration:** DRA provides the scheduler with detailed device information and enables more intelligent placement decisions based on hardware topology and resource characteristics.

Important: DRA does not replace the NVIDIA device plugin entirely. The NVIDIA DRA driver works alongside the device plugin to provide enhanced capabilities. The device plugin continues to handle basic GPU discovery and management, while DRA adds advanced allocation and scheduling features.

### Instances supported by DRA and their features
<a name="_instances_supported_by_dra_and_their_features"></a>

DRA support varies by Amazon EC2 instance family and GPU architecture, as shown in the following table.


| Instance family | GPU type | Time-slicing | MIG support | MPS support | IMEX support | Use cases | 
| --- | --- | --- | --- | --- | --- | --- | 
|  G5  |  NVIDIA A10G  |  Yes  |  No  |  Yes  |  No  |  Inference and graphics workloads  | 
|  G6  |  NVIDIA L4  |  Yes  |  No  |  Yes  |  No  |  AI inference and video processing  | 
|  G6e  |  NVIDIA L40S  |  Yes  |  No  |  Yes  |  No  |  Training, inference, and graphics  | 
|  P4d/P4de  |  NVIDIA A100  |  Yes  |  Yes  |  Yes  |  No  |  Large-scale training and HPC  | 
|  P5  |  NVIDIA H100  |  Yes  |  Yes  |  Yes  |  No  |  Foundation model training  | 
|  P6  |  NVIDIA B200  |  Yes  |  Yes  |  Yes  |  No  |  Billion or trillion-parameter models, distributed training, and inference  | 
|  P6e  |  NVIDIA GB200  |  Yes  |  Yes  |  Yes  |  Yes  |  Billion or trillion-parameter models, distributed training, and inference  | 

The following are descriptions of each feature in the table:
+  **Time-slicing**: Allows multiple workloads to share GPU compute resources over time.
+  **Multi-Instance GPU (MIG)**: Hardware-level partitioning that creates isolated GPU instances.
+  **Multi-Process service (MPS)**: Enables concurrent execution of multiple CUDA processes on a single GPU.
+  **Internode Memory Exchange (IMEX)**: Memory-coherent communication across nodes for GB200 UltraServers.

### Additional resources
<a name="_additional_resources"></a>

For more information about Kubernetes DRA and NVIDIA DRA drivers, see the following resources on GitHub:
+ Kubernetes [dynamic-resource-allocation](https://github.com/kubernetes/dynamic-resource-allocation) 
+  [Kubernetes enhancement proposal for DRA](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation) 
+  [NVIDIA DRA Driver for GPUs](https://github.com/NVIDIA/k8s-dra-driver-gpu) 
+  [NVIDIA DRA examples and quickstart](https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/demo/specs/quickstart) 

### Set up dynamic resource allocation for advanced GPU management
<a name="aiml-dra-setup"></a>

The following topic shows you how to set up dynamic resource allocation (DRA) for advanced GPU management.

#### Prerequisites
<a name="aiml-dra-prereqs"></a>

Before implementing DRA on Amazon EKS, ensure your environment meets the following requirements.

##### Cluster configuration
<a name="aiml-dra-configuration"></a>
+ Amazon EKS cluster running version `1.33` or later
+ Amazon EKS managed node groups (DRA is currently supported only by managed node groups with AL2023 and Bottlerocket NVIDIA optimized AMIs, [not with Karpenter](https://github.com/kubernetes-sigs/karpenter/issues/1231))
+ NVIDIA GPU-enabled worker nodes with appropriate instance types

##### Required components
<a name="aiml-dra-components"></a>
+ NVIDIA device plugin version `0.17.1` or later
+ NVIDIA DRA driver version `25.3.0` or later

#### Step 1: Create cluster with DRA-enabled node group using eksctl
<a name="aiml-dra-create-cluster"></a>

1. Create a cluster configuration file named `dra-eks-cluster.yaml`:

   ```
   ---
   apiVersion: eksctl.io/v1alpha5
   kind: ClusterConfig
   
   metadata:
     name: dra-eks-cluster
     region: us-west-2
     version: '1.33'
   
   managedNodeGroups:
   - name: gpu-dra-nodes
     amiFamily: AmazonLinux2023
     instanceType: g6.12xlarge
     desiredCapacity: 2
     minSize: 1
     maxSize: 3
   
     labels:
       node-type: "gpu-dra"
       nvidia.com/gpu.present: "true"
   
     taints:
     - key: nvidia.com/gpu
       value: "true"
       effect: NoSchedule
   ```

1. Create the cluster:

   ```
   eksctl create cluster -f dra-eks-cluster.yaml
   ```

#### Step 2: Deploy the NVIDIA device plugin
<a name="aiml-dra-nvidia-plugin"></a>

Deploy the NVIDIA device plugin to enable basic GPU discovery:

1. Add the NVIDIA device plugin Helm repository:

   ```
   helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
   helm repo update
   ```

1. Create custom values for the device plugin:

   ```
   cat <<EOF > nvidia-device-plugin-values.yaml
   gfd:
     enabled: true
   nfd:
     enabled: true
   tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   EOF
   ```

1. Install the NVIDIA device plugin:

   ```
   helm install nvidia-device-plugin nvidia/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version v0.17.1 \
    --values nvidia-device-plugin-values.yaml
   ```

#### Step 3: Deploy NVIDIA DRA driver Helm chart
<a name="aiml-dra-helm-chart"></a>

1. Create a `dra-driver-values.yaml` values file for the DRA driver:

   ```
   ---
   nvidiaDriverRoot: /
   
   gpuResourcesEnabledOverride: true
   
   resources:
     gpus:
       enabled: true
     computeDomains:
       enabled: true  # Enable for GB200 IMEX support
   
   controller:
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   
   kubeletPlugin:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: "nvidia.com/gpu.present"
               operator: In
               values: ["true"]
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Add the NVIDIA NGC Helm repository:

   ```
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   ```

1. Install the NVIDIA DRA driver:

   ```
   helm install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0-rc.2" \
    --namespace nvidia-dra-driver \
    --create-namespace \
    --values dra-driver-values.yaml
   ```

#### Step 4: Verify the DRA installation
<a name="aiml-dra-verify"></a>

1. Verify that the DRA API resources are available:

   ```
   kubectl api-resources | grep resource.k8s.io/v1beta1
   ```

   The following is the expected output:

   ```
   deviceclasses            resource.k8s.io/v1beta1   false   DeviceClass
   resourceclaims           resource.k8s.io/v1beta1   true    ResourceClaim
   resourceclaimtemplates   resource.k8s.io/v1beta1   true    ResourceClaimTemplate
   resourceslices           resource.k8s.io/v1beta1   false   ResourceSlice
   ```

1. Check the available device classes:

   ```
   kubectl get deviceclasses
   ```

   The following is an example of expected output:

   ```
   NAME                                        AGE
   compute-domain-daemon.nvidia.com            4h39m
   compute-domain-default-channel.nvidia.com   4h39m
   gpu.nvidia.com                              4h39m
   mig.nvidia.com                              4h39m
   ```

   When a newly created G6 GPU instance joins your Amazon EKS cluster with DRA enabled, the following actions occur:
   + The NVIDIA DRA driver automatically discovers the A10G GPU and creates two `resourceslices` on that node.
   + The `gpu.nvidia.com` slice registers the physical A10G GPU device with its specifications (memory, compute capability, and more).
   + Since A10G doesn’t support MIG partitioning, the `compute-domain.nvidia.com` slice creates a single compute domain representing the entire compute context of the GPU.
   + These `resourceslices` are then published to the Kubernetes API server, making the GPU resources available for scheduling through `resourceclaims`.

     The DRA scheduler can now intelligently allocate this GPU to Pods that request GPU resources through `resourceclaimtemplates`, providing more flexible resource management compared to traditional device plugin approaches. This happens automatically without manual intervention. The node simply becomes available for GPU workloads once the DRA driver completes the resource discovery and registration process.

     When you run the following command:

     ```
     kubectl get resourceslices
     ```

     The following is an example of expected output:

     ```
     NAME                                                          NODE                             DRIVER                       POOL                             AGE
     ip-100-64-129-47.ec2.internal-compute-domain.nvidia.com-rwsts ip-100-64-129-47.ec2.internal    compute-domain.nvidia.com    ip-100-64-129-47.ec2.internal    35m
     ip-100-64-129-47.ec2.internal-gpu.nvidia.com-6kndg            ip-100-64-129-47.ec2.internal    gpu.nvidia.com               ip-100-64-129-47.ec2.internal    35m
     ```

Continue to [Schedule a simple GPU workload using dynamic resource allocation](#aiml-dra-workload).

### Schedule a simple GPU workload using dynamic resource allocation
<a name="aiml-dra-workload"></a>

To schedule a simple GPU workload using dynamic resource allocation (DRA), do the following steps. Before proceeding, make sure you have followed [Set up dynamic resource allocation for advanced GPU management](#aiml-dra-setup).

1. Create a basic `ResourceClaimTemplate` for GPU allocation with a file named `basic-gpu-claim-template.yaml`:

   ```
   ---
   apiVersion: v1
   kind: Namespace
   metadata:
     name: gpu-test1
   
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     namespace: gpu-test1
     name: single-gpu
   spec:
     spec:
       devices:
         requests:
         - name: gpu
           deviceClassName: gpu.nvidia.com
   ```

1. Apply the template:

   ```
   kubectl apply -f basic-gpu-claim-template.yaml
   ```

1. Verify the status:

   ```
   kubectl get resourceclaimtemplates -n gpu-test1
   ```

   The following is example output:

   ```
   NAME         AGE
   single-gpu   9m16s
   ```

1. Create a Pod that uses the `ResourceClaimTemplate` with a file named `basic-gpu-pod.yaml`:

   ```
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     namespace: gpu-test1
     name: gpu-pod
     labels:
       app: pod
   spec:
     containers:
     - name: ctr0
       image: ubuntu:22.04
       command: ["bash", "-c"]
       args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: gpu0
     resourceClaims:
     - name: gpu0
       resourceClaimTemplateName: single-gpu
     nodeSelector:
       NodeGroupType: gpu-dra
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: "nvidia.com/gpu"
       operator: "Exists"
       effect: "NoSchedule"
   ```

1. Apply and monitor the Pod:

   ```
   kubectl apply -f basic-gpu-pod.yaml
   ```

1. Check the Pod status:

   ```
   kubectl get pod -n gpu-test1
   ```

   The following is example expected output:

   ```
   NAME      READY   STATUS    RESTARTS   AGE
   gpu-pod   1/1     Running   0          13m
   ```

1. Check the `ResourceClaim` status:

   ```
   kubectl get resourceclaims -n gpu-test1
   ```

   The following is example expected output:

   ```
   NAME                 STATE                AGE
   gpu-pod-gpu0-l76cg   allocated,reserved   9m6s
   ```

1. View Pod logs to see GPU information:

   ```
   kubectl logs gpu-pod -n gpu-test1
   ```

   The following is example expected output:

   ```
   GPU 0: NVIDIA L4 (UUID: GPU-da7c24d7-c7e3-ed3b-418c-bcecc32af7c5)
   ```

Continue to [GPU optimization techniques with dynamic resource allocation](#aiml-dra-optimization) for more advanced GPU optimization techniques using DRA.

### GPU optimization techniques with dynamic resource allocation
<a name="aiml-dra-optimization"></a>

Modern GPU workloads require sophisticated resource management to achieve optimal utilization and cost efficiency. DRA enables several advanced optimization techniques that address different use cases and hardware capabilities:
+  **Time-slicing** allows multiple workloads to share GPU compute resources over time, making it ideal for inference workloads with sporadic GPU usage. For an example, see [Optimize GPU workloads with time-slicing](#aiml-dra-timeslicing).
+  **Multi-Process Service (MPS)** enables concurrent execution of multiple CUDA processes on a single GPU with better isolation than time-slicing. For an example, see [Optimize GPU workloads with MPS](#aiml-dra-mps).
+  **Multi-Instance GPU (MIG)** provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources. For an example, see [Optimize GPU workloads with Multi-Instance GPU](#aiml-dra-mig).
+  **Internode Memory Exchange (IMEX)** enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 systems. For an example, see [Optimize GPU workloads with IMEX using GB200 P6e instances](#aiml-dra-imex).

These techniques can significantly improve resource utilization. Organizations report GPU utilization increases from 30-40% with traditional allocation to 80-90% with optimized sharing strategies. The choice of technique depends on workload characteristics, isolation requirements, and hardware capabilities.
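As a rough illustration of how these trade-offs map to a choice, the selection logic can be sketched as follows. This helper and its parameter names are hypothetical, not part of any NVIDIA or Kubernetes API:

```python
# Illustrative decision helper (not an NVIDIA or EKS API): picks a GPU
# sharing strategy from the workload traits described above.
def choose_sharing_strategy(needs_hw_isolation: bool,
                            concurrent_cuda_processes: bool,
                            multi_node_memory_coherence: bool) -> str:
    if multi_node_memory_coherence:
        return "IMEX"          # distributed training on GB200 systems
    if needs_hw_isolation:
        return "MIG"           # dedicated compute and memory partitions
    if concurrent_cuda_processes:
        return "MPS"           # concurrent CUDA processes on one GPU
    return "TimeSlicing"       # sporadic inference workloads

print(choose_sharing_strategy(False, False, False))  # TimeSlicing
```

The mapping is a simplification; in practice, hardware capability (for example, MIG requires A100/H100-class GPUs) constrains the choice as well.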

#### Optimize GPU workloads with time-slicing
<a name="aiml-dra-timeslicing"></a>

Time-slicing enables multiple workloads to share GPU compute resources by scheduling them to run sequentially on the same physical GPU. It is ideal for inference workloads with sporadic GPU usage.

Complete the following steps.

1. Define a `ResourceClaimTemplate` for time-slicing with a file named `timeslicing-claim-template.yaml`:

   ```
   ---
   apiVersion: v1
   kind: Namespace
   metadata:
     name: timeslicing-gpu
   
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: timeslicing-gpu-template
     namespace: timeslicing-gpu
   spec:
     spec:
       devices:
         requests:
         - name: shared-gpu
           deviceClassName: gpu.nvidia.com
         config:
         - requests: ["shared-gpu"]
           opaque:
             driver: gpu.nvidia.com
             parameters:
               apiVersion: resource.nvidia.com/v1beta1
               kind: GpuConfig
               sharing:
                 strategy: TimeSlicing
   ```

1. Define a Pod using time-slicing with a file named `timeslicing-pod.yaml`:

   ```
   ---
   # Pod 1 - Inference workload
   apiVersion: v1
   kind: Pod
   metadata:
     name: inference-pod-1
     namespace: timeslicing-gpu
     labels:
       app: gpu-inference
   spec:
     restartPolicy: Never
     containers:
     - name: inference-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "-c"]
       args:
       - |
         import torch
         import time
         import os
         print(f"=== POD 1 STARTING ===")
         print(f"GPU available: {torch.cuda.is_available()}")
         print(f"GPU count: {torch.cuda.device_count()}")
         if torch.cuda.is_available():
             device = torch.cuda.current_device()
             print(f"Current GPU: {torch.cuda.get_device_name(device)}")
             print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
             # Simulate inference workload
             for i in range(20):
                 x = torch.randn(1000, 1000).cuda()
                 y = torch.mm(x, x.t())
                 print(f"Pod 1 - Iteration {i+1} completed at {time.strftime('%H:%M:%S')}")
                 time.sleep(60)
         else:
             print("No GPU available!")
             time.sleep(5)
       resources:
         claims:
         - name: shared-gpu-claim
     resourceClaims:
     - name: shared-gpu-claim
       resourceClaimTemplateName: timeslicing-gpu-template
     nodeSelector:
       NodeGroupType: "gpu-dra"
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   
   
   ---
   # Pod 2 - Training workload
   apiVersion: v1
   kind: Pod
   metadata:
     name: training-pod-2
     namespace: timeslicing-gpu
     labels:
       app: gpu-training
   spec:
     restartPolicy: Never
     containers:
     - name: training-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "-c"]
       args:
       - |
         import torch
         import time
         import os
         print(f"=== POD 2 STARTING ===")
         print(f"GPU available: {torch.cuda.is_available()}")
         print(f"GPU count: {torch.cuda.device_count()}")
         if torch.cuda.is_available():
             device = torch.cuda.current_device()
             print(f"Current GPU: {torch.cuda.get_device_name(device)}")
             print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
             # Simulate training workload with heavier compute
             for i in range(15):
                 x = torch.randn(2000, 2000).cuda()
                 y = torch.mm(x, x.t())
                 loss = torch.sum(y)
                 print(f"Pod 2 - Training step {i+1}, Loss: {loss.item():.2f} at {time.strftime('%H:%M:%S')}")
                 time.sleep(5)
         else:
             print("No GPU available!")
             time.sleep(60)
       resources:
         claims:
         - name: shared-gpu-claim-2
     resourceClaims:
     - name: shared-gpu-claim-2
       resourceClaimTemplateName: timeslicing-gpu-template
     nodeSelector:
       NodeGroupType: "gpu-dra"
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Apply the template and Pod:

   ```
   kubectl apply -f timeslicing-claim-template.yaml
   kubectl apply -f timeslicing-pod.yaml
   ```

1. Monitor resource claims:

   ```
   kubectl get resourceclaims -n timeslicing-gpu -w
   ```

   The following is example output:

   ```
   NAME                                      STATE                AGE
   inference-pod-1-shared-gpu-claim-9p97x    allocated,reserved   21s
   training-pod-2-shared-gpu-claim-2-qghnb   pending              21s
   inference-pod-1-shared-gpu-claim-9p97x    pending              105s
   training-pod-2-shared-gpu-claim-2-qghnb   pending              105s
   inference-pod-1-shared-gpu-claim-9p97x    pending              105s
   training-pod-2-shared-gpu-claim-2-qghnb   allocated,reserved   105s
   inference-pod-1-shared-gpu-claim-9p97x    pending              105s
   ```

First Pod (`inference-pod-1`)
+  **State**: `allocated,reserved` 
+  **Meaning**: DRA found an available GPU and reserved it for this Pod
+  **Pod status**: Starts running immediately

Second Pod (`training-pod-2`)
+  **State**: `pending` 
+  **Meaning**: Waiting for DRA to configure time-slicing on the same GPU
+  **Pod status**: Waiting to be scheduled
+ The claim transitions from `pending` to `allocated,reserved` once time-slicing is configured, and the Pod then starts running
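If you script around this watch loop, the claim states can be grouped from the `kubectl get resourceclaims` output. The following sketch parses the sample output above; the column layout is assumed from that sample:

```python
# Illustrative: parse `kubectl get resourceclaims` output (as shown above)
# to group claim names by state. Column layout assumed from the sample.
sample = """\
NAME                                      STATE                AGE
inference-pod-1-shared-gpu-claim-9p97x    allocated,reserved   21s
training-pod-2-shared-gpu-claim-2-qghnb   pending              21s
"""

def claims_by_state(output: str) -> dict:
    states = {}
    for line in output.strip().splitlines()[1:]:   # skip the header row
        name, state = line.split()[:2]
        states.setdefault(state, []).append(name)
    return states

print(claims_by_state(sample))
```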

#### Optimize GPU workloads with MPS
<a name="aiml-dra-mps"></a>

Multi-Process Service (MPS) enables concurrent execution of multiple CUDA contexts on a single GPU with better isolation than time-slicing.

Complete the following steps.

1. Define a `ResourceClaimTemplate` for MPS with a file named `mps-claim-template.yaml`:

   ```
   ---
   apiVersion: v1
   kind: Namespace
   metadata:
     name: mps-gpu
   
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: mps-gpu-template
     namespace: mps-gpu
   spec:
     spec:
       devices:
         requests:
         - name: shared-gpu
           deviceClassName: gpu.nvidia.com
         config:
         - requests: ["shared-gpu"]
           opaque:
             driver: gpu.nvidia.com
             parameters:
               apiVersion: resource.nvidia.com/v1beta1
               kind: GpuConfig
               sharing:
                 strategy: MPS
   ```

1. Define a Pod using MPS with a file named `mps-pod.yaml`:

   ```
   ---
   # Single Pod with Multiple Containers sharing GPU via MPS
   apiVersion: v1
   kind: Pod
   metadata:
     name: mps-multi-container-pod
     namespace: mps-gpu
     labels:
       app: mps-demo
   spec:
     restartPolicy: Never
     containers:
     # Container 1 - Inference workload
     - name: inference-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "-c"]
       args:
       - |
         import torch
         import torch.nn as nn
         import time
         import os
   
         print(f"=== INFERENCE CONTAINER STARTING ===")
         print(f"Process ID: {os.getpid()}")
         print(f"GPU available: {torch.cuda.is_available()}")
         print(f"GPU count: {torch.cuda.device_count()}")
   
         if torch.cuda.is_available():
             device = torch.cuda.current_device()
             print(f"Current GPU: {torch.cuda.get_device_name(device)}")
             print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
   
             # Create inference model
             model = nn.Sequential(
                 nn.Linear(1000, 500),
                 nn.ReLU(),
                 nn.Linear(500, 100)
             ).cuda()
   
             # Run inference
             for i in range(1, 999999):
                 with torch.no_grad():
                     x = torch.randn(128, 1000).cuda()
                     output = model(x)
                     result = torch.sum(output)
                     print(f"Inference Container PID {os.getpid()}: Batch {i}, Result: {result.item():.2f} at {time.strftime('%H:%M:%S')}")
                 time.sleep(2)
         else:
             print("No GPU available!")
             time.sleep(60)
       resources:
         claims:
         - name: shared-gpu-claim
           request: shared-gpu
   
     # Container 2 - Training workload
     - name: training-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "-c"]
       args:
       - |
         import torch
         import torch.nn as nn
         import time
         import os
   
         print(f"=== TRAINING CONTAINER STARTING ===")
         print(f"Process ID: {os.getpid()}")
         print(f"GPU available: {torch.cuda.is_available()}")
         print(f"GPU count: {torch.cuda.device_count()}")
   
         if torch.cuda.is_available():
             device = torch.cuda.current_device()
             print(f"Current GPU: {torch.cuda.get_device_name(device)}")
             print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
   
             # Create training model
             model = nn.Sequential(
                 nn.Linear(2000, 1000),
                 nn.ReLU(),
                 nn.Linear(1000, 500),
                 nn.ReLU(),
                 nn.Linear(500, 10)
             ).cuda()
   
             criterion = nn.MSELoss()
             optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
   
             # Run training
             for epoch in range(1, 999999):
                 x = torch.randn(64, 2000).cuda()
                 target = torch.randn(64, 10).cuda()
   
                 optimizer.zero_grad()
                 output = model(x)
                 loss = criterion(output, target)
                 loss.backward()
                 optimizer.step()
   
                 print(f"Training Container PID {os.getpid()}: Epoch {epoch}, Loss: {loss.item():.4f} at {time.strftime('%H:%M:%S')}")
                 time.sleep(3)
         else:
             print("No GPU available!")
             time.sleep(60)
       resources:
         claims:
         - name: shared-gpu-claim
           request: shared-gpu
   
     resourceClaims:
     - name: shared-gpu-claim
       resourceClaimTemplateName: mps-gpu-template
   
     nodeSelector:
       NodeGroupType: "gpu-dra"
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

1. Apply the template and create the MPS Pod:

   ```
   kubectl apply -f mps-claim-template.yaml
   kubectl apply -f mps-pod.yaml
   ```

1. Monitor the resource claims:

   ```
   kubectl get resourceclaims -n mps-gpu -w
   ```

   The following is example output:

   ```
   NAME                                             STATE                AGE
   mps-multi-container-pod-shared-gpu-claim-2p9kx   allocated,reserved   86s
   ```

This configuration demonstrates true GPU sharing using NVIDIA Multi-Process Service (MPS) through dynamic resource allocation (DRA). Unlike time-slicing where workloads take turns using the GPU sequentially, MPS enables both containers to run simultaneously on the same physical GPU. The key insight is that DRA MPS sharing requires multiple containers within a single Pod, not multiple separate Pods. When deployed, the DRA driver allocates one `ResourceClaim` to the Pod and automatically configures MPS to allow both the inference and training containers to execute concurrently.

Each container gets its own isolated GPU memory space and compute resources, with the MPS daemon coordinating access to the underlying hardware. You can verify this is working by doing the following:
+ Checking `nvidia-smi`, which will show both containers as `M+C` (MPS + Compute) processes sharing the same GPU device.
+ Monitoring the logs from both containers, which will display interleaved timestamps proving simultaneous execution.

This approach maximizes GPU utilization by allowing complementary workloads to share the expensive GPU hardware efficiently, rather than leaving it underutilized by a single process.

##### Container 1: `inference-container`
<a name="_container1_inference_container"></a>

```
root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:09:30 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   48C    P0             28W /   72W |     597MiB /  23034MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1    M+C   python                                  246MiB |
+-----------------------------------------------------------------------------------------+
```

##### Container 2: `training-container`
<a name="_container2_training_container"></a>

```
root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:16:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   51C    P0             28W /   72W |     597MiB /  23034MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1    M+C   python                                  314MiB |
+-----------------------------------------------------------------------------------------+
```
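To confirm MPS sharing from output like the above programmatically, you can check that every row in the `nvidia-smi` process table reports Type `M+C`. The following sketch parses a sample row; the table layout is assumed from the output above:

```python
# Illustrative check of `nvidia-smi` output like the above: every process
# in the table should have Type "M+C" (MPS + Compute) when MPS is active.
import re

nvidia_smi_processes = """\
|    0   N/A  N/A               1    M+C   python                                  246MiB |
"""

def process_types(table: str):
    # Each row: GPU, GI ID, CI ID, PID, Type, process name, memory usage
    rows = [l for l in table.splitlines() if re.match(r"\|\s+\d", l)]
    return [row.split()[5] for row in rows]   # token 5 is the Type field

print(process_types(nvidia_smi_processes))   # ['M+C']
```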

#### Optimize GPU workloads with Multi-Instance GPU
<a name="aiml-dra-mig"></a>

Multi-instance GPU (MIG) provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources.

Using dynamic MIG partitioning with various profiles requires the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator). The NVIDIA GPU Operator uses [MIG Manager](https://github.com/NVIDIA/gpu-operator/blob/47fea81ac752a68745300b5ec77f3bd8ee69d059/deployments/gpu-operator/values.yaml#L374) to create MIG profiles and reboots GPU instances (such as P4d, P4de, P5, and P6) to apply the configuration changes. The GPU Operator includes comprehensive MIG management capabilities through the MIG Manager component, which watches for node label changes and automatically applies the appropriate MIG configuration. When a MIG profile change is requested, the operator gracefully shuts down all GPU clients, applies the new partition geometry, and restarts the affected services. This process requires a node reboot to ensure clean GPU state transitions, which is why setting `WITH_REBOOT=true` in the MIG Manager configuration is essential for successful MIG deployments.

You need both the [NVIDIA DRA Driver](https://github.com/NVIDIA/k8s-dra-driver-gpu) and the NVIDIA GPU Operator to work with MIG on Amazon EKS. You don’t need to install the NVIDIA Device Plugin or DCGM Exporter separately, because both are included in the NVIDIA GPU Operator. Because the EKS NVIDIA AMIs come with the NVIDIA drivers pre-installed, driver deployment by the GPU Operator is disabled to avoid conflicts and to use the optimized drivers already present on the instances. The NVIDIA DRA Driver handles dynamic resource allocation for MIG instances, while the GPU Operator manages the entire GPU lifecycle. This includes MIG configuration, device plugin functionality, monitoring through DCGM, and node feature discovery. This integrated approach provides a complete solution for enterprise GPU management, with hardware-level isolation and dynamic resource allocation capabilities.

##### Step 1: Deploy NVIDIA GPU Operator
<a name="_step_1_deploy_nvidia_gpu_operator"></a>

1. Add the NVIDIA GPU Operator repository:

   ```
   helm repo add nvidia https://nvidia.github.io/gpu-operator
   helm repo update
   ```

1. Create a `gpu-operator-values.yaml` file:

   ```
   driver:
     enabled: false
   
   mig:
     strategy: mixed
   
   migManager:
     enabled: true
     env:
       - name: WITH_REBOOT
         value: "true"
     config:
       create: true
       name: custom-mig-parted-configs
       default: "all-disabled"
       data:
         config.yaml: |-
           version: v1
           mig-configs:
             all-disabled:
               - devices: all
                 mig-enabled: false
   
             # P4D profiles (A100 40GB)
             p4d-half-balanced:
               - devices: [0, 1, 2, 3]
                 mig-enabled: true
                 mig-devices:
                   "1g.5gb": 2
                   "2g.10gb": 1
                   "3g.20gb": 1
               - devices: [4, 5, 6, 7]
                 mig-enabled: false
   
             # P4DE profiles (A100 80GB)
             p4de-half-balanced:
               - devices: [0, 1, 2, 3]
                 mig-enabled: true
                 mig-devices:
                   "1g.10gb": 2
                   "2g.20gb": 1
                   "3g.40gb": 1
               - devices: [4, 5, 6, 7]
                 mig-enabled: false
   
   devicePlugin:
     enabled: true
     config:
       name: ""
       create: false
       default: ""
   
   toolkit:
     enabled: true
   
   nfd:
     enabled: true
   
   gfd:
     enabled: true
   
   dcgmExporter:
     enabled: true
     serviceMonitor:
       enabled: true
       interval: 15s
       honorLabels: false
       additionalLabels:
         release: kube-prometheus-stack
   
   nodeStatusExporter:
     enabled: false
   
   operator:
     defaultRuntime: containerd
     runtimeClass: nvidia
     resources:
       limits:
         cpu: 500m
         memory: 350Mi
       requests:
         cpu: 200m
         memory: 100Mi
   
   daemonsets:
     tolerations:
       - key: "nvidia.com/gpu"
         operator: "Exists"
         effect: "NoSchedule"
     nodeSelector:
       accelerator: nvidia
     priorityClassName: system-node-critical
   ```

1. Install GPU Operator using the `gpu-operator-values.yaml` file:

   ```
   helm install gpu-operator nvidia/gpu-operator \
     --namespace gpu-operator \
     --create-namespace \
     --version v25.3.1 \
     --values gpu-operator-values.yaml
   ```

   This Helm chart deploys the following components, along with the custom MIG profiles defined in the values file:
   + Device Plugin (GPU resource scheduling)
   + DCGM Exporter (GPU metrics and monitoring)
   + Node Feature Discovery (NFD - hardware labeling)
   + GPU Feature Discovery (GFD - GPU-specific labeling)
   + MIG Manager (Multi-instance GPU partitioning)
   + Container Toolkit (GPU container runtime)
   + Operator Controller (lifecycle management)

1. Verify the deployment Pods:

   ```
   kubectl get pods -n gpu-operator
   ```

   The following is example output:

   ```
   NAME                                                              READY   STATUS      RESTARTS        AGE
   gpu-feature-discovery-27rdq                                       1/1     Running     0               3h31m
   gpu-operator-555774698d-48brn                                     1/1     Running     0               4h8m
   nvidia-container-toolkit-daemonset-sxmh9                          1/1     Running     1 (3h32m ago)   4h1m
   nvidia-cuda-validator-qb77g                                       0/1     Completed   0               3h31m
   nvidia-dcgm-exporter-cvzd7                                        1/1     Running     0               3h31m
   nvidia-device-plugin-daemonset-5ljm5                              1/1     Running     0               3h31m
   nvidia-gpu-operator-node-feature-discovery-gc-67f66fc557-q5wkt    1/1     Running     0               4h8m
   nvidia-gpu-operator-node-feature-discovery-master-5d8ffddcsl6s6   1/1     Running     0               4h8m
   nvidia-gpu-operator-node-feature-discovery-worker-6t4w7           1/1     Running     1 (3h32m ago)   4h1m
   nvidia-gpu-operator-node-feature-discovery-worker-9w7g8           1/1     Running     0               4h8m
   nvidia-gpu-operator-node-feature-discovery-worker-k5fgs           1/1     Running     0               4h8m
   nvidia-mig-manager-zvf54                                          1/1     Running     1 (3h32m ago)   3h35m
   ```

1. Create an Amazon EKS cluster with a P4de managed node group for testing the MIG examples:

   ```
   apiVersion: eksctl.io/v1alpha5
   kind: ClusterConfig
   
   metadata:
     name: dra-eks-cluster
     region: us-east-1
     version: '1.33'
   
   managedNodeGroups:
   # P4DE MIG Node Group with Capacity Block Reservation
   - name: p4de-mig-nodes
     amiFamily: AmazonLinux2023
     instanceType: p4de.24xlarge
   
     # Capacity settings
     desiredCapacity: 0
     minSize: 0
     maxSize: 1
   
     # Use specific subnet in us-east-1b for capacity reservation
     subnets:
       - us-east-1b
   
     # AL2023 NodeConfig for RAID0 local storage only
     nodeadmConfig:
       apiVersion: node.eks.aws/v1alpha1
       kind: NodeConfig
       spec:
         instance:
           localStorage:
             strategy: RAID0
   
     # Node labels for MIG configuration
     labels:
       nvidia.com/gpu.present: "true"
       nvidia.com/gpu.product: "A100-SXM4-80GB"
       nvidia.com/mig.config: "p4de-half-balanced"
       node-type: "p4de"
       vpc.amazonaws.com/efa.present: "true"
       accelerator: "nvidia"
   
     # Node taints
     taints:
       - key: nvidia.com/gpu
         value: "true"
         effect: NoSchedule
   
     # EFA support
     efaEnabled: true
   
     # Placement group for high-performance networking
     placementGroup:
       groupName: p4de-placement-group
       strategy: cluster
   
     # Capacity Block Reservation (CBR)
     # Ensure CBR ID matches the subnet AZ with the Nodegroup subnet
     spot: false
     capacityReservation:
       capacityReservationTarget:
         capacityReservationId: "cr-abcdefghij"  # Replace with your capacity reservation ID
   ```

   The NVIDIA GPU Operator uses the `nvidia.com/mig.config: "p4de-half-balanced"` label added to the nodes and partitions the GPUs with the given profile.

1. Log in to the `p4de` instance.

1. Run the following command:

   ```
   nvidia-smi -L
   ```

   You should see the following example output:

   ```
   [root@ip-100-64-173-145 bin]# nvidia-smi -L
   GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-ab52e33c-be48-38f2-119e-b62b9935925a)
     MIG 3g.40gb     Device  0: (UUID: MIG-da972af8-a20a-5f51-849f-bc0439f7970e)
     MIG 2g.20gb     Device  1: (UUID: MIG-7f9768b7-11a6-5de9-a8aa-e9c424400da4)
     MIG 1g.10gb     Device  2: (UUID: MIG-498adad6-6cf7-53af-9d1a-10cfd1fa53b2)
     MIG 1g.10gb     Device  3: (UUID: MIG-3f55ef65-1991-571a-ac50-0dbf50d80c5a)
   GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-0eabeccc-7498-c282-0ac7-d3c09f6af0c8)
     MIG 3g.40gb     Device  0: (UUID: MIG-80543849-ea3b-595b-b162-847568fe6e0e)
     MIG 2g.20gb     Device  1: (UUID: MIG-3af1958f-fac4-59f1-8477-9f8d08c55029)
     MIG 1g.10gb     Device  2: (UUID: MIG-401088d2-716f-527b-a970-b1fc7a4ac6b2)
     MIG 1g.10gb     Device  3: (UUID: MIG-8c56c75e-5141-501c-8f43-8cf22f422569)
   GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-1c7a1289-243f-7872-a35c-1d2d8af22dd0)
     MIG 3g.40gb     Device  0: (UUID: MIG-e9b44486-09fc-591a-b904-0d378caf2276)
     MIG 2g.20gb     Device  1: (UUID: MIG-ded93941-9f64-56a3-a9b1-a129c6edf6e4)
     MIG 1g.10gb     Device  2: (UUID: MIG-6c317d83-a078-5c25-9fa3-c8308b379aa1)
     MIG 1g.10gb     Device  3: (UUID: MIG-2b070d39-d4e9-5b11-bda6-e903372e3d08)
   GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-9a6250e2-5c59-10b7-2da8-b61d8a937233)
     MIG 3g.40gb     Device  0: (UUID: MIG-20e3cd87-7a57-5f1b-82e7-97b14ab1a5aa)
     MIG 2g.20gb     Device  1: (UUID: MIG-04430354-1575-5b42-95f4-bda6901f1ace)
     MIG 1g.10gb     Device  2: (UUID: MIG-d62ec8b6-e097-5e99-a60c-abf8eb906f91)
     MIG 1g.10gb     Device  3: (UUID: MIG-fce20069-2baa-5dd4-988a-cead08348ada)
   GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-5d09daf0-c2eb-75fd-3919-7ad8fafa5f86)
   GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-99194e04-ab2a-b519-4793-81cb2e8e9179)
   GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-c1a1910f-465a-e16f-5af1-c6aafe499cd6)
   GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-c2cfafbc-fd6e-2679-e955-2a9e09377f78)
   ```

NVIDIA GPU Operator has successfully applied the `p4de-half-balanced` MIG profile to your P4DE instance, creating hardware-level GPU partitions as configured. Here’s how the partitioning works:

The GPU Operator applied this configuration from your embedded MIG profile:

```
p4de-half-balanced:
  - devices: [0, 1, 2, 3]        # First 4 GPUs: MIG enabled
    mig-enabled: true
    mig-devices:
      "1g.10gb": 2               # 2x small instances (10GB each)
      "2g.20gb": 1               # 1x medium instance (20GB)
      "3g.40gb": 1               # 1x large instance (40GB)
  - devices: [4, 5, 6, 7]        # Last 4 GPUs: Full GPUs
    mig-enabled: false
```
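The arithmetic behind this profile is worth spelling out: on an A100 80GB, the four partitions per MIG-enabled GPU account for the full 80 GB. A quick sketch of that math; parsing the memory size out of the profile name is an assumption based on the naming convention shown above:

```python
# Arithmetic sketch of the p4de-half-balanced profile above: per MIG-enabled
# A100 80GB GPU, the partitions should account for the full 80 GB.
profile = {"1g.10gb": 2, "2g.20gb": 1, "3g.40gb": 1}   # from the config above

def profile_memory_gb(p: dict) -> int:
    # The trailing ".<N>gb" in each profile name is the instance memory in GB
    return sum(int(name.split(".")[1][:-2]) * count for name, count in p.items())

mig_enabled_gpus = 4                                   # devices [0, 1, 2, 3]
print(profile_memory_gb(profile))                      # 80 (GB per partitioned GPU)
print(mig_enabled_gpus * sum(profile.values()))        # 16 MIG instances in total
```

On a P4de, this yields 16 MIG instances across the first four GPUs plus four full 80 GB GPUs, matching the `nvidia-smi -L` output shown in this section.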

From your `nvidia-smi -L` output, here’s what the GPU Operator created:
+ MIG-enabled GPUs (0-3): hardware partitioned
  + GPU 0: NVIDIA A100-SXM4-80GB
    + MIG 3g.40gb Device 0 – Large workloads (40GB memory, 42 SMs)
    + MIG 2g.20gb Device 1 – Medium workloads (20GB memory, 28 SMs)
    + MIG 1g.10gb Device 2 – Small workloads (10GB memory, 14 SMs)
    + MIG 1g.10gb Device 3 – Small workloads (10GB memory, 14 SMs)
  + GPU 1: NVIDIA A100-SXM4-80GB
    + MIG 3g.40gb Device 0 – Identical partition layout
    + MIG 2g.20gb Device 1
    + MIG 1g.10gb Device 2
    + MIG 1g.10gb Device 3
  + GPU 2 and GPU 3 – Same pattern as GPU 0 and GPU 1
+ Full GPUs (4-7): No MIG partitioning
  + GPU 4: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  + GPU 5: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  + GPU 6: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  + GPU 7: NVIDIA A100-SXM4-80GB – Full 80GB GPU

Once the NVIDIA GPU Operator creates the MIG partitions, the NVIDIA DRA Driver automatically detects these hardware-isolated instances and makes them available for dynamic resource allocation in Kubernetes. The DRA driver discovers each MIG instance with its specific profile (1g.10gb, 2g.20gb, 3g.40gb) and exposes them as schedulable resources through the `mig.nvidia.com` device class.

The DRA driver continuously monitors the MIG topology and maintains an inventory of available instances across all GPUs. When a Pod requests a specific MIG profile through a `ResourceClaimTemplate`, the DRA driver intelligently selects an appropriate MIG instance from any available GPU, enabling true hardware-level multi-tenancy. This dynamic allocation allows multiple isolated workloads to run simultaneously on the same physical GPU while maintaining strict resource boundaries and performance guarantees.
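Conceptually, this selection resembles a first-fit search over the MIG inventory. The following sketch is purely illustrative of that behavior, not the actual DRA driver implementation:

```python
# Illustrative first-fit selection over a MIG inventory (NOT the actual
# NVIDIA DRA driver code): pick the first free instance matching a profile.
inventory = [
    {"gpu": 0, "profile": "3g.40gb", "free": False},   # already claimed
    {"gpu": 0, "profile": "2g.20gb", "free": True},
    {"gpu": 1, "profile": "2g.20gb", "free": True},
]

def allocate(profile: str, devices: list):
    for dev in devices:
        if dev["free"] and dev["profile"] == profile:
            dev["free"] = False        # reserve the instance for the claim
            return dev
    return None                        # no match: the claim stays pending

print(allocate("2g.20gb", inventory))  # GPU 0's free 2g.20gb instance
```

A second request for the same profile would land on GPU 1's instance, and a third would return `None`, mirroring a `ResourceClaim` that remains in the `pending` state.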

##### Step 2: Test MIG resource allocation
<a name="_step_2_test_mig_resource_allocation"></a>

Now let’s run some examples to demonstrate how DRA dynamically allocates MIG instances to different workloads. Deploy the `resourceclaimtemplates` and test pods to see how the DRA driver places workloads across the available MIG partitions, allowing multiple containers to share GPU resources with hardware-level isolation.

1. Create `mig-claim-template.yaml` to contain the MIG `resourceclaimtemplates`:

   ```
   apiVersion: v1
   kind: Namespace
   metadata:
     name: mig-gpu
   
   ---
   # Template for 3g.40gb MIG instance (Large training)
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: mig-large-template
     namespace: mig-gpu
   spec:
     spec:
       devices:
         requests:
         - name: mig-large
           deviceClassName: mig.nvidia.com
           selectors:
           - cel:
               expression: |
                 device.attributes['gpu.nvidia.com'].profile == '3g.40gb'
   
   ---
   # Template for 2g.20gb MIG instance (Medium training)
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: mig-medium-template
     namespace: mig-gpu
   spec:
     spec:
       devices:
         requests:
         - name: mig-medium
           deviceClassName: mig.nvidia.com
           selectors:
           - cel:
               expression: |
                 device.attributes['gpu.nvidia.com'].profile == '2g.20gb'
   
   ---
   # Template for 1g.10gb MIG instance (Small inference)
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: mig-small-template
     namespace: mig-gpu
   spec:
     spec:
       devices:
         requests:
         - name: mig-small
           deviceClassName: mig.nvidia.com
           selectors:
           - cel:
               expression: |
                 device.attributes['gpu.nvidia.com'].profile == '1g.10gb'
   ```

1. Apply the three templates:

   ```
   kubectl apply -f mig-claim-template.yaml
   ```

1. Run the following command:

   ```
   kubectl get resourceclaimtemplates -n mig-gpu
   ```

   The following is example output:

   ```
   NAME                  AGE
   mig-large-template    71m
   mig-medium-template   71m
   mig-small-template    71m
   ```

1. Create `mig-pod.yaml` to schedule multiple workloads that use these `resourceclaimtemplates`:

   ```
   ---
   # ConfigMap containing Python scripts for MIG pods
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: mig-scripts-configmap
     namespace: mig-gpu
   data:
     large-training-script.py: |
       import torch
       import torch.nn as nn
       import torch.optim as optim
       import time
       import os
   
       print(f"=== LARGE TRAINING POD (3g.40gb) ===")
       print(f"Process ID: {os.getpid()}")
       print(f"GPU available: {torch.cuda.is_available()}")
       print(f"GPU count: {torch.cuda.device_count()}")
   
       if torch.cuda.is_available():
           device = torch.cuda.current_device()
           print(f"Using GPU: {torch.cuda.get_device_name(device)}")
           print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")
   
           # Large model for 3g.40gb instance
           model = nn.Sequential(
               nn.Linear(2048, 1024),
               nn.ReLU(),
               nn.Linear(1024, 512),
               nn.ReLU(),
               nn.Linear(512, 256),
               nn.ReLU(),
               nn.Linear(256, 10)
           ).cuda()
   
           optimizer = optim.Adam(model.parameters())
           criterion = nn.CrossEntropyLoss()
   
           print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
   
           # Training loop
           for epoch in range(100):
               # Large batch for 3g.40gb
               x = torch.randn(256, 2048).cuda()
               y = torch.randint(0, 10, (256,)).cuda()
   
               optimizer.zero_grad()
               output = model(x)
               loss = criterion(output, y)
               loss.backward()
               optimizer.step()
   
               if epoch % 10 == 0:
                   print(f"Large Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
               time.sleep(3)
   
           print("Large training completed on 3g.40gb MIG instance")
   
     medium-training-script.py: |
       import torch
       import torch.nn as nn
       import torch.optim as optim
       import time
       import os
   
       print(f"=== MEDIUM TRAINING POD (2g.20gb) ===")
       print(f"Process ID: {os.getpid()}")
       print(f"GPU available: {torch.cuda.is_available()}")
       print(f"GPU count: {torch.cuda.device_count()}")
   
       if torch.cuda.is_available():
           device = torch.cuda.current_device()
           print(f"Using GPU: {torch.cuda.get_device_name(device)}")
           print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")
   
           # Medium model for 2g.20gb instance
           model = nn.Sequential(
               nn.Linear(1024, 512),
               nn.ReLU(),
               nn.Linear(512, 256),
               nn.ReLU(),
               nn.Linear(256, 10)
           ).cuda()
   
           optimizer = optim.Adam(model.parameters())
           criterion = nn.CrossEntropyLoss()
   
           print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
   
           # Training loop
           for epoch in range(100):
               # Medium batch for 2g.20gb
               x = torch.randn(128, 1024).cuda()
               y = torch.randint(0, 10, (128,)).cuda()
   
               optimizer.zero_grad()
               output = model(x)
               loss = criterion(output, y)
               loss.backward()
               optimizer.step()
   
               if epoch % 10 == 0:
                   print(f"Medium Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
               time.sleep(4)
   
           print("Medium training completed on 2g.20gb MIG instance")
   
     small-inference-script.py: |
       import torch
       import torch.nn as nn
       import time
       import os
   
       print(f"=== SMALL INFERENCE POD (1g.10gb) ===")
       print(f"Process ID: {os.getpid()}")
       print(f"GPU available: {torch.cuda.is_available()}")
       print(f"GPU count: {torch.cuda.device_count()}")
   
       if torch.cuda.is_available():
           device = torch.cuda.current_device()
           print(f"Using GPU: {torch.cuda.get_device_name(device)}")
           print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")
   
           # Small model for 1g.10gb instance
           model = nn.Sequential(
               nn.Linear(512, 256),
               nn.ReLU(),
               nn.Linear(256, 10)
           ).cuda()
   
           print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
   
           # Inference loop
           for i in range(200):
               with torch.no_grad():
                   # Small batch for 1g.10gb
                   x = torch.randn(32, 512).cuda()
                   output = model(x)
                   prediction = torch.argmax(output, dim=1)
   
                   if i % 20 == 0:
                       print(f"Small Inference - Batch {i}, Predictions: {prediction[:5].tolist()}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
               time.sleep(2)
   
           print("Small inference completed on 1g.10gb MIG instance")
   
   ---
   # Pod 1: Large training workload (3g.40gb)
   apiVersion: v1
   kind: Pod
   metadata:
     name: mig-large-training-pod
     namespace: mig-gpu
     labels:
       app: mig-large-training
       workload-type: training
   spec:
     restartPolicy: Never
     containers:
     - name: large-training-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "/scripts/large-training-script.py"]
       volumeMounts:
       - name: script-volume
         mountPath: /scripts
         readOnly: true
       resources:
         claims:
         - name: mig-large-claim
     resourceClaims:
     - name: mig-large-claim
       resourceClaimTemplateName: mig-large-template
     nodeSelector:
       node.kubernetes.io/instance-type: p4de.24xlarge
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
     volumes:
     - name: script-volume
       configMap:
         name: mig-scripts-configmap
         defaultMode: 0755
   
   ---
   # Pod 2: Medium training workload (2g.20gb) - can run on SAME GPU as Pod 1
   apiVersion: v1
   kind: Pod
   metadata:
     name: mig-medium-training-pod
     namespace: mig-gpu
     labels:
       app: mig-medium-training
       workload-type: training
   spec:
     restartPolicy: Never
     containers:
     - name: medium-training-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "/scripts/medium-training-script.py"]
       volumeMounts:
       - name: script-volume
         mountPath: /scripts
         readOnly: true
       resources:
         claims:
         - name: mig-medium-claim
     resourceClaims:
     - name: mig-medium-claim
       resourceClaimTemplateName: mig-medium-template
     nodeSelector:
       node.kubernetes.io/instance-type: p4de.24xlarge
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
     volumes:
     - name: script-volume
       configMap:
         name: mig-scripts-configmap
         defaultMode: 0755
   
   ---
   # Pod 3: Small inference workload (1g.10gb) - can run on SAME GPU as Pod 1 & 2
   apiVersion: v1
   kind: Pod
   metadata:
     name: mig-small-inference-pod
     namespace: mig-gpu
     labels:
       app: mig-small-inference
       workload-type: inference
   spec:
     restartPolicy: Never
     containers:
     - name: small-inference-container
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["python", "/scripts/small-inference-script.py"]
       volumeMounts:
       - name: script-volume
         mountPath: /scripts
         readOnly: true
       resources:
         claims:
         - name: mig-small-claim
     resourceClaims:
     - name: mig-small-claim
       resourceClaimTemplateName: mig-small-template
     nodeSelector:
       node.kubernetes.io/instance-type: p4de.24xlarge
       nvidia.com/gpu.present: "true"
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
     volumes:
     - name: script-volume
       configMap:
         name: mig-scripts-configmap
         defaultMode: 0755
   ```

1. Apply this spec, which should deploy three Pods:

   ```
   kubectl apply -f mig-pod.yaml
   ```

   The scheduler, working with the DRA driver, allocates a matching MIG instance to each Pod.

1. Check the DRA driver Pod logs. You will see output similar to the following:

   ```
   I0717 21:50:22.925811 1 driver.go:87] NodePrepareResource is called: number of claims: 1
   I0717 21:50:22.932499 1 driver.go:129] Returning newly prepared devices for claim '933e9c72-6fd6-49c5-933c-a896407dc6d1': [&Device{RequestNames:[mig-large],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-0-mig-9-4-4,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=gpu-0-mig-9-4-4],}]
   I0717 21:50:23.186472 1 driver.go:87] NodePrepareResource is called: number of claims: 1
   I0717 21:50:23.191226 1 driver.go:129] Returning newly prepared devices for claim '61e5ddd2-8c2e-4c19-93ae-d317fecb44a4': [&Device{RequestNames:[mig-medium],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-2-mig-14-0-2,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=gpu-2-mig-14-0-2],}]
   I0717 21:50:23.450024 1 driver.go:87] NodePrepareResource is called: number of claims: 1
   I0717 21:50:23.455991 1 driver.go:129] Returning newly prepared devices for claim '1eda9b2c-2ea6-401e-96d0-90e9b3c111b5': [&Device{RequestNames:[mig-small],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-1-mig-19-2-1,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=gpu-1-mig-19-2-1],}]
   ```

1. Verify the `resourceclaims` to see the Pod status:

   ```
   kubectl get resourceclaims -n mig-gpu -w
   ```

   The following is example output:

   ```
   NAME                                             STATE                AGE
   mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
   mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
   mig-large-training-pod-mig-large-claim-6dpn8     allocated,reserved   0s
   mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
   mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
   mig-medium-training-pod-mig-medium-claim-bk596   allocated,reserved   0s
   mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
   mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
   mig-small-inference-pod-mig-small-claim-d2t58    allocated,reserved   0s
   ```

   As you can see, all of the resource claims moved from `pending` to `allocated,reserved`, meaning the DRA driver has assigned a MIG instance to each Pod.

1. Run `nvidia-smi` from the node. You will notice three Python processes running, one on each allocated MIG instance:

   ```
   [root@ip-100-64-173-145 bin]# nvidia-smi
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
   |-----------------------------------------+------------------------+----------------------+
   | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
   | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
   | | | MIG M. |
   |=========================================+========================+======================|
   | 0 NVIDIA A100-SXM4-80GB On | 00000000:10:1C.0 Off | On |
   | N/A 63C P0 127W / 400W | 569MiB / 81920MiB | N/A Default |
   | | | Enabled |
   +-----------------------------------------+------------------------+----------------------+
   | 1 NVIDIA A100-SXM4-80GB On | 00000000:10:1D.0 Off | On |
   | N/A 56C P0 121W / 400W | 374MiB / 81920MiB | N/A Default |
   | | | Enabled |
   +-----------------------------------------+------------------------+----------------------+
   | 2 NVIDIA A100-SXM4-80GB On | 00000000:20:1C.0 Off | On |
   | N/A 63C P0 128W / 400W | 467MiB / 81920MiB | N/A Default |
   | | | Enabled |
   +-----------------------------------------+------------------------+----------------------+
   | 3 NVIDIA A100-SXM4-80GB On | 00000000:20:1D.0 Off | On |
   | N/A 57C P0 118W / 400W | 249MiB / 81920MiB | N/A Default |
   | | | Enabled |
   +-----------------------------------------+------------------------+----------------------+
   | 4 NVIDIA A100-SXM4-80GB On | 00000000:90:1C.0 Off | 0 |
   | N/A 51C P0 77W / 400W | 0MiB / 81920MiB | 0% Default |
   | | | Disabled |
   +-----------------------------------------+------------------------+----------------------+
   | 5 NVIDIA A100-SXM4-80GB On | 00000000:90:1D.0 Off | 0 |
   | N/A 46C P0 69W / 400W | 0MiB / 81920MiB | 0% Default |
   | | | Disabled |
   +-----------------------------------------+------------------------+----------------------+
   | 6 NVIDIA A100-SXM4-80GB On | 00000000:A0:1C.0 Off | 0 |
   | N/A 52C P0 74W / 400W | 0MiB / 81920MiB | 0% Default |
   | | | Disabled |
   +-----------------------------------------+------------------------+----------------------+
   | 7 NVIDIA A100-SXM4-80GB On | 00000000:A0:1D.0 Off | 0 |
   | N/A 47C P0 72W / 400W | 0MiB / 81920MiB | 0% Default |
   | | | Disabled |
   +-----------------------------------------+------------------------+----------------------+
   
   
   +-----------------------------------------------------------------------------------------+
   | MIG devices: |
   +------------------+----------------------------------+-----------+-----------------------+
   | GPU GI CI MIG | Memory-Usage | Vol| Shared |
   | ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
   | | | ECC| |
   |==================+==================================+===========+=======================|
   | 0 2 0 0 | 428MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
   | | 2MiB / 32767MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 0 3 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
   | | 0MiB / 16383MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 0 9 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 0 10 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 1 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
   | | 0MiB / 32767MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 1 5 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
   | | 0MiB / 16383MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 1 13 0 2 | 161MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 2MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 1 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 2 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
   | | 0MiB / 32767MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 2 5 0 1 | 289MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
   | | 2MiB / 16383MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 2 13 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 2 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 3 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
   | | 0MiB / 32767MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 3 5 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
   | | 0MiB / 16383MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 3 13 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   | 3 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
   | | 0MiB / 8191MiB | | |
   +------------------+----------------------------------+-----------+-----------------------+
   
   
   +-----------------------------------------------------------------------------------------+
   | Processes: |
   | GPU GI CI PID Type Process name GPU Memory |
   | ID ID Usage |
   |=========================================================================================|
   | 0 2 0 64080 C python 312MiB |
   | 1 13 0 64085 C python 118MiB |
   | 2 5 0 64073 C python 210MiB |
   +-----------------------------------------------------------------------------------------+
   ```

#### Optimize GPU workloads with IMEX using GB200 P6e instances
<a name="aiml-dra-imex"></a>

IMEX (Internode Memory Exchange) enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 UltraServers.

Complete the following steps.

1. Define a `ComputeDomain` for multi-node training with a file named `imex-compute-domain.yaml`:

   ```
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: distributed-training-domain
     namespace: default
   spec:
     numNodes: 2
     channel:
       resourceClaimTemplate:
         name: imex-channel-template
   ```

1. Define a Pod using IMEX channels with a file named `imex-pod.yaml`:

   ```
   apiVersion: v1
   kind: Pod
   metadata:
     name: imex-distributed-training
     namespace: default
     labels:
       app: imex-training
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: nvidia.com/gpu.clique
               operator: Exists
     containers:
     - name: distributed-training
       image: nvcr.io/nvidia/pytorch:25.04-py3
       command: ["bash", "-c"]
       args:
       - |
         echo "=== IMEX Channel Verification ==="
         ls -la /dev/nvidia-caps-imex-channels/
         echo ""
   
         echo "=== GPU Information ==="
         nvidia-smi
         echo ""
   
         echo "=== NCCL Test (if available) ==="
         python -c "
         import torch
         import torch.distributed as dist
         import os
   
         print(f'CUDA available: {torch.cuda.is_available()}')
         print(f'CUDA device count: {torch.cuda.device_count()}')
   
         if torch.cuda.is_available():
             for i in range(torch.cuda.device_count()):
                 print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
   
         # Check for IMEX environment variables
         imex_vars = [k for k in os.environ.keys() if 'IMEX' in k or 'NVLINK' in k]
         if imex_vars:
             print('IMEX Environment Variables:')
             for var in imex_vars:
                 print(f'  {var}={os.environ[var]}')
   
         print('IMEX channel verification completed')
         "
   
         # Keep container running for inspection
         sleep 3600
       resources:
         claims:
         - name: imex-channel-0
         - name: imex-channel-1
     resourceClaims:
     - name: imex-channel-0
       resourceClaimTemplateName: imex-channel-template
     - name: imex-channel-1
       resourceClaimTemplateName: imex-channel-template
     tolerations:
     - key: nvidia.com/gpu
       operator: Exists
       effect: NoSchedule
   ```

   **Note**  
   This requires P6e GB200 instances.

1. Deploy IMEX by applying the `ComputeDomain` and templates:

   ```
   kubectl apply -f imex-claim-template.yaml
   kubectl apply -f imex-compute-domain.yaml
   kubectl apply -f imex-pod.yaml
   ```

1. Check the `ComputeDomain` status.

   ```
   kubectl get computedomain distributed-training-domain
   ```

1. Monitor the IMEX daemon deployment.

   ```
   kubectl get pods -n nvidia-dra-driver -l resource.nvidia.com/computeDomain
   ```

1. Check the IMEX channels in the Pod:

   ```
   kubectl exec imex-distributed-training -- ls -la /dev/nvidia-caps-imex-channels/
   ```

1. View the Pod logs:

   ```
   kubectl logs imex-distributed-training
   ```

   The following is an example of expected output:

   ```
   === IMEX Channel Verification ===
   total 0
   drwxr-xr-x. 2 root root 80 Jul 8 10:45 .
   drwxr-xr-x. 6 root root 380 Jul 8 10:45 ..
   crw-rw-rw-. 1 root root 241, 0 Jul 8 10:45 channel0
   crw-rw-rw-. 1 root root 241, 1 Jul 8 10:45 channel1
   ```

For more information, see the [NVIDIA example](https://github.com/NVIDIA/k8s-dra-driver-gpu/discussions/249) on GitHub.

# Networking
<a name="aiml-networking"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

## Consider Higher Network Bandwidth or Elastic Fabric Adapter For Applications with High Inter-Node Communication
<a name="_consider_higher_network_bandwidth_or_elastic_fabric_adapter_for_applications_with_high_inter_node_communication"></a>

For distributed training workloads on Amazon EKS with high inter-node communication demands, consider selecting instances with higher network bandwidth or [Elastic Fabric Adapter](https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html) (EFA). Insufficient network performance can bottleneck data transfer, slowing down machine learning tasks like distributed multi-GPU training. Note that inference workloads don’t typically have high inter-node communication.

 **Example** 

For example, using Karpenter:

```
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-network-bandwidth: "100000"  # 100 Gbps in Mbps
    node.kubernetes.io/instance-type: p5.48xlarge  # EFA-enabled instance
  containers:
  - name: training-job
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.6.0-gpu-py312-cu124-ubuntu22.04-ec2-v1.6
    resources:
      limits:
        vpc.amazonaws.com/efa: 1  # Requires EFA device plugin
```

Ensure tools like MPI and NCCL are installed in your container image to leverage EFA for training jobs.

## Increase the number of IP addresses available to enable faster pod launch times
<a name="_increase_the_number_of_ip_addresses_available_to_enable_faster_pod_launch_times"></a>

In EKS, each pod needs an IP address from the VPC CIDR block. As your cluster scales to more nodes and pods, you risk IP address exhaustion or slower pod launches. Enabling prefix delegation mitigates these issues by pre-allocating IP ranges and reducing EC2 API calls, resulting in faster pod launch times and improved scalability.

Enabling prefix delegation after creating your cluster allows the Amazon VPC Container Network Interface (CNI) plugin to assign /28 IP prefixes (16 IP addresses each) to network interfaces on EC2 instances. This means each node can support more pods, reducing the risk of IP shortages. For example, on a `c5.4xlarge` instance, you can support up to 110 pods with prefix delegation.
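
The 110-pod figure above follows the EKS recommended max-pods calculation: with prefix delegation, each secondary IP slot on an ENI holds a /28 prefix (16 addresses), and the recommendation caps the result at 110 pods for instances with 30 or fewer vCPUs (250 otherwise). A minimal sketch of that calculation, where the `c5.4xlarge` ENI and per-ENI address counts are its published EC2 limits (verify your instance type with the `max-pods-calculator.sh` script from the amazon-eks-ami repository):

```python
def recommended_max_pods(enis, ipv4_per_eni, vcpus, prefix_delegation=True):
    """Approximate the EKS recommended max-pods value for a node.

    With prefix delegation, each secondary IP slot holds a /28 prefix
    (16 addresses); the recommendation caps the result at 110 pods for
    instances with <= 30 vCPUs and 250 otherwise.
    """
    per_slot = 16 if prefix_delegation else 1
    pods = enis * (ipv4_per_eni - 1) * per_slot + 2
    if prefix_delegation:
        pods = min(pods, 110 if vcpus <= 30 else 250)
    return pods

# c5.4xlarge: 8 ENIs, 30 IPv4 addresses per ENI, 16 vCPUs (published EC2 limits)
print(recommended_max_pods(8, 30, 16))                           # 110
print(recommended_max_pods(8, 30, 16, prefix_delegation=False))  # 234
```

Note that this is the recommended kubelet scheduling limit, not a hard VPC constraint.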

While prefix delegation is most often discussed as a way to optimize IP usage in environments with many small pods, AI/ML workloads typically run fewer, larger pods (e.g., one pod per GPU). Even so, prefix delegation helps: the VPC CNI maintains a warm pool of pre-allocated prefixes, so IP addresses are immediately available at pod startup rather than being requested from EC2 on demand. Fewer EC2 API calls and readily available addresses translate into faster pod launch times, which is particularly beneficial when scaling AI/ML workloads quickly.

To enable prefix delegation:

```
kubectl set env daemonset/aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

Ensure proper planning for VPC subnets to avoid IP address exhaustion, especially in large deployments, and manage CIDR blocks to avoid overlaps across VPCs. To learn more, see [Optimizing IP Address Utilization](ip-opt.md) and [Assign more IP addresses to Amazon EKS nodes with prefixes](https://docs.aws.amazon.com/eks/latest/best-practices/ip-opt.html#_plan_for_growth).

# Security
<a name="aiml-security"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

## Security and Compliance
<a name="_security_and_compliance"></a>

### Consider S3 with KMS for encryption-compliant storage
<a name="_consider_s3_with_kms_for_encryption_compliant_storage"></a>

Unless you specify otherwise, all S3 buckets use [SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) by default to encrypt objects at rest. However, you can choose to configure buckets to use server-side encryption with AWS Key Management Service (AWS KMS) keys (SSE-KMS) instead. The security controls in AWS KMS can help you meet encryption-related compliance requirements. You can use these KMS keys to protect your data in Amazon S3 buckets. When you use SSE-KMS encryption with an S3 bucket, the AWS KMS keys must be in the same Region as the bucket.

Configure your [general purpose buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) to use [S3 Bucket Keys for SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html#sse-kms-bucket-keys), to reduce your AWS KMS request costs by up to 99 percent by decreasing the request traffic from Amazon S3 to AWS KMS. S3 Bucket Keys [are always enabled](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-UsingKMSEncryption.html#s3-express-sse-kms-bucket-keys) for `GET` and `PUT` operations in a directory bucket and can’t be disabled.
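
When configuring encryption on a general purpose bucket with the `put-bucket-encryption` CLI command shown later in this section, Bucket Keys are enabled per rule with the `BucketKeyEnabled` field. A minimal sketch of the server-side encryption configuration JSON (the KMS key ID is a placeholder):

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "1234abcd-12ab-34cd-56ef-1234567890ab"
      },
      "BucketKeyEnabled": true
    }
  ]
}
```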

Note that [Amazon S3 Express One Zone](https://aws.amazon.com/s3/storage-classes/express-one-zone/) uses a specific type of bucket called an *S3 directory bucket*. Directory buckets are exclusively for the S3 Express One Zone storage class and enable high-performance, low-latency access. To [configure default bucket encryption on an S3 directory bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-specifying-kms-encryption.html), use the AWS CLI, and specify the KMS key ID or ARN, not the alias, as in the following example:

**Example**  

```
aws s3api put-bucket-encryption --bucket my-directory-bucket --server-side-encryption-configuration \
   '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": "1234abcd-12ab-34cd-56ef-1234567890ab"}}]}'
```
Ensure your EKS pod’s IAM role has KMS permissions (e.g., `kms:Decrypt`) to access encrypted objects. Test this in a staging environment by uploading a sample model to the bucket, mounting it in a pod (e.g., via the Mountpoint S3 CSI driver), and verifying the pod can read the encrypted data without errors. Audit logs via AWS CloudTrail to confirm compliance with encryption requirements. See the [KMS Documentation](https://docs.aws.amazon.com/kms/latest/developerguide/) for setup details and key management.
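
As a sketch of the corresponding IAM permissions for the pod role (the key ARN and account ID are placeholders): reading SSE-KMS objects requires `kms:Decrypt`, and uploading them additionally requires `kms:GenerateDataKey`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUseOfModelBucketKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}
```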

# Storage
<a name="aiml-storage"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

## Data Management and Storage
<a name="_data_management_and_storage"></a>

### Deploy AI Models to Pods Using a CSI Driver
<a name="_deploy_ai_models_to_pods_using_a_csi_driver"></a>

AI/ML workloads often require access to large model artifacts (e.g., trained weights, configurations), and pods need a reliable, scalable way to access these without embedding them in container images, which would increase image sizes and container registry pull times. To reduce the operational overhead of managing volume mounts, we recommend deploying AI models to pods by mounting Amazon storage services (e.g., S3, FSx for Lustre, FSx for OpenZFS, EFS) as Persistent Volumes (PVs) using their respective CSI drivers. For implementation details, see subsequent topics in this section.

### Optimize Storage for ML Model Caches on EKS
<a name="_optimize_storage_for_ml_model_caches_on_eks"></a>

Choosing an optimal storage solution is critical to minimizing pod and application start-up latency, reducing memory usage, achieving the desired levels of performance, and ensuring the scalability of ML workloads. ML workloads often rely on model files (weights), which can be large and require shared access to data across pods or nodes. Selecting the optimal storage solution depends on your workload’s characteristics, such as single-node efficiency, multi-node access, latency requirements, cost constraints, and data integration requirements (such as with an Amazon S3 data repository). We recommend benchmarking different storage solutions with your workloads to understand which one meets your requirements; the following options can help you evaluate each against your workload requirements.

Amazon EKS supports the following AWS storage services; each has its own CSI driver and comes with its own strengths for AI and ML workflows:
+  [Mountpoint for Amazon S3](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html) 
+  [Amazon FSx for Lustre](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html) 
+  [Amazon FSx for OpenZFS](https://docs.aws.amazon.com/eks/latest/userguide/fsx-openzfs-csi.html) 
+  [Amazon EFS](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html) 
+  [Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) 

The choice of AWS storage service depends on your deployment architecture, scale, performance requirements, and cost strategy. The storage CSI driver must be installed on your EKS cluster, which allows it to create and manage Persistent Volumes (PVs) outside the lifecycle of a Pod. Using the CSI driver, you can create PV definitions for supported AWS storage services as EKS cluster resources. Pods can then access these storage volumes for their data volumes by creating a Persistent Volume Claim (PVC) for the PV. Depending on the AWS storage service and your deployment scenario, a single PVC (and its associated PV) can be attached to multiple Pods for a workload. For example, for ML training, shared training data is stored on a PV and accessed by multiple Pods; for real-time online inference, LLM models are cached on a PV and accessed by multiple Pods. Sample PV and PVC YAML files for AWS storage services are provided below to help you get started.

 **Monitoring performance**: Poor disk performance can delay container image reads, increase pod startup latency, and degrade inference or training throughput. Use [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) to monitor performance metrics for your AWS storage services. When you identify performance bottlenecks, modify your storage configuration parameters to optimize performance.

 **Scenario: Multiple GPU instances workload** 

 **Amazon FSx for Lustre**: In scenarios where you have a **multiple EC2 GPU compute instance** environment with latency-sensitive, high-bandwidth dynamic workloads, such as distributed training and model serving, and you require native Amazon S3 data repository integration, we recommend [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html). FSx for Lustre provides a fully managed, high-performance parallel filesystem designed for compute-intensive workloads such as high-performance computing (HPC) and machine learning.

You can [Install the FSx for Lustre CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html) to mount FSx filesystems on EKS as a Persistent Volume (PV), then deploy FSx for Lustre file system as a standalone high performance cache or as an S3-linked file system to act as a high performance cache for S3 data, providing fast I/O and high throughput for data access across your GPU compute instances. FSx for Lustre can be deployed with either Scratch-SSD or Persistent-SSD storage options:
+  **Scratch-SSD storage**: Recommended for workloads that are ephemeral or short-lived (hours), with fixed throughput capacity per-TiB provisioned.
+  **Persistent-SSD storage**: Recommended for mission-critical, long-running workloads that require the highest level of availability, for example HPC simulations, big data analytics or Machine Learning training. With Persistent-SSD storage, you can configure both the storage capacity and throughput capacity (per-TiB) that is required.

 **Performance considerations:** 
+  **Administrative pod to manage FSx for Lustre file system**: Configure an "administrative" Pod that has the Lustre client installed and mounts the FSx file system. This provides an access point for fine-tuning the FSx file system, and for situations where you need to pre-warm the file system with your ML training data or LLM models before starting up your GPU compute instances. This is especially important if your architecture utilizes Spot-based Amazon EC2 GPU/compute instances, where you can use the administrative Pod to "warm" or "pre-load" the desired data into the FSx file system so that it is ready to be processed when your Spot-based Amazon EC2 instances launch.
+  **Elastic Fabric Adapter (EFA)**: Persistent-SSD storage deployment types support [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), which is ideal for high-performance, throughput-intensive GPU workloads. Note that FSx for Lustre also supports NVIDIA GPUDirect Storage (GDS), a technology that creates a direct data path between local or remote storage and GPU memory to enable faster data access.
+  **Compression**: Enable data compression on the file system if you have file types that can be compressed. This can help to increase performance as data compression reduces the amount of data that is transferred between FSx for Lustre file servers and storage.
+  **Lustre file system striping configuration**:
  +  **Data striping**: Allows FSx for Lustre to distribute a file’s data across multiple Object Storage Targets (OSTs) within a Lustre file system, which maximizes parallel access and throughput, especially for large-scale ML training jobs.
  +  **Standalone file system striping**: By default, a 4-component Lustre striping configuration is created for you via the [Progressive file layouts (PFL)](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#striping-pfl) capability of FSx for Lustre. In most scenarios you don’t need to update the default PFL Lustre stripe count/size. If you need to adjust the Lustre data striping, then you can manually adjust the Lustre striping by referring to [striping parameters of a FSx for Lustre file system](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#striping-data).
  +  **S3-Linked File system**: Files imported into the FSx file system using the native Amazon S3 integration (Data Repository Association or DRA) don’t use the default PFL layout, but instead use the layout in the file system’s `ImportedFileChunkSize` parameter. S3-imported files larger than the `ImportedFileChunkSize` will be stored on multiple OSTs with a stripe count based on the `ImportedFileChunkSize` defined value (default 1GiB). If you have large files, we recommend tuning this parameter to a higher value.
  +  **Placement**: Deploy an FSx for Lustre file system in the same Availability Zone as your compute or GPU nodes to enable the lowest-latency data access and avoid cross-Availability-Zone access patterns. If you have multiple GPU nodes located in different Availability Zones, we recommend deploying an FSx file system in each Availability Zone for low-latency data access.
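A minimal sketch of the administrative pod pattern described above, assuming the container image has the Lustre client tools available and the file system is exposed through a PVC such as the `fsx-claim` example below:

```
apiVersion: v1
kind: Pod
metadata:
  name: fsx-admin
spec:
  containers:
  - name: admin
    image: amazonlinux:2023   # assumes Lustre client tools are installed in the image
    command: ["/bin/sh", "-c", "sleep infinity"]   # keep the pod alive for interactive use
    volumeMounts:
    - name: fsx-volume
      mountPath: /fsx
  volumes:
  - name: fsx-volume
    persistentVolumeClaim:
      claimName: fsx-claim
```

From this pod you could, for example, pre-load training data into the file system before Spot GPU instances launch, or inspect striping with `lfs getstripe` on a mounted path.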

 **Example** 

Persistent Volume (PV) definition for an FSx for Lustre file system, using Static Provisioning (where the FSx instance has already been provisioned).

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  mountOptions:
    - flock
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: [FileSystemId of FSx instance]
    volumeAttributes:
      dnsname: [DNSName of FSx instance]
      mountname: [MountName of FSx instance]
```

 **Example** 

Persistent Volume Claim definition for PV called `fsx-pv`:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: fsx-pv
```

 **Example** 

Configure a pod to use the Persistent Volume Claim `fsx-claim`:

```
apiVersion: v1
kind: Pod
metadata:
  name: fsx-app
spec:
  containers:
  - name: app
    image: amazonlinux:2023
    command: ["/bin/sh"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: fsx-claim
```

For complete examples, see the [FSx for Lustre Driver Examples in GitHub](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/tree/master/examples/kubernetes). Monitor [Amazon FSx for Lustre performance metrics](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html) using Amazon CloudWatch. When performance bottlenecks are identified, adjust your configuration parameters as needed.

 **Scenario: Single GPU instance workload** 

 **Mountpoint for Amazon S3 with CSI Driver:** You can mount an S3 bucket as a volume in your pods using [Mountpoint for Amazon S3 CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html). This method allows for fine-grained access control over which Pods can access specific S3 buckets. Each pod has its own mountpoint instance and local cache (5-10GB), isolating model loading and read performance between pods. This setup supports pod-level authentication with IAM Roles for Service Accounts (IRSA) and independent model versioning for different models or customers. The trade-off is increased memory usage and API traffic, as each pod issues S3 API calls and maintains its own cache.

 **Example** Partial example of a Pod deployment YAML with CSI Driver:

```
# CSI driver dynamically mounts the S3 bucket for each pod

volumes:
  - name: s3-mount
    csi:
      driver: s3.csi.aws.com
      volumeAttributes:
        bucketName: your-s3-bucket-name
        mountOptions: "--allow-delete"  # Optional
        region: us-west-2

containers:
  - name: inference
    image: your-inference-image
    volumeMounts:
      - mountPath: /models
        name: s3-mount
```

 **Performance considerations:** 
+  **Data caching**: Mountpoint for S3 can cache content to reduce costs and improve performance for repeated reads to the same file. Refer to [Caching configuration](https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#caching-configuration) for caching options and parameters.
+  **Object part-size**: When storing and accessing files over 72GB in size, refer to [Configuring Mountpoint performance](https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#configuring-mountpoint-performance) to understand how to configure the `--read-part-size` and `--write-part-size` command-line parameters to meet your data profile and workload requirements.
+  **[Shared-cache](https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#shared-cache)**: Designed for objects up to 1MB in size; it does not support large objects. Use the [Local cache](https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#local-cache) option for caching objects in NVMe or EBS volumes on the EKS node.
+  **API request charges**: When performing a high number of file operations with Mountpoint for S3, API request charges can become a significant portion of storage costs. To mitigate this, if strong consistency is not required, enable metadata caching and set the `metadata-ttl` period to reduce the number of API operations against S3.
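To illustrate, caching and metadata options can be supplied as `mountOptions` on a static Persistent Volume for the Mountpoint S3 CSI driver. The bucket name and cache path below are placeholders, and the exact option set depends on your driver and Mountpoint versions, so verify against the Mountpoint configuration documentation:

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-model-pv
spec:
  capacity:
    storage: 1200Gi # ignored by the driver, but required by Kubernetes
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-delete
    - region us-west-2
    - cache /mnt/s3-cache    # local cache directory on the node
    - metadata-ttl 300       # seconds; reduces repeated metadata API calls to S3
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: your-s3-bucket-name
```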

For more details, see the [Mountpoint for Amazon S3 CSI Driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html) in the Amazon EKS official documentation. We recommend monitoring the performance metrics of [Amazon S3 with Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html) if bottlenecks occur and adjusting your configuration where required.

### Amazon FSx for OpenZFS persistent shared storage
<a name="_amazon_fsx_for_openzfs_persistent_shared_storage"></a>

For scenarios involving multiple EC2 GPU compute instances with latency-sensitive workloads requiring high availability, high performance, cost sensitivity, and multiple pod deployments for different applications, we recommend Amazon FSx for OpenZFS. Some workload examples include real-time inference, reinforcement learning, and training generative adversarial networks. FSx for OpenZFS is particularly beneficial for workloads needing high-performance access to a focused directory structure of small files with small-IO data access patterns. FSx for OpenZFS also provides the flexibility to scale performance independently from storage capacity, helping you achieve optimal cost efficiency by matching storage size to actual needs while maintaining required performance levels.

The native [FSx for OpenZFS CSI driver](https://github.com/kubernetes-sigs/aws-fsx-openzfs-csi-driver/tree/main) allows for the creation of multiple PVCs to a single file system by creating multiple volumes. This reduces management overhead and maximizes the utilization of the file system’s throughput and IOPS through consolidated application pod deployments on a single file system. Additionally, it includes enterprise features like zero-copy snapshots, zero-copy clones, and user and group quotas which can be dynamically provisioned through the CSI driver.

FSx for OpenZFS supports three different [deployment types](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/availability-durability.html#choosing-single-or-multi) upon creation:
+  **Single-AZ:** Lowest cost option with sub-millisecond latencies, but provides no high-availability at the file system or Availability Zone level. Recommended for development and test workloads or those which have high-availability at the application layer.
+  **Single-AZ (HA):** Provides high-availability at the file system level with sub-millisecond latencies. Recommended for highest performance workloads which require high-availability.
+  **Multi-AZ:** Provides high-availability at the file system level as well as across Availability Zones. Recommended for high-performance workloads that require the additional availability across Availability Zones.

Performance considerations:
+  **Deployment type:** If the additional availability across Availability Zones isn’t a requirement, consider using the Single-AZ (HA) deployment type. This deployment type provides up to 100% of the throughput for writes, maintains sub-millisecond latencies, and Gen2 file systems have an additional NVMe cache for storing up to terabytes of frequently accessed data. Multi-AZ file systems provide up to 75% of the throughput for writes at an increased latency to accommodate cross-AZ traffic.
+  **Throughput and IOPS:** Both the [throughput](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/managing-throughput-capacity.html) and [IOPS](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/managing-storage-capacity.html) configured for the file system can be scaled up or down post deployment. You can provision up to 10GB/s of disk throughput providing up to 21GB/s of cached data access. The IOPS can be scaled up to 400,000 from disk and the cache can provide over 1 million IOPS. Note that throughput scaling of a Single-AZ file system does cause a brief outage of the file system as no high-availability exists. Throughput scaling of a Single-AZ (HA) or Multi-AZ file system can be done non-disruptively. The SSD IOPS can be scaled once every six hours.
+  **Storage Class:** FSx for OpenZFS supports both the [SSD storage](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/performance-ssd.html) class as well as the [Intelligent-Tiering](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/performance-intelligent-tiering.html) storage class. For AI/ML workloads it is recommended to use the SSD storage class providing consistent performance to the workload keeping the CPU’s/GPU’s as busy as possible.
+  **Compression:** Enable the [LZ4 compression](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/performance.html#perf-data-compression) algorithm if your workload’s data can be compressed. This reduces the amount of cache each file consumes, allowing more data to be served directly from the cache at network throughput and IOPS rates, reducing the load on the SSD disks.
+  **Record size:** Most AI/ML workloads benefit from leaving the [record size](https://docs.aws.amazon.com/fsx/latest/OpenZFSGuide/performance.html#record-size-performance) at its default of 128KiB. Reduce this value only if the dataset consists of large files (above 10GiB) that the application consistently accesses in small blocks below 128KiB.

Once the file system is created, an associated root volume is automatically created by the service. It is best practice to store data within child volumes of the root volume on the file system. Using the [FSx for OpenZFS CSI driver](https://github.com/kubernetes-sigs/aws-fsx-openzfs-csi-driver/tree/main) you create an associated Persistent Volume Claim to dynamically create the child volume.

Examples:

A Storage Class (SC) definition for an FSx for OpenZFS volume, used to create a child volume of the root volume (`$ROOT_VOL_ID`) on an existing file system and export the volume to the VPC CIDR (`$VPC_CIDR`) using the NFS v4.2 protocol.

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsxz-vol-sc
provisioner: fsx.openzfs.csi.aws.com
parameters:
  ResourceType: "volume"
  ParentVolumeId: '"$ROOT_VOL_ID"'
  CopyTagsToSnapshots: 'false'
  DataCompressionType: '"LZ4"'
  NfsExports: '[{"ClientConfigurations": [{"Clients": "$VPC_CIDR", "Options": ["rw","crossmnt","no_root_squash"]}]}]'
  ReadOnly: 'false'
  RecordSizeKiB: '128'
  Tags: '[{"Key": "Name", "Value": "AI-ML"}]'
  OptionsOnDeletion: '["DELETE_CHILD_VOLUMES_AND_SNAPSHOTS"]'
reclaimPolicy: Delete
allowVolumeExpansion: false
mountOptions:
  - nfsvers=4.2
  - rsize=1048576
  - wsize=1048576
  - timeo=600
  - nconnect=16
  - async
```

A dynamically created Persistent Volume Claim (PVC) against the `fsxz-vol-sc` Storage Class created above. **Note**: the storage capacity allocated is 1Gi; this is required for FSx for OpenZFS volumes, as noted in the [CSI driver FAQ](https://github.com/kubernetes-sigs/aws-fsx-openzfs-csi-driver/blob/main/docs/FAQ.md). With this configuration, the volume is provided the full capacity provisioned to the file system. If the volume capacity needs to be restricted, you can do so using user or group quotas.

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-vol-pvc
  namespace: example
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsxz-vol-sc
  resources:
    requests:
      storage: 1Gi
```

Configure a pod to mount a volume using the Persistent Volume Claim (PVC) of dynamic-vol-pvc:

```
kind: Pod
apiVersion: v1
metadata:
  name: fsx-app
  namespace: example
spec:
  volumes:
    - name: dynamic-vol-pv
      persistentVolumeClaim:
        claimName: dynamic-vol-pvc
  containers:
    - name: app
      image: amazonlinux:2023
      command: ["/bin/sh"]
      volumeMounts:
        - mountPath: "/mnt/fsxz"
          name: dynamic-vol-pv
```

### Amazon EFS for shared model caches
<a name="_amazon_efs_for_shared_model_caches"></a>

In scenarios where you have a **multiple EC2 GPU compute instance environment** and dynamic workloads requiring shared model access across multiple nodes and Availability Zones (e.g., real-time online inference with Karpenter) with moderate performance and scalability needs, we recommend using an Amazon Elastic File System (EFS) file system as a Persistent Volume through the EFS CSI Driver. [Amazon EFS](https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html) is a fully managed, highly available, and scalable cloud-based NFS file system that provides EC2 instances and containers with shared file storage and consistent performance, with no upfront provisioning of storage required. Use EFS as the model volume, and mount the volume as a shared filesystem by defining a Persistent Volume on the EKS cluster. Each Persistent Volume Claim (PVC) that is backed by an EFS file system is created as an [EFS Access Point to the EFS file system](https://docs.aws.amazon.com/efs/latest/ug/efs-access-points.html). EFS allows multiple nodes and pods to access the same model files, eliminating the need to sync data to each node’s filesystem. [Install the EFS CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html) to integrate EFS with EKS.
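As a sketch, a StorageClass for dynamic provisioning through EFS access points might look like the following; the file system ID is a placeholder for your own EFS file system:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-model-cache-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap             # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0   # placeholder EFS file system ID
  directoryPerms: "700"                # permissions for the access point root directory
```

Each PVC created against this StorageClass is backed by its own EFS access point on the shared file system.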

You can deploy an Amazon EFS file system with the following throughput modes:
+  **Bursting Throughput**: Scales throughput with file system size, suitable for varying workloads with occasional bursts.
+  **Provisioned Throughput**: Dedicated throughput, ideal for consistent ML training jobs with predictable performance needs within limits.
+  **Elastic Throughput (recommended for ML)**: Automatically scales based on workload; cost-effective for varying ML workloads.

To view performance specifications, see [Amazon EFS performance specifications](https://docs.aws.amazon.com/efs/latest/ug/performance.html).

 **Performance considerations**:
+ Use Elastic Throughput for varying workloads.
+ Use Standard storage class for active ML workloads.

For complete examples of using Amazon EFS file system as a persistent Volume within your EKS cluster and Pods, refer to the [EFS CSI Driver Examples in GitHub](https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes). Monitor [Amazon EFS performance metrics](https://docs.aws.amazon.com/efs/latest/ug/accessingmetrics.html) using Amazon CloudWatch. When performance bottlenecks are identified, adjust your configuration parameters as needed.

### Use S3 Express One Zone for Latency-Sensitive, Object Oriented Workflows
<a name="_use_s3_express_one_zone_for_latency_sensitive_object_oriented_workflows"></a>

For latency-sensitive AI/ML workloads on Amazon EKS, such as large-scale model training, inference, or high-performance analytics, we recommend using [S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-getting-started.html) for high-performance model storage and retrieval. S3 Express One Zone offers a hierarchical namespace, like a filesystem, where you simply upload to a directory bucket, suitable for "chucking everything in" while maintaining high speed. This is particularly useful if you are accustomed to object-oriented workflows. Alternatively, if you are more accustomed to file systems (e.g., POSIX-compliant), you may prefer Amazon FSx for Lustre or OpenZFS. Amazon S3 Express One Zone stores data in a single Availability Zone (AZ) using directory buckets, offering lower latency than standard S3 buckets, which distribute data across multiple AZs. For best results, co-locate your EKS compute in the same AZ as your Express One Zone bucket. To learn more about the differences of S3 Express One Zone, see [Differences for directory buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-differences.html).

To access S3 Express One Zone with filesystem semantics, we recommend using the [Mountpoint S3 CSI Driver](https://github.com/awslabs/mountpoint-s3-csi-driver/tree/main), which mounts S3 buckets (including Express One Zone) as a local file system. This translates file operations (e.g., open, read, write) into S3 API calls, providing high-throughput access optimized for read-heavy workloads from multiple clients and sequential writes to new objects. For details on supported operations and limitations (e.g., no full POSIX compliance, but appends and renames supported in Express One Zone), see the [Mountpoint semantics documentation](https://github.com/awslabs/mountpoint-s3/blob/main/doc/SEMANTICS.md).
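As a sketch, mounting a directory bucket looks much like mounting a standard bucket with the Mountpoint S3 CSI driver; the bucket name below is a placeholder following the directory-bucket naming convention (base name, AZ ID, and the `--x-s3` suffix):

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-express-pv
spec:
  capacity:
    storage: 1200Gi # ignored by the driver, but required by Kubernetes
  accessModes:
    - ReadWriteMany
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-express-volume
    volumeAttributes:
      bucketName: my-models--usw2-az1--x-s3   # placeholder directory bucket name
```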

 **Performance benefits** 
+ Provides up to 10x faster data access than S3 Standard, with consistent single-digit millisecond latency and up to 80% lower request costs.
+ Scales to handle hundreds of thousands to [millions of requests per second per directory bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-optimizing-performance-design-patterns.html#s3-express-how-directory-buckets-work), avoiding throttling or brownouts seen in standard S3 during extreme loads (e.g., from clusters with tens to hundreds of thousands of GPUs/CPUs saturating networks).
+ Uses a session-based authentication mechanism. Authenticate once to obtain a session token, then perform repeated operations at high speed without per-request auth overhead. This is optimized for workloads like frequent checkpointing or data loading.

 **Recommended use cases** 
+  **Caching**: One of the top use cases of using the Mountpoint S3 CSI Driver with S3 Express One Zone is caching. The first instance reads data from S3 Standard (general purpose), caching it in lower-latency Express One Zone. Subsequent reads by other clients access the cached data faster, which is ideal for multi-node scenarios where multiple EKS nodes read the same data (e.g., shared training datasets). This can improve performance by up to 7x for repeated accesses and reduce compute costs. For workloads requiring full POSIX compliance (e.g., file locking and in-place modifications), consider Amazon FSx for Lustre or OpenZFS as alternatives.
+  **Large-Scale AI/ML training and inference**: Ideal for workloads with hundreds or thousands of compute nodes (e.g., GPUs in EKS clusters) where general purpose S3 throttling could cause delays, wasting expensive compute resources. For example, LLM researchers or organizations running daily model tests/checkpoints benefit from fast, reliable access without breaking regional S3. For smaller-scale workloads (e.g., 10s of nodes), S3 Standard or other storage classes may suffice.
+  **Data pipelines**: Load/prepare models, archive training data, or stream checkpoints. If your team prefers object storage over traditional file systems (e.g., due to familiarity with S3), use this instead of engineering changes for POSIX-compliant options like FSx for Lustre.

 **Considerations** 
+  **Resilience**: Single-AZ design provides 99.999999999% durability (same as standard S3, via redundancy within the AZ) but lower availability (99.95% designed, 99.9% SLA) compared to multi-AZ classes (99.99% availability). It’s less resilient to AZ failures. Use for recreatable or cached data. Consider multi-AZ replication or backups for critical workloads.
+  **API and Feature Support**: Supports a subset of S3 APIs (e.g., no lifecycle policies or replication); may require minor app changes for session authentication or object handling.
+  **EKS Integration**: Co-locate your EKS pods/nodes in the same AZ as the directory bucket to minimize network latency. Use Mountpoint for Amazon S3 or CSI drivers for Kubernetes-native access.
+  **Testing:** Test retrieval latency in a non-production environment to validate performance gains. Monitor for throttling in standard S3 scenarios (e.g., high GPU saturation) and compare.
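The co-location guidance above can be expressed with the standard topology label in a partial pod spec, pinning pods to the Availability Zone that hosts the directory bucket (the zone name is a placeholder):

```
# Partial pod spec: schedule onto nodes in the same AZ as the directory bucket
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-west-2a   # placeholder; match your bucket's AZ
```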

The S3 Express One Zone storage class is available in multiple regions and integrates with EKS for workloads needing object access without waiting on storage. To learn more, see [Getting started with S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-getting-started.html).

# Observability
<a name="aiml-observability"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

## Monitoring and Observability
<a name="_monitoring_and_observability"></a>

### GPU Metrics Explained
<a name="gpu-metrics-explained"></a>

The GPU Utilization metric shows whether the GPU ran any work during the sample window. This metric captures the percentage of time the GPU executed at least one instruction, but it does not reveal how efficiently the GPU used its hardware. A GPU contains multiple Streaming Multiprocessors (SMs), which are the parallel processing units that execute instructions. A 100% utilization reading can mean the GPU ran heavy parallel workloads across all its SMs, or it can mean a single small instruction activated the GPU over the sample period. To understand actual utilization, you need to examine GPU metrics at multiple levels of the hardware architecture. Each Streaming Multiprocessor is built from different core types, and each layer exposes different performance characteristics. Top-level metrics (GPU Utilization, Memory Utilization, GPU Power, and GPU Temperature, visible through nvidia-smi) show whether the device is active. Deeper metrics (SM utilization, SM Activity, and tensor core usage) reveal how efficiently the GPU uses its resources.

### Target high GPU power usage
<a name="target-high-gpu-power-usage"></a>

Underutilized GPUs waste compute capacity and increase costs because workloads fail to engage all GPU components simultaneously. For AI/ML workloads on Amazon EKS, track GPU power usage as a proxy to identify actual GPU activity. GPU Utilization reports the percentage of time the GPU executes any kernel, but it does not reveal whether the Streaming Multiprocessors, memory controllers, and tensor cores are all active at the same time. Power usage exposes this gap because fully engaged hardware draws significantly more power than hardware running lightweight kernels or sitting idle between tasks. Compare power draw against the GPU’s thermal design power (TDP) to spot underutilization, then investigate whether your workload is bottlenecked by CPU preprocessing, network I/O, or inefficient batch sizes.
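The TDP comparison above can be sketched as a small analysis step over collected power readings. The function and sample values below are hypothetical; in practice you would pull the readings from CloudWatch Container Insights or DCGM-Exporter:

```
# Illustrative check: flag GPUs whose average power draw is well below TDP,
# suggesting idle gaps or lightweight kernels. Metric values are hypothetical.

def underutilized_gpus(samples, tdp_watts, threshold=0.5):
    """Return GPU ids whose average power draw is below threshold * TDP."""
    flagged = []
    for gpu_id, readings in samples.items():
        avg_power = sum(readings) / len(readings)
        if avg_power < threshold * tdp_watts:
            flagged.append(gpu_id)
    return sorted(flagged)

# Hypothetical power-draw samples (watts) for two GPUs with a 400 W TDP:
samples = {
    "gpu-0": [390.0, 402.5, 398.0],   # near TDP: hardware fully engaged
    "gpu-1": [95.0, 110.0, 102.0],    # well below TDP: likely a bottleneck upstream
}
print(underutilized_gpus(samples, tdp_watts=400))  # → ['gpu-1']
```

A GPU flagged this way is a candidate for consolidation, or a hint to investigate CPU preprocessing, network I/O, or batch-size bottlenecks.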

Set up CloudWatch Container Insights on Amazon EKS to identify pods, nodes, or workloads with low GPU power consumption. This tool integrates directly with Amazon EKS and allows you to monitor GPU power consumption and adjust pod scheduling or instance types when power usage falls below your target levels. If you need advanced visualization or custom dashboards, use NVIDIA’s DCGM-Exporter with Prometheus and Grafana for Kubernetes-native monitoring. Both approaches surface key NVIDIA metrics like `nvidia_smi_power_draw` (GPU power consumption) and `nvidia_smi_temperature_gpu` (GPU temperature). For a list of metrics, see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-NVIDIA-GPU.html. Look for patterns such as consistently low power usage during specific hours or for particular jobs. These trends help you identify where to consolidate workloads or adjust resource allocation.

Static resource limits in Kubernetes (such as CPU, memory, and GPU counts) often lead to over-provisioning or underutilization, especially for dynamic AI/ML workloads like inference where demand fluctuates. Analyze your utilization trends and consolidate workloads onto fewer GPUs. Ensure each GPU reaches full utilization before you allocate additional ones. This approach reduces waste and lowers costs. For detailed guidance on optimizing scheduling and sharing strategies, see the [EKS Compute and Autoscaling best practices](https://docs.aws.amazon.com/eks/latest/best-practices/aiml-compute.html).

## Observability and Metrics
<a name="_observability_and_metrics"></a>

### Using Monitoring and Observability Tools for your AI/ML Workloads
<a name="using-monitoring-and-observability-tools-for-your-ai-ml"></a>

Modern AI/ML services require coordination across infrastructure, modeling, and application logic. Platform engineers manage the infrastructure and observability stack. They collect, store, and visualize metrics. AI/ML engineers define model-specific metrics and monitor performance under varying load and data distribution. Application developers consume APIs, route requests, and track service-level metrics and user interactions. Without unified observability practices, these teams work in silos and miss critical signals about system health and performance. Establishing shared visibility across environments ensures all stakeholders can detect issues early and maintain reliable service.

Optimizing Amazon EKS clusters for AI/ML workloads presents unique monitoring challenges, especially around GPU memory management. Without proper monitoring, organizations face out-of-memory (OOM) errors, resource inefficiencies, and unnecessary costs. Effective monitoring ensures better performance, resilience, and lower costs for EKS customers. Use a holistic approach that combines three monitoring layers. First, monitor granular GPU metrics using [NVIDIA DCGM Exporter](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html) to track GPU power usage, GPU temperature, SM activity, SM occupancy, and XID errors. Second, monitor inference serving frameworks like [Ray](https://docs.ray.io/en/latest/serve/monitoring.html) and [vLLM](https://docs.vllm.ai/en/v0.8.5/design/v1/metrics.html) to gain distributed workload insights through their native metrics. Third, collect application-level insights to track custom metrics specific to your workload. This layered approach gives you visibility from hardware utilization through application performance.

#### Tools and frameworks
<a name="aiml-observability-tools-and-frameworks"></a>

Several tools and frameworks provide native, out-of-the-box metrics for monitoring AI/ML workloads. These built-in metrics eliminate the need for custom instrumentation and reduce setup time. The metrics focus on performance aspects such as latency, throughput, and token generation, which are critical for inference serving and benchmarking. Using native metrics allows you to start monitoring immediately without building custom collection pipelines.
+  **vLLM**: A high-throughput serving engine for large language models (LLMs) that provides native metrics such as request latency and memory usage.
+  **Ray**: A distributed computing framework that emits metrics for scalable AI workloads, including task execution times and resource utilization.
+  **Hugging Face Text Generation Inference (TGI)**: A toolkit for deploying and serving LLMs, with built-in metrics for inference performance.
+  **NVIDIA genai-perf**: A command-line tool for benchmarking generative AI models, measuring throughput, latency, and LLM-specific metrics, such as requests completed in specific time intervals.

#### Observability methods
<a name="aiml-observability-methods"></a>

We recommend implementing any additional observability mechanisms in one of the following ways.

 **CloudWatch Container Insights** If your organization prefers AWS-native tools with minimal setup, we recommend [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html). It integrates with the [NVIDIA DCGM Exporter](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html) to collect GPU metrics and offers pre-built dashboards for quick insights. Enabled by installing the [CloudWatch Observability add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html) on your cluster, Container Insights deploys and manages the lifecycle of the [NVIDIA DCGM Exporter](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html), which collects GPU metrics from NVIDIA’s drivers and exposes them to CloudWatch.

After you install Container Insights, CloudWatch automatically detects NVIDIA GPUs in your environment and collects critical health and performance metrics. These metrics appear on curated out-of-the-box dashboards. You can also integrate [Ray](https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html) and [vLLM](https://docs.vllm.ai/en/latest/) with CloudWatch using the [Unified CloudWatch Agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/UseCloudWatchUnifiedAgent.html) to send their native metrics. This unified approach simplifies observability in EKS environments and lets teams focus on performance tuning and cost optimization instead of building monitoring infrastructure.

For a complete list of available metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html#Container-Insights-metrics-EKS-GPU). For step-by-step guidance on implementing GPU monitoring, refer to [Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights](https://aws.amazon.com/blogs/mt/gain-operational-insights-for-nvidia-gpu-workloads-using-amazon-cloudwatch-container-insights/). For practical examples of optimizing inference latency, see [Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference](https://aws.amazon.com/blogs/machine-learning/optimizing-ai-responsiveness-a-practical-guide-to-amazon-bedrock-latency-optimized-inference/).

 **Managed Prometheus and Grafana** If your organization needs customized dashboards and advanced visualization capabilities, deploy Prometheus with the [NVIDIA DCGM-Exporter](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter) and Grafana for Kubernetes-native monitoring. Prometheus scrapes and stores GPU metrics from the DCGM-Exporter, while Grafana provides flexible visualization and alerting capabilities. This approach gives you more control over dashboard design and metric retention compared to CloudWatch Container Insights.

You can extend this monitoring stack by integrating open-source frameworks like [Ray and vLLM](https://awslabs.github.io/ai-on-eks/docs/blueprints/inference/GPUs/vLLM-rayserve) to export their native metrics to Prometheus. You can also [connect Grafana to an AWS X-Ray data source](https://docs.aws.amazon.com/grafana/latest/userguide/x-ray-data-source.html) to visualize distributed traces and identify performance bottlenecks across your inference pipeline. This combination provides end-to-end visibility from GPU-level metrics through application-level request flows.

For step-by-step guidance on deploying this monitoring stack, refer to [Monitoring GPU workloads on Amazon EKS using AWS managed open-source services](https://aws.amazon.com/blogs/mt/monitoring-gpu-workloads-on-amazon-eks-using-aws-managed-open-source-services/).

### Consider Monitoring Core Training & Fine-Tuning Metrics
<a name="aiml-consider-monitor-fine-tuning-metrics"></a>

Monitor core training metrics to track the health and performance of your Amazon EKS cluster and the machine learning workloads running on it. Training workloads have different monitoring requirements than inference workloads because they run for extended periods, consume resources differently, and require visibility into model convergence and data pipeline efficiency. The metrics below help you identify bottlenecks, optimize resource allocation, and ensure training jobs complete successfully. For step-by-step guidance on implementing this monitoring approach, refer to [Introduction to observing machine learning workloads on Amazon EKS](https://aws.amazon.com/blogs/containers/part-1-introduction-to-observing-machine-learning-workloads-on-amazon-eks/).

 **Resource Usage Metrics** 

Monitor resource usage metrics to validate that your resources are being properly consumed. These metrics help you identify bottlenecks and root cause performance issues.
+  **CPU, Memory, Network, GPU Power and GPU Temperature** - Monitor these metrics to ensure allocated resources meet workload demands and identify optimization opportunities. Track metrics like `gpu_memory_usage_bytes` to identify memory consumption patterns and detect peak usage. Calculate percentiles such as the 95th percentile (P95) to understand the highest memory demands during training. This analysis helps you optimize models and infrastructure to avoid OOM errors and reduce costs.
+  **SM Occupancy, SM Activity, FPxx Activity** - Monitor these metrics to understand how the underlying resource on the GPU is being used. Target 0.8 for SM Activity as a [rule of thumb](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics).
+  **Node and Pod Resource Utilization** - Track resource usage at the node and pod level to identify resource contention and potential bottlenecks. Monitor whether nodes approach capacity limits, which can delay pod scheduling and slow training jobs.
+  **Resource Utilization Compared to Requests and Limits** — Compare actual resource usage against configured requests and limits to determine whether your cluster can handle current workloads and accommodate future ones. This comparison reveals whether you need to adjust resource allocations to avoid OOM errors or resource waste.
+  **Internal Metrics from ML Frameworks** - Capture internal training and convergence metrics from ML frameworks such as TensorFlow and PyTorch. These metrics include loss curves, learning rate, batch processing time, and training step duration. Visualize these metrics using TensorBoard or similar tools to track model convergence and identify training inefficiencies.

 **Model Performance Metrics** 

Monitor model performance metrics to validate that your training process produces models that meet accuracy and business requirements. These metrics help you determine when to stop training, compare model versions, and identify performance degradation.
+  **Accuracy, Precision, Recall, and F1-score** — Track these metrics to understand how well your model performs on validation data. Calculate the F1-score on a validation set after each training epoch to assess whether the model is improving and when it reaches acceptable performance levels.
+  **Business-Specific Metrics and KPIs** — Define and track metrics that directly measure the business value of your AI/ML initiatives. For a recommendation system, track metrics like click-through rate or conversion rate to ensure the model drives the intended business outcomes.
+  **Performance over time** — Compare performance metrics across model versions and training runs to identify trends and detect degradation. Track whether newer model versions maintain or improve performance compared to baseline models. This historical comparison helps you decide whether to deploy new models or investigate training issues.
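For reference, the F1-score referenced above is the harmonic mean of precision and recall, which can be computed with a minimal helper (names are illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with 0.8 precision and 0.6 recall
print(round(f1_score(0.8, 0.6), 4))  # 0.6857
```

Because the harmonic mean penalizes imbalance, a model cannot score well on F1 by excelling at only one of precision or recall.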

 **Data Quality and Drift Metrics** 

Monitor data quality and drift metrics to ensure your training data remains consistent and representative. Data drift can cause model performance to degrade over time, while data quality issues can prevent models from converging or produce unreliable results.
+  **Statistical Properties of Input Data** — Track statistical properties such as mean, standard deviation, and distribution of input features over time to detect data drift or anomalies. Monitor whether feature distributions shift significantly from your baseline training data. For example, if the mean of a critical feature changes by more than two standard deviations, investigate whether your data pipeline has changed or whether the underlying data source has shifted.
+  **Data Drift Detection and Alerts** — Implement automated mechanisms to detect and alert on data quality issues before they impact training. Use statistical tests such as the Kolmogorov-Smirnov test or chi-squared test to compare current data distributions with your original training data. Set up alerts when tests detect significant drift so you can retrain models with updated data or investigate data pipeline issues.
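To make the drift check concrete, the two-sample Kolmogorov-Smirnov statistic is the maximum gap between the two empirical CDFs. In practice you would use a library routine such as SciPy's `ks_2samp` (which also returns a p-value), but the underlying calculation is a sketch like this:

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in a + b
    )

rng = random.Random(42)
baseline = [rng.gauss(0.0, 1.0) for _ in range(2000)]
drifted = [rng.gauss(0.5, 1.0) for _ in range(2000)]  # mean shifted by 0.5 sigma

print(ks_statistic(baseline, baseline))        # identical samples: 0.0
print(ks_statistic(baseline, drifted) > 0.1)   # shifted mean is clearly visible
```

Alerting on the statistic (or the associated p-value) exceeding a tuned threshold gives you an automated signal to investigate the data pipeline or retrain.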

 **Latency and Throughput Metrics** 

Monitor latency and throughput metrics to identify bottlenecks in your training pipeline and optimize resource utilization. These metrics help you understand where time is spent during training and where to focus optimization efforts.
+  **End-to-End Latency of ML Training Pipelines** — Measure the total time for data to flow through your entire training pipeline, from data ingestion to model update. Track this metric across training runs to identify whether pipeline changes improve or degrade performance. High latency often indicates bottlenecks in data loading, preprocessing, or network communication between nodes.
+  **Training Throughput and Processing Rate** — Track the volume of data your training pipeline processes per unit of time to ensure efficient resource utilization. Monitor metrics such as samples processed per second or batches completed per minute. Low throughput relative to your hardware capacity suggests inefficiencies in data loading, preprocessing, or model computation that waste GPU cycles.
+  **Checkpoint Save and Restore Latency** – Monitor the time required to save model checkpoints to storage (S3, EFS, FSx) and restore them to GPU or CPU memory when resuming jobs or recovering from failures. Slow checkpoint operations extend job recovery time and increase costs. Track checkpoint size, save duration, restore duration, and failure count to identify optimization opportunities such as compression or faster storage tiers.
+  **Data Loading and Preprocessing Time** - Measure the time spent loading data from storage and applying preprocessing transformations. Compare this time against model computation time to determine whether your training is data-bound or compute-bound. If data loading consumes more than 20% of total training time, consider optimizing your data pipeline with caching, prefetching, or faster storage.
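The checkpoint and data-loading durations above can be captured with a lightweight timing context manager. This is a sketch; the `durations` dict is a hypothetical stand-in for a real metrics sink such as a CloudWatch EMF metric or a Prometheus histogram:

```python
import time
from contextlib import contextmanager
from collections import defaultdict

durations = defaultdict(list)  # stand-in for a real histogram metric

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[stage].append(time.perf_counter() - start)

# Example usage around a checkpoint save
with timed("checkpoint_save_duration_seconds"):
    time.sleep(0.05)  # placeholder for torch.save(...) or an upload to S3

print(durations["checkpoint_save_duration_seconds"][0] >= 0.04)  # True
```

Wrapping each stage (data loading, preprocessing, checkpoint save/restore) with the same context manager makes it easy to compare where training time is actually spent.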

 **Error Rates and Failures** 

Monitor error rates and failures throughout your training pipeline to maintain reliability and prevent wasted compute resources. Undetected errors can cause training jobs to fail silently, produce invalid models, or waste hours of GPU time before you notice problems.
+  **Pipeline Error Monitoring** — Track errors across all stages of your ML pipeline, including data preprocessing, model training, and checkpoint operations. Log error types, frequencies, and affected components to quickly identify issues. Common errors include data format mismatches, out-of-memory failures during preprocessing, and checkpoint save failures due to storage limits. Set up alerts when error rates exceed baseline thresholds so you can investigate before errors cascade.
+  **Recurring Error Analysis** — Identify and investigate patterns in recurring errors to prevent future failures and improve pipeline reliability. Analyze logs to find whether specific data samples, batch sizes, or training configurations consistently cause failures. For example, if certain input data types trigger preprocessing errors, add validation checks earlier in the pipeline or update your data cleaning logic. Track the mean time between failures (MTBF) to measure whether your pipeline reliability improves over time.

 **Kubernetes and EKS Specific Metrics** 

Monitor Kubernetes and EKS metrics to ensure your cluster infrastructure remains healthy and can support your training workloads. These metrics help you detect infrastructure issues before they cause training job failures or performance degradation.
+  **Kubernetes Cluster State Metrics** — Monitor the health and status of Kubernetes objects including pods, nodes, deployments, and services. Track pod status to identify pods stuck in pending, failed, or crash loop states. Monitor node conditions to detect issues like disk pressure, memory pressure, or network unavailability. Use kubectl or monitoring tools to check these metrics continuously and set up alerts when pods fail to start or nodes become unschedulable.
+  **Training Pipeline Execution Metrics** — Track successful and failed pipeline runs, job durations, step completion times, and orchestration errors. Monitor whether training jobs complete within expected time windows and whether failure rates increase over time. Track metrics such as job success rate, average job duration, and time to failure. These metrics help you identify whether infrastructure issues, configuration problems, or data quality issues cause training failures.
+  **AWS Service Metrics** — Track metrics for AWS services that support your EKS infrastructure and training workloads. Monitor S3 metrics such as request latency, error rates, and throughput to ensure data loading performance remains consistent. Track EBS volume metrics including IOPS, throughput, and queue length to detect storage bottlenecks. Monitor VPC flow logs and network metrics to identify connectivity issues between nodes or to external services.
+  **Kubernetes Control Plane Metrics** — Monitor the API server, scheduler, controller manager, and etcd database to detect performance issues or failures that affect cluster operations. Track API server request latency, request rate, and error rate to ensure the control plane responds quickly to scheduling requests. Monitor etcd database size, commit duration, and leader changes to detect stability issues. High API server latency or frequent etcd leader changes can delay pod scheduling and extend training job startup times.

 **Application and Instance Logs** 

Collect and analyze application and instance logs to diagnose issues that metrics alone cannot explain. Logs provide detailed context about errors, state changes, and system events that help you understand why training jobs fail or perform poorly. Correlating logs with metrics allows you to pinpoint root causes faster.
+  **Application Logs** - Collect application logs from your training jobs, data pipelines, and ML frameworks to identify bottlenecks and diagnose failures. These logs capture detailed information about job execution, including data loading errors, model initialization failures, checkpoint save errors, and framework-specific warnings. Correlate log timestamps with metric spikes to understand what caused performance degradation or failures. For example, if GPU utilization drops suddenly, check application logs for errors indicating data pipeline stalls or preprocessing failures. Use centralized logging tools like CloudWatch Logs or Fluent Bit to aggregate logs from all pods and make them searchable.
+  **Instance Logs** - Collect instance-level logs such as system journal logs and dmesg output to detect hardware issues and kernel-level problems. These logs reveal issues like GPU driver errors, memory allocation failures, disk I/O errors, and network interface problems that may not appear in application logs. Correlate instance logs with application logs and metrics to determine whether training failures stem from hardware problems or application issues. For example, if a training job fails with an out-of-memory error, check dmesg logs for kernel OOM killer messages that indicate whether the system ran out of memory or whether the application exceeded its container limits. Set up alerts for critical hardware errors such as GPU XID errors or disk failures so you can replace failing instances before they cause widespread training disruptions.

The following sections show how to collect the metrics described above using two AWS-recommended approaches: [CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html) and [Amazon Managed Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) with [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/getting-started-with-AMG.html). Choose CloudWatch Container Insights if you prefer AWS-native tools with minimal setup and pre-built dashboards. Choose Amazon Managed Prometheus with Amazon Managed Grafana if you need customized dashboards, advanced visualization capabilities, or want to integrate with existing Prometheus-based monitoring infrastructure. For a complete list of available Container Insights metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-enhanced-EKS.html).

### Consider Monitoring Real-time Online Inference Metrics
<a name="_consider_monitoring_real_time_online_inference_metrics"></a>

In real-time systems, low latency is critical for providing timely responses to users or other dependent systems. High latency can degrade user experience or violate performance requirements. Components that influence inference latency include model loading time, pre-processing time, actual prediction time, post-processing time, and network transmission time. We recommend monitoring inference latency to ensure low-latency responses that meet service-level agreements (SLAs), and developing custom metrics for the following. When testing, run under expected load, include network latency, account for concurrent requests, and vary batch sizes.
+  **Time to First Token (TTFT)** — Amount of time from when a user submits a request until they receive the beginning of a response (the first word, token, or chunk). For example, in chatbots, you’d check how long it takes to generate the first piece of output (token) after the user asks a question.
+  **End-to-End Latency** — This is the total time from when a request is received to when the response is sent back. For example, measure time from request to response.
+  **Output Tokens Per Second (TPS)** — Indicates how quickly your model generates new tokens after it starts responding. For example, in chatbots, you’d track generation speed for language models for a baseline text.
+  **Error Rate** — Tracks failed requests, which can indicate performance issues. For example, monitor failed requests for large documents or certain characters.
+  **Throughput** — Measure the number of requests or operations the system can handle per unit of time. For example, track requests per second to handle peak loads.
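For streaming LLM responses, TTFT and output tokens per second can be measured by timing the token iterator. This is a sketch; `fake_stream` is a hypothetical stand-in for your model's streaming API, and the delays are illustrative:

```python
import time

def measure_stream(token_iter):
    """Return (ttft_seconds, output_tokens_per_second) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    # Tokens generated after the first one, divided by the time they took
    tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tps

def fake_stream(n_tokens=5, first_delay=0.05, per_token=0.01):
    time.sleep(first_delay)  # simulates model "prefill" before the first token
    for _ in range(n_tokens):
        yield "token"
        time.sleep(per_token)

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft:.3f}s, throughput: {tps:.1f} tokens/s")
```

The same pattern applies to real serving stacks: wrap the streaming response iterator and export the two numbers as custom metrics.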

K/V (Key/Value) cache can be a powerful optimization technique for inference latency, particularly relevant for transformer-based models. K/V cache stores the key and value tensors from previous transformer layer computations, reducing redundant computations during autoregressive inference, particularly in large language models (LLMs). Monitor the following cache efficiency metrics (specifically for K/V or session cache use):
+  **Cache hit/miss ratio** — For inference setups leveraging caching (K/V or embedding caches), measure how often cache is helping. Low hit rates may indicate suboptimal cache config or workload changes, both of which can increase latency.
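A hit-ratio counter can be as simple as the following sketch; the class and names are illustrative, and in production these counts would typically come from the serving engine's own metrics (for example, vLLM's prefix-cache statistics):

```python
class CacheStats:
    """Track cache hit/miss counts and expose a hit ratio for alerting."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for hit in [True, True, True, False]:
    stats.record(hit)
print(stats.hit_ratio)  # 0.75
```

Alerting when the ratio drops below a baseline catches cache misconfiguration or workload shifts before they show up as latency regressions.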

In subsequent topics, we demonstrate gathering data for a few of the metrics mentioned above. We will provide examples with the two AWS recommended approaches: [AWS-native CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html) and open-source [Amazon Managed Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-getting-started.html) with [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/getting-started-with-AMG.html). You would choose one of these solutions based on your overall observability needs. See [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-enhanced-EKS.html) for the complete list of Container Insights metrics.

### Tracking GPU Memory Usage
<a name="tracking-gpu-memory-usage"></a>

As discussed in the [Consider Monitoring Core Training & Fine-Tuning Metrics](#aiml-consider-monitor-fine-tuning-metrics) topic, GPU memory usage is essential to prevent out-of-memory (OOM) errors and ensure efficient resource utilization. The following examples show how to instrument your training application to expose a custom histogram metric, `gpu_memory_usage_bytes`, and calculate the P95 memory usage to identify peak consumption. Be sure to test with a sample training job (e.g., fine-tuning a transformer model) in a staging environment.

 **AWS-Native CloudWatch Container Insights Example** 

This sample demonstrates how to instrument your training application to expose `gpu_memory_usage_bytes` as a histogram using the AWS-native approach. Note that your AI/ML container must be configured to emit structured logs in CloudWatch [Embedded Metrics Format (EMF)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html). CloudWatch Logs parses the EMF and publishes the metrics. Use [aws_embedded_metrics](https://github.com/awslabs/aws-embedded-metrics-python) in your training application to send structured logs in EMF format to CloudWatch Logs, which extracts the GPU metrics.

```
from aws_embedded_metrics import metric_scope
import torch
import numpy as np

memory_usage = []

@metric_scope
def log_gpu_memory(metrics):
    # Record current GPU memory usage
    mem = torch.cuda.memory_allocated()
    memory_usage.append(mem)

    # Log as histogram metric
    metrics.set_namespace("MLTraining/GPUMemory")
    metrics.put_metric("gpu_memory_usage_bytes", mem, "Bytes")

    # Calculate and log P95 if we have enough data points
    if len(memory_usage) >= 10:
        p95 = np.percentile(memory_usage, 95)
        metrics.put_metric("gpu_memory_p95_bytes", p95, "Bytes")
        print(f"Current memory: {mem} bytes, P95: {p95} bytes")

# Example usage in training loop
for epoch in range(20):
    # Your model training code would go here
    log_gpu_memory()
```

 **Prometheus and Grafana Example** 

This sample demonstrates how to instrument your training application to expose `gpu_memory_usage_bytes` as a histogram using the Prometheus client library in Python.

```
from prometheus_client import Histogram
from prometheus_client import start_http_server
import torch

start_http_server(8080)
memory_usage = Histogram(
    'gpu_memory_usage_bytes',
    'GPU memory usage during training',
    ['gpu_index'],
    buckets=[1e9, 2e9, 4e9, 8e9, 16e9, 32e9]
)

# Function to get GPU memory usage
def get_gpu_memory_usage():
    if torch.cuda.is_available():
        # Get the current GPU device
        device = torch.cuda.current_device()

        # Get memory usage in bytes (allocated + reserved)
        memory_allocated = torch.cuda.memory_allocated(device)
        memory_reserved = torch.cuda.memory_reserved(device)
        total_memory = memory_allocated + memory_reserved

        return device, total_memory
    else:
        return None, 0

# Get GPU memory usage and record it in the histogram
gpu_index, memory_used = get_gpu_memory_usage()
if gpu_index is not None:
    memory_usage.labels(gpu_index=str(gpu_index)).observe(memory_used)
```

### Track Inference Request Duration for Real-Time Online Inference
<a name="track-inference-request-duration-for-real-time-online-inference"></a>

As discussed in the [Consider Monitoring Real-time Online Inference Metrics](#_consider_monitoring_real_time_online_inference_metrics) topic, low latency is critical for providing timely responses to users or other dependent systems. The following examples show how to track a custom histogram metric, `inference_request_duration_seconds`, exposed by your inference application. Calculate the 95th percentile (P95) latency to focus on worst-case scenarios, test with synthetic inference requests (e.g., via Locust) in a staging environment, and set alert thresholds (e.g., >500ms) to detect SLA violations.

 **AWS-Native CloudWatch Container Insights Example** 

This sample demonstrates how to create a custom histogram metric in your inference application for `inference_request_duration_seconds` using AWS CloudWatch Embedded Metric Format.

```
import time
from aws_embedded_metrics import metric_scope

def log_inference_duration(metrics, duration: float):
    # Record the duration on the metrics logger injected by @metric_scope
    metrics.set_namespace("ML/Inference")
    metrics.put_metric("inference_request_duration_seconds", duration, "Seconds")

@metric_scope
def process_inference_request(metrics):
    start_time = time.time()

    # Your inference processing code here
    # For example:
    # result = model.predict(input_data)

    duration = time.time() - start_time
    log_inference_duration(metrics, duration)

    print(f"Inference request processed in {duration} seconds")

# Example usage
process_inference_request()
```

 **Prometheus and Grafana Example** 

This sample demonstrates how to create a custom histogram metric in your inference application for `inference_request_duration_seconds` using the Prometheus client library in Python:

```
from prometheus_client import Histogram
from prometheus_client import start_http_server
import time

start_http_server(8080)
request_duration = Histogram(
    'inference_request_duration_seconds',
    'Inference request latency',
    buckets=[0.1, 0.5, 1, 2, 5]
)
start_time = time.time()
# Process inference request
request_duration.observe(time.time() - start_time)
```

In Grafana, use the query `histogram_quantile(0.95, sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, pod))` to visualize P95 latency trends. To learn more, see [Prometheus Histogram Documentation](https://prometheus.io/docs/practices/histograms/) and [Prometheus Client Documentation](https://github.com/prometheus/client_python).
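To understand what that query computes, `histogram_quantile` linearly interpolates within the bucket that contains the target rank. A simplified sketch of the calculation (ignoring the `+Inf` bucket edge case that Prometheus also handles):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    target = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            # Linear interpolation inside the bucket, as Prometheus does
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
print(histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)]))  # 0.75
```

Because the result is interpolated within a bucket, choose bucket boundaries close to your SLA thresholds so the estimated P95 is accurate where it matters.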

### Track Token Throughput for Real-Time Online Inference
<a name="_track_token_throughput_for_real_time_online_inference"></a>

As discussed in the [Consider Monitoring Real-time Online Inference Metrics](#_consider_monitoring_real_time_online_inference_metrics) topic, we recommend monitoring token processing time to gauge model performance and optimize scaling decisions. The following examples show how to track a custom histogram metric, `token_processing_duration_seconds`, exposed by your inference application. Calculate the 95th percentile (P95) duration to analyze processing efficiency, test with simulated request loads (e.g., 100 to 1000 requests/second) in a non-production cluster, and adjust KEDA triggers to optimize scaling.

 **AWS-Native CloudWatch Container Insights Example** 

This sample demonstrates how to create a custom histogram metric in your inference application for `token_processing_duration_seconds` using AWS CloudWatch Embedded Metric Format. It uses a dimension with a custom `get_duration_bucket` function to categorize durations into buckets (e.g., `<=0.01`, `>1`).

```
import time
from aws_embedded_metrics import metric_scope

def log_token_processing(metrics, duration: float, token_count: int):
    # Record the duration on the metrics logger injected by @metric_scope
    metrics.set_namespace("ML/TokenProcessing")
    metrics.put_metric("token_processing_duration_seconds", duration, "Seconds")
    metrics.put_dimensions({"ProcessingBucket": get_duration_bucket(duration)})
    metrics.set_property("TokenCount", token_count)

def get_duration_bucket(duration):
    buckets = [0.01, 0.05, 0.1, 0.5, 1]
    for bucket in buckets:
        if duration <= bucket:
            return f"<={bucket}"
    return f">{buckets[-1]}"

@metric_scope
def process_tokens(input_text: str, model, tokenizer, metrics):
    tokens = tokenizer.encode(input_text)
    token_count = len(tokens)

    start_time = time.time()
    # Process tokens (replace with your actual processing logic)
    output = model(tokens)
    duration = time.time() - start_time

    log_token_processing(metrics, duration, token_count)
    print(f"Processed {token_count} tokens in {duration} seconds")
    return output
```

 **Prometheus and Grafana Example** 

This sample demonstrates how to create a custom histogram metric in your inference application for `token_processing_duration_seconds` using the Prometheus client library in Python.

```
from prometheus_client import Histogram, start_http_server
import time

# Expose metrics on :8080 for Prometheus to scrape
start_http_server(8080)
token_duration = Histogram(
    'token_processing_duration_seconds',
    'Token processing time per request',
    buckets=[0.01, 0.05, 0.1, 0.5, 1]
)

def process_tokens(input_text, model, tokenizer):
    tokens = tokenizer.encode(input_text)
    start_time = time.time()
    output = model(tokens)  # replace with your actual processing logic
    token_duration.observe(time.time() - start_time)
    return output
```

In Grafana, use the query `histogram_quantile(0.95, sum(rate(token_processing_duration_seconds_bucket[5m])) by (le, pod))` to visualize P95 processing time trends. To learn more, see [Prometheus Histogram Documentation](https://prometheus.io/docs/practices/histograms/) and [Prometheus Client Documentation](https://github.com/prometheus/client_python).

### Measure Checkpoint Restore Latency
<a name="_measure_checkpoint_restore_latency"></a>

As discussed in the [Consider Monitoring Core Training & Fine-Tuning Metrics](#aiml-consider-monitor-fine-tuning-metrics) topic, checkpoint latency is a critical metric during multiple phases of the model lifecycle. The following examples show how to track a custom histogram metric, `checkpoint_restore_duration_seconds`, exposed by your application. Calculate the 95th percentile (P95) duration to monitor restore performance, test with Spot interruptions in a non-production cluster, and set alert thresholds (e.g., <30 seconds) to detect delays.

 **AWS-Native CloudWatch Container Insights Example** 

This sample demonstrates how to instrument your batch application to expose `checkpoint_restore_duration_seconds` as a histogram using the AWS CloudWatch Embedded Metric Format:

```
import time
import torch
from aws_embedded_metrics import metric_scope, MetricsLogger

@metric_scope
def log_checkpoint_restore(metrics: MetricsLogger, duration: float):
    metrics.set_namespace("ML/ModelOperations")
    metrics.put_metric("checkpoint_restore_duration_seconds", duration, "Seconds")
    metrics.set_property("CheckpointSource", "s3://my-bucket/checkpoint.pt")

@metric_scope
def load_checkpoint(model, checkpoint_path: str, metrics: MetricsLogger):
    start_time = time.time()

    # Load model checkpoint
    model.load_state_dict(torch.load(checkpoint_path))

    duration = time.time() - start_time
    log_checkpoint_restore(metrics, duration)

    print(f"Checkpoint restored in {duration} seconds")
```

 **Prometheus and Grafana Example** 

This sample demonstrates how to instrument your batch application to expose `checkpoint_restore_duration_seconds` as a histogram using the Prometheus client library in Python:

```
from prometheus_client import Histogram
from prometheus_client import start_http_server
import torch

start_http_server(8080)
restore_duration = Histogram(
    'checkpoint_restore_duration_seconds',
    'Time to restore checkpoint',
    buckets=[1, 5, 10, 30, 60]
)
# torch.load cannot read s3:// URLs directly; this assumes the checkpoint is
# exposed as a local path (e.g., via the Mountpoint for Amazon S3 CSI driver)
with restore_duration.time():
    model.load_state_dict(torch.load("/mnt/s3/checkpoint.pt"))
```

In Grafana, use the query `histogram_quantile(0.95, sum(rate(checkpoint_restore_duration_seconds_bucket[5m])) by (le))` to visualize P95 restore latency trends. To learn more, see [Prometheus Histogram Documentation](https://prometheus.io/docs/practices/histograms/) and [Prometheus Client Documentation](https://github.com/prometheus/client_python).

# Application Scaling and Performance
<a name="aiml-performance"></a>

**Tip**  
 [Explore](https://aws-experience.com/emea/smb/events/series/get-hands-on-with-amazon-eks?trk=4a9b4147-2490-4c63-bc9f-f8a84b122c8c&sc_channel=el) best practices through Amazon EKS workshops.

## Managing ML Artifacts, Serving Frameworks, and Startup Optimization
<a name="_managing_ml_artifacts_serving_frameworks_and_startup_optimization"></a>

Deploying machine learning (ML) models on Amazon EKS requires thoughtful consideration of how models are integrated into container images and runtime environments. This ensures scalability, reproducibility, and efficient resource utilization. This topic describes approaches to handling ML model artifacts, selecting serving frameworks, and reducing container startup times through techniques like pre-caching.

### Handling ML Model Artifacts in Deployments
<a name="_handling_ml_model_artifacts_in_deployments"></a>

A key decision is how to handle the ML model artifacts (such as weights and configurations) themselves. The choice impacts image size, deployment speed, model update frequency, and operational overhead. Note that when referring to storing the "model", we are referring to the model artifacts (such as trained parameters and model weights). There are different approaches to handling ML model artifacts on Amazon EKS. Each has its trade-offs, and the best one depends on your model’s size, update cadence, and infrastructure needs. Consider the following approaches from least to most recommended:
+  **Baking the model into the container image**: Copy the model files (e.g., .safetensors, .pth, .h5) into the container image (e.g., in the Dockerfile) during the image build, making the model part of the immutable image. We recommend this approach for smaller models with infrequent updates. It ensures consistency and reproducibility, avoids loading delays, and simplifies dependency management. However, it results in larger images that slow builds and pushes, requires rebuilding and redeploying for every model update, and is not ideal for large models due to registry pull throughput limits.
+  **Downloading the model at runtime**: At container startup, the application downloads the model from external storage (e.g., Amazon S3, backed by S3 CRT for optimized high-throughput transfers using methods such as the Mountpoint for Amazon S3 CSI driver, the AWS CLI, or the s5cmd OSS CLI) via scripts in an init container or entrypoint. We recommend starting with this approach for large models with frequent updates. It keeps container images focused on code and runtime, enables easy model updates without rebuilds, and supports versioning via storage metadata. However, it introduces potential network failures (requiring retry logic) and requires authentication and caching.
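As a minimal sketch of the runtime-download approach, the helper below wraps an S3 download in simple exponential-backoff retries. The bucket, key, and `with_retries` helper are illustrative, not part of any AWS SDK:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on any exception."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

def download_model(bucket, key, dest_path):
    """Download a model artifact from S3 with retries (illustrative names)."""
    import boto3  # deferred so the retry helper works even without boto3 installed
    s3 = boto3.client("s3")
    with_retries(lambda: s3.download_file(bucket, key, dest_path))
```

Run such a script from an init container or entrypoint so the main serving container starts only after the artifact is present.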

To learn more, see [Accelerating pull process](https://awslabs.github.io/ai-on-eks/docs/guidance/container-startup-time/accelerate-pull-process) in the AI on EKS Workshop.

### Serving ML Models
<a name="_serving_ml_models"></a>

Deploying and serving machine learning (ML) models on Amazon EKS requires selecting an appropriate model serving approach to optimize for latency, throughput, scalability, and operational simplicity. The choice depends on your model type (e.g., language, vision model), workload demands (e.g., real-time inference), and team expertise. Common approaches include Python-based setups for prototyping, dedicated model servers for production-grade features, and specialized inference engines for high-performance and efficiency. Each method involves trade-offs in setup complexity, performance, and resource utilization. Note that serving frameworks may increase container image sizes (multiple GBs) due to dependencies, potentially impacting startup times—consider decoupling using artifact handling techniques to mitigate this. Options are listed from least to most recommended:

 **Using Python frameworks (e.g., FastAPI, HuggingFace Transformers with PyTorch)**: Develop a custom application using Python frameworks, embedding model files (weights, config, tokenizer) within a containerized node setup.
+  **Pros**: Easy prototyping, Python-only with no extra infrastructure, compatible with all HuggingFace models, simple Kubernetes deployment.
+  **Cons**: Limited to single requests or simple batching, slow token generation (no optimized kernels), memory inefficient, lacks scaling/monitoring, and involves long startup times.
+  **Recommendation**: Use for initial prototyping or single-node tasks requiring custom logic integration.
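For illustration, a prototype serving loop of this kind can be sketched with only the Python standard library; a real prototype would typically use FastAPI, and the `predict` stub below is a hypothetical stand-in for an actual model call:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(text):
    # Hypothetical stand-in for real inference, e.g. a HuggingFace pipeline call
    return {"generated_text": text.upper()}

class InferenceHandler(BaseHTTPRequestHandler):
    """Minimal POST handler; one request at a time, no batching or monitoring."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload.get("inputs", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), InferenceHandler).serve_forever()
```

The single-threaded handler makes the limitation concrete: every request blocks the next, which is why dedicated serving frameworks are preferred beyond prototyping.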

 **Using dedicated model serving frameworks (e.g., TensorRT-LLM, TGI)**: Adopt specialized servers like TensorRT-LLM or TGI for ML inference, managing model loading, routing, and optimization. These support formats like safetensors, with optional compilation or plugins.
+  **Pros**: Offers batching (static/in-flight or continuous), quantization (INT8, FP8, GPTQ), hardware optimizations (NVIDIA, AMD, Intel, Inferentia), and multi-GPU support (Tensor/Pipeline Parallelism). TensorRT-LLM supports diverse models (LLMs, Encoder-Decoder), while TGI leverages HuggingFace integration.
+  **Cons**: TensorRT-LLM needs compilation and is NVIDIA-only; TGI may be less efficient in batching; both add configuration overhead and may not fit all model types (e.g., non-transformers).
+  **Recommendation**: Suitable for PyTorch/TensorFlow models needing production capabilities like A/B testing or high throughput with compatible hardware.

 **Using specialized high-throughput inference engines (e.g., vLLM)**: Utilize advanced inference engines like vLLM, optimizing LLM serving with PagedAttention, in-flight batching, and quantization (INT8, FP8-KV, AWQ), integrable with EKS autoscaling.
+  **Pros**: High throughput and memory efficiency (40-60% VRAM savings), dynamic request handling, token streaming, single-node Tensor Parallel multi-GPU support, and broad hardware compatibility.
+  **Cons**: Optimized for decoder-only transformers (e.g., LLaMA), less effective for non-transformer models, requires compatible hardware (e.g., NVIDIA GPUs) and setup effort.
+  **Recommendation**: Top choice for high-volume, low-latency LLM inference on EKS, maximizing scalability and performance.

## Optimizing container image pull times
<a name="_optimizing_container_image_pull_times"></a>

Large container images can cause cold start delays that impact pod start-up latency. For latency-sensitive workloads, like real-time inference workloads scaled horizontally, quick pod startup is critical. Consider the following approaches to optimize container image pull times:

### Reducing Container Image Sizes
<a name="_reducing_container_image_sizes"></a>

Reducing the size of container images is another way to speed up startup. You can make reductions at every step of the container image build process. To start, choose base images that contain the least number of dependencies required. During image builds, include only the essential libraries and artifacts that are required. When building images, try combining multiple `RUN` or `COPY` commands to create a smaller number of larger layers. For AI/ML frameworks, use multi-stage builds to separate build and runtime, copying only required artifacts (e.g., via `COPY --from=` for registries or local contexts), and select variants like runtime-only images (e.g., `pytorch/pytorch:2.7.1-cuda11.8-cudnn9-runtime` at 3.03 GB vs. devel at 6.66 GB). To learn more, see [Reducing container image size](https://awslabs.github.io/ai-on-eks/docs/guidance/container-startup-time/reduce-container-image-size) in the AI on EKS Workshop.
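As an illustrative sketch (the image tags, paths, and filenames are hypothetical), a multi-stage build of this kind installs dependencies in a devel stage and copies only the runtime artifacts into the smaller runtime-only base:

```dockerfile
# Build stage: full toolchain (illustrative tag)
FROM pytorch/pytorch:2.7.1-cuda11.8-cudnn9-devel AS build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt --target /opt/deps

# Runtime stage: runtime-only base, copying just what inference needs
FROM pytorch/pytorch:2.7.1-cuda11.8-cudnn9-runtime
COPY --from=build /opt/deps /opt/deps
COPY app/ /app/
ENV PYTHONPATH=/opt/deps
CMD ["python", "/app/serve.py"]
```

The build stage's compilers and headers never reach the final image, which keeps the pulled layers close to the runtime base's size.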

### Using SOCI snapshotter to Pre-pull Images
<a name="_using_soci_snapshotter_to_pre_pull_images"></a>

For very large images that you can’t easily minimize, you can use the open source Seekable OCI (SOCI) snapshotter configured in parallel pull and unpack mode. This solution lets you use existing images without rebuilding or modifying your build pipelines. This option is especially effective when deploying workloads with very large images to high performance EC2 compute instances. It works well with high-throughput networking and high performance storage configurations as is typical with scaled AI/ML workloads.

SOCI parallel pull/unpack mode improves end-to-end image pull performance through configurable parallelization strategies. Faster image pulls and preparation directly impact how quickly you can deploy new workloads and scale your cluster efficiently. Image pulls have two main phases:

 **1. Fetching layers from the registry to the node**   
For layer fetch optimization, SOCI creates multiple concurrent HTTP connections per layer, multiplying download throughput beyond the single-connection limitation. It splits large layers into chunks and downloads them simultaneously across multiple connections. This approach helps saturate your available network bandwidth and reduce download times significantly. This is particularly valuable for AI/ML workloads where a single layer can be several gigabytes.

 **2. Unpacking and preparing those layers to create containers**   
For layer unpacking optimization, SOCI processes multiple layers simultaneously. Instead of waiting for each layer to fully unpack before starting the next, it uses your available CPU cores to decompress and extract multiple layers concurrently. This parallel processing transforms the traditionally I/O-bound unpacking phase into a CPU-optimized operation that scales with your available cores. The system carefully orchestrates this parallelization to maintain filesystem consistency while maximizing throughput.

SOCI parallel pull mode uses a dual-threshold control system with configurable parameters for both download concurrency and unpacking parallelism. This granular control lets you fine-tune SOCI’s behavior to meet your specific performance requirements and environment conditions. Understanding these parameters helps you optimize your runtime for the best pull performance.

 **References** 
+ For more information on the solution and tuning tradeoffs, see the [feature documentation](https://github.com/awslabs/soci-snapshotter/blob/main/docs/parallel-mode.md) in the [SOCI project repository](https://github.com/awslabs/soci-snapshotter) on GitHub.
+ For a hands-on example with Karpenter on Amazon EKS, see the [Karpenter Blueprint using SOCI snapshotter parallel pull/unpack mode](https://github.com/aws-samples/karpenter-blueprints/tree/main/blueprints/soci-snapshotter).
+ For information on configuring Bottlerocket for parallel pull, see [soci-snapshotter Parallel Pull Unpack Mode](https://bottlerocket.dev/en/os/1.44.x/api/settings/container-runtime-plugins/#tag-soci-parallel-pull-configuration) in the Bottlerocket Documentation.

### Using EBS Snapshots to Pre-pull Images
<a name="_using_ebs_snapshots_to_pre_pull_images"></a>

You can take an Amazon Elastic Block Store (EBS) snapshot of cached container images and reuse this snapshot for EKS worker nodes. This ensures images are prefetched locally upon node startup, reducing pod initialization time. See [Reduce container startup time on Amazon EKS with Bottlerocket data volume](https://aws.amazon.com/blogs/containers/reduce-container-startup-time-on-amazon-eks-with-bottlerocket-data-volume/) for more information on using this approach with Karpenter, and [EKS Terraform Blueprints for managed node groups](https://aws-ia.github.io/terraform-aws-eks-blueprints/patterns/machine-learning/ml-container-cache/).

To learn more, see [Using containerd snapshotter](https://awslabs.github.io/ai-on-eks/docs/guidance/container-startup-time/accelerate-pull-process/containerd-snapshotter) and [Preload container images into Bottlerocket data volumes with EBS Snapshots](https://awslabs.github.io/ai-on-eks/docs/guidance/container-startup-time/accelerate-pull-process/prefecthing-images-on-br) in the AI on EKS Workshop.

### Using the Container Runtime Cache to Pre-pull Images
<a name="_using_the_container_runtime_cache_to_pre_pull_images"></a>

You can pre-pull container images onto nodes using Kubernetes resources (e.g., DaemonSet or Deployment) to populate the node’s container runtime cache. The container runtime cache is the local storage managed by the container runtime (e.g., [containerd](https://containerd.io/)), where images are stored after being pulled from a registry. Pre-pulling ensures images are available locally, avoiding download delays during pod startup. This approach is particularly useful when images change often (e.g., frequent updates), when EBS snapshots are not preconfigured, when building an EBS volume would be more time-consuming than direct pulling from a container registry, or when nodes are already in the cluster and need to spin up pods on-demand using one of several possible images.
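A pre-pull DaemonSet of this kind can be sketched as follows (the names and image references are illustrative): the init container forces the target image onto every node, and a tiny pause container keeps the pod resident so the DaemonSet stays healthy:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-inference-image   # illustrative name
spec:
  selector:
    matchLabels:
      app: prepull-inference-image
  template:
    metadata:
      labels:
        app: prepull-inference-image
    spec:
      initContainers:
        - name: prepull
          # Illustrative image reference; pulling it is the only work needed
          image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-model:latest
          command: ["sh", "-c", "true"]  # exit immediately once the image is cached
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # minimal container keeps the pod running
```

Repeat an init container per image variant when several images must be cached on every node.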

Pre-pulling all variants ensures fast startup time regardless of which image is needed. For example, in a massively parallel ML workload requiring 100,000 small models built using 10 different techniques, pre-pulling 10 images via DaemonSet across a large cluster (e.g., thousands of nodes) minimizes pod startup time, enabling completion in under 10 seconds by avoiding on-demand pulls. The container runtime cache approach eliminates the need to manage EBS snapshots and, with DaemonSets, ensures you always get the latest container image version. However, for real-time inference workloads where nodes scale in and out, new nodes added by tools like Cluster Autoscaler may schedule workload pods before the pre-pull DaemonSet completes image pulling. This can cause the initial pod on the new node to trigger the pull anyway, potentially delaying startup and impacting low-latency requirements. Additionally, kubelet image garbage collection can affect pre-pulled images by removing unused ones when disk usage exceeds certain thresholds or when they exceed a configured maximum unused age. In scale-in/out patterns, this may evict images on idle nodes, requiring re-pulls during subsequent scale-ups and reducing the reliability of the cache for bursty workloads.

See the [AWS samples GitHub repository](https://github.com/aws-samples/aws-do-eks/tree/main/Container-Root/eks/deployment/prepull) for examples of pre-pulling images into the container runtime cache.

## Consider NVMe for kubelet and containerd storage
<a name="_consider_nvme_for_kubelet_and_containerd_storage"></a>

Consider configuring `kubelet` and `containerd` to use ephemeral NVMe instance storage disks for higher disk performance. The container pull process involves downloading a container image from a registry and decompressing its layers into a usable format. To optimize I/O operations during decompression, you should evaluate what provides higher levels of I/O performance and throughput for your container host’s instance type: [NVMe backed-instances](https://docs.aws.amazon.com/en_us/documentdb/latest/developerguide/db-instance-nvme.html) with local storage vs. EBS Volume IOPS/throughput. For EC2 instances with NVMe local storage, consider configuring the node’s underlying filesystem for kubelet (`/var/lib/kubelet`), containerd (`/var/lib/containerd`) and Pod logs (`/var/log/pods`) to use ephemeral NVMe instance storage disks for higher levels of I/O performance and throughput.

The node’s ephemeral storage can be shared among Pods that request ephemeral storage and container images that are downloaded to the node. If using Karpenter with Bottlerocket or AL2023 EKS Optimized AMIs this can be configured in the [EC2NodeClass](https://karpenter.sh/docs/concepts/nodeclasses/#specinstancestorepolicy) by setting instanceStorePolicy to [RAID0](https://docs.aws.amazon.com/ebs/latest/userguide/raid-config.html) or, if using Managed Node Groups, by setting the localStoragePolicy in [NodeConfig](https://eksctl.io/usage/node-bootstrapping/#configuring-the-bootstrapping-process) as part of user data.
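For Karpenter, the setting described above can be sketched in an `EC2NodeClass` (the resource name and AMI alias are illustrative):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nvme   # illustrative name
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
  # Stripe local NVMe instance-store disks into a RAID0 array used for
  # ephemeral storage (container images, ephemeral volumes, logs)
  instanceStorePolicy: RAID0
```

With `instanceStorePolicy: RAID0`, pods requesting ephemeral storage and image pulls both benefit from the instance store's higher IOPS and throughput, so verify your instance types actually carry NVMe instance-store volumes before relying on it.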