GPU Resource Optimization and Cost Management Node Resiliency and Training Job Management Application Scaling and Performance Dynamic resource allocation for advanced GPU management

Compute and Autoscaling

GPU Resource Optimization and Cost Management

Schedule workloads with GPU requirements using Well-Known labels

For AI/ML workloads sensitive to different GPU characteristics (e.g. GPU, GPU memory) we recommend specifying GPU requirements using known scheduling labels supported by node types used with Karpenter and managed node groups. Failing to define these can result in pods being scheduled on instances with inadequate GPU resources, causing failures or degraded performance. We recommend using nodeSelector or Node affinity to specify which node a pod should run on and setting compute resources (CPU, memory, GPUs etc) in the pod’s resources section.

Example

For example, using GPU name node selector when using Karpenter:


apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: ml-workload
    image: <image>
    resources:
      limits:
        nvidia.com/gpu: 1  # Request one NVIDIA GPU
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: "l40s"  # Run on nodes with NVIDIA L40S GPUs

Use Kubernetes Device Plugin for exposing GPUs

To expose GPUs on nodes, the NVIDIA GPU driver must be installed on the node’s operating system and container runtime configured to allow the Kubernetes scheduler to assign pods to nodes with available GPUs. The setup process for the NVIDIA Kubernetes Device Plugin depends on the EKS Accelerated AMI you are using:

Bottlerocket Accelerated AMI : This AMI includes the NVIDIA GPU driver and the NVIDIA Kubernetes Device Plugin is pre-installed and ready to use, enabling GPU support out of the box. No additional configuration is required to expose GPUs to the Kubernetes scheduler.
AL2023 Accelerated AMI : This AMI includes NVIDIA GPU driver but the NVIDIA Kubernetes Device Plugin is not pre-installed. You must install and configure the device plugin separately, typically via a DaemonSet. Note that if you use eksctl to create your cluster and specify a GPU instance type (e.g., g5.xlarge) in your ClusterConfig, eksctl will automatically select the appropriate AMI and install the NVIDIA Kubernetes Device Plugin. To learn more, see GPU support in eksctl documentation.

If you decide to use the EKS Accelerated AMIs and NVIDIA GPU operator to manage components such as the NVIDIA Kubernetes device plugin instead, take note to disable management of the NVIDIA GPU driver and NVIDIA Container toolkit as per the Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit NVIDIA documentation.

To verify that the NVIDIA Device Plugin is active and GPUs are correctly exposed, run:


kubectl describe node | grep nvidia.com/gpu

This command checks if the nvidia.com/gpu resource is in the node’s capacity and allocatable resources. For example, a node with one GPU should show nvidia.com/gpu: 1. See the Kubernetes GPU Scheduling Guide for more information.

Use many different EC2 instance types

Using as many different EC2 instance types as possible is an important best practice for scalability on Amazon EKS, as outlined in the Kubernetes Data Plane section. This recommendation also applies to instances with accelerated hardware (e.g., GPUs). If you create a cluster that uses only one instance type and try to scale the number of nodes beyond the capacity of the region, you may receive an insufficient capacity error (ICE), indicating that no instances are available. It’s important to understand the unique characteristics of your AI/ML workloads before diversifying arbitrarily. Review the available instance types using the EC2 Instance Type Explorer tool to generate a list of instance types that match your specific compute requirements, and avoid arbitrarily limiting the type of instances that can be used in your cluster.

Accelerated compute instances are offered in different purchase models to fit short term, medium term and steady state workloads. For short term, flexible and fault tolerant workloads, where you’d like to avoid making a reservation, look into Spot instances. Capacity Blocks, On-Demand instances and Saving Plans allow you to provision accelerated compute instances for medium and long term workload duration. To increase the chances of successfully accessing the required capacity in your preferred purchase option, it’s recommended to use a diverse list of instance types and availability zones. Alternatively, if you encounter ICEs for a specific purchase model, retry using a different model.

Example The following example shows how to enable a Karpenter NodePool to provision G and P instances greater than generations 3 (e.g., p3). To learn more, see the EKS Scalability best practices section.


- key: karpenter.k8s.aws/instance-category
  operator: In
  values: ["g", "p"] # Diversifies across G-series and P-series
- key: karpenter.k8s.aws/instance-generation
  operator: Gt
  values: ["3"] # Selects instance generations greater than 3

For details on using Spot instances for GPUs, see "Consider using Amazon EC2 Spot Instances for GPUs with Karpenter" below.

Consider using Amazon EC2 Spot Instances for GPUs with Karpenter

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices. Amazon EC2 Spot Instances can be interrupted with a two-minute notice when EC2 needs the capacity back. For more information, see Spot Instances in the Amazon EC2 User Guide. Amazon EC2 Spot can be a great choice for fault-tolerant, stateless and flexible (time and instance type) workloads. To learn more about when to use Spot instances, see EC2 Spot Instances Best Practices. You can also use Spot Instances for AI/ML workloads if they’re Spot-friendly.

Use cases

Spot-friendly workloads can be big data, containerized workloads, CI/CD, stateless web servers, high performance computing (HPC), and rendering workloads. Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes (e.g., workloads with parallel processes that depend heavily on each other for computation, requiring constant inter-node communication, such as MPI-based high-performance computing applications like computational fluid dynamics or distributed databases with complex interdependencies). Here are the specific use cases we recommend (in no particular order):

Real-time online inference: Use Spot instances for cost-optimized scaling for your real-time inference workloads, as long as your workloads are spot-friendly. In other words, the inference time is either less than two minutes, the application is fault-tolerant to interruptions, and can run on different instance types. Ensure high availability through instance diversity (e.g., across multiple instance types and Availability Zones) or reservations, while implementing application-level fault tolerance to handle potential Spot interruptions.
Hyper-parameter tuning: Use Spot instances to run exploratory tuning jobs opportunistically, as interruptions can be tolerated without significant loss, especially for short-duration experiments.
Data augmentation: Use Spot instances to perform data preprocessing and augmentation tasks that can restart from checkpoints if interrupted, making them ideal for Spot’s variable availability.
Fine-tuning models: Use Spot instances for fine-tuning with robust checkpointing mechanisms to resume from the last saved state, minimizing the impact of instance interruptions.
Batch inference: Use Spot instances to process large batches of offline inference requests in a non-real-time manner, where jobs can be paused and resumed, offering the best alignment with Spot’s cost savings and handling potential interruptions through retries or diversification.
Opportunistic training subsets: Use Spot instances for marginal or experimental training workloads (e.g., smaller models under 10 million parameters), where interruptions are acceptable and efficiency optimizations like diversification across instance types or regions can be applied—though not recommended for production-scale training due to potential disruptions.

Considerations

To use Spot Instances for accelerated workloads on Amazon EKS, there are a number of key considerations (in no particular order):

Use Karpenter to manage Spot instances with advanced consolidation enabled. By specifying karpenter.sh/capacity-type as "spot" in your Karpenter NodePool, Karpenter will provision Spot instances by default without any additional configuration. However, to enable advanced Spot-to-Spot consolidation, which replaces underutilized Spot nodes with lower-priced Spot alternatives, you need to enable the SpotToSpotConsolidation feature gate by setting --feature-gates SpotToSpotConsolidation=true in Karpenter controller arguments or via the FEATURE_GATES environment variable. Karpenter uses the price-capacity-optimized allocation strategy to provision EC2 instances. Based on the NodePool requirements and pod constraints, Karpenter bin-packs unschedulable pods and sends a diverse set of instance types to the Amazon EC2 Fleet API. You can use the EC2 Instance Type Explorer tool to generate a list of instance types that match your specific compute requirements.
Ensure workloads are stateless, fault-tolerance and flexible. Workloads must be stateless, fault-tolerant, and flexible in terms of instance/GPU size. This allows seamless resumption after Spot interruptions, and instance flexibility enables you to potentially stay on Spot for longer. Enable Spot interruption handling in Karpenter by configuring the settings.interruptionQueue Helm value with the name of the AWS SQS queue to catch Spot interruption events. For example, when installing via Helm, use --set "settings.interruptionQueue=${CLUSTER_NAME}". To see an example, see the Getting Started with Karpenter guide. When Karpenter notices a Spot interruption event, it automatically cordons, taints, drains, and terminates the node(s) ahead of the interruption event to maximize the termination grace period of the pods. At the same time, Karpenter will immediately start a new node so it can be ready as soon as possible.
Avoid overly constraining instance type selection. You should avoid constraining instance types as much as possible. By not constraining instance types, there is a higher chance of acquiring Spot capacity at large scales with a lower frequency of Spot Instance interruptions at a lower cost. For example, avoid limiting to specific types (e.g., g5.xlarge). Consider specifying a diverse set of instance categories and generations using keys like karpenter.k8s.aws/instance-category and karpenter.k8s.aws/instance-generation. Karpenter enables easier diversification of on-demand and Spot instance capacity across multiple instance types and Availability Zones (AZs). Moreover, if your AI/ML workload requires specific or limited number of accelerators but is flexible between regions, you can use Spot Placement Score to dynamically identify the optimal region to deploy your workload before launch.
Broaden NodePool requirements to include a larger number of similar EC2 instance families. Every Spot Instance pool consists of an unused EC2 instance capacity for a specific instance type in a specific Availability Zone (AZ). When Karpenter tries to provision a new node, it selects an instance type that matches the NodePool’s requirements. If no compatible instance type has Spot capacity in any AZ, then provisioning fails. To avoid this issue, allow broader g-series instances (generation 4 or higher) from NVIDIA across sizes and Availability Zones (AZs), while considering hardware needs like GPU memory or Ray Tracing. As instances can be of different types, you need to make sure that your workload is able to run on each type, and the performance you get meets your needs.
Leverage all availability zones in a region. Available capacity varies by Availability Zone (AZ), a specific instance type might be unavailable in one AZ but plentiful in another. Each unique combination of an instance type and an Availability Zone constitutes a separate Spot capacity pool. By requesting capacity across all AZs in a region within your Karpenter NodePool requirements, you are effectively searching more pools at once. This maximizes the number of Spot capacity pools and therefore increases the probability of acquiring Spot capacity. To achieve this, in your NodePool configuration, either omit the topology.kubernetes.io/zone key entirely to allow Karpenter to select from all available AZs in the region, or explicitly list AZs using the operator: In and provide the values (e.g., us-west-2a).
Consider using Spot Placement Score (SPS) to get visibility into the likelihood of successfully accessing the required capacity using Spot instances. Spot Placement Score (SPS) is a tool that provides a score to help you assess how likely a Spot request is to succeed. When you use SPS, you first specify your compute requirements for your Spot Instances, and then Amazon EC2 returns the top 10 Regions or Availability Zones (AZs) where your Spot request is likely to succeed. Regions and Availability Zones are scored on a scale from 1 to 10. A score of 10 indicates that your Spot request is highly likely but not guaranteed to succeed. A score of 1 indicates that your Spot request is not likely to succeed at all. The same score might be returned for different Regions or Availability Zones. To learn more, see Guidance for Building a Spot Placement Score Tracker Dashboard on AWS. As Spot capacity fluctuates all the time, SPS will help you to identify which combination of instance types, AZs, and regions work best for your workload constraints (i.e. flexibility, performance, size, etc.). If your AI/ML workload requires specific or a limited number of accelerators but is flexible between regions, you can use Spot placement score to dynamically identify the optimal region to deploy your workload before launch. To help you find out automatically the likelihood of acquiring Spot capacity, we provide a guidance for building an SPS tracker dashboard. This solution monitors SPS scores over time using a YAML configuration for diversified setups (e.g., instance requirements including GPUs), stores metrics in CloudWatch, and provides dashboards to compare configurations. Define dashboards per workload to evaluate vCPU, memory, and GPU needs, ensuring optimal setups for EKS clusters including the consideration of using other AWS Regions. To learn more, see How Spot placement score works.
Gracefully handle Spot interruptions and test. For a pod with a termination period longer than two minutes, the old node will be interrupted prior to those pods being rescheduled, which could impact workload availability. Consider the two-minute Spot interruption notice when designing your applications, implement checkpointing in long-running applications (e.g., saving progress to persistent storage like Amazon S3) to resume after interruptions, extend the terminationGracePeriodSeconds (default is 30 seconds) in Pod specifications to allow more time for graceful shutdown, and handle interruptions using preStop lifecycle hooks and/or SIGTERM signals within your application for graceful shutdown activities like cleanup, state saving, and connection closure. For real-time workloads, where scaling time is important and workloads take longer than two-minutes for the application to be ready to serve traffic, consider optimizing container start-up and ML model loading times by reviewing Storage and Application Scaling and Performance best practices. To test a replacement node, use AWS Fault Injection Service (FIS) to simulate Spot interruptions.

In addition to these core Spot best practices, take these factors into account when managing GPU workloads on Amazon EKS. Unlike CPU-based workloads, GPU workloads are particularly sensitive to hardware details such as GPU capabilities and available GPU memory. GPU workloads might be constrained by the instance types they can use, with fewer options available compared to CPUs. As a first step, assess if your workload is instance flexible. If you don’t know how many instance types your workload can use, test them individually to ensure compatibility and functionality. Identify how flexible you can be to diversify as much as possible, while confirming that diversification keeps the workload working and understanding any performance impacts (e.g., on throughput or completion time). As part of diversifying your workloads, consider the following:

Review CUDA and framework compatibility. Your GPU workloads might be optimized for specific hardware, GPU types (e.g., V100 in p3 vs. A100 in p4), or written for specific CUDA versions for libraries like TensorFlow, so be sure to review compatibility for your workloads. This compatibility is crucial to prevent runtime errors, crashes, failures in GPU acceleration (e.g., mismatched CUDA versions with frameworks like PyTorch or TensorFlow can prevent execution), or the ability to leverage hardware features like FP16/INT8 precision.
GPU Memory. Be sure to evaluate your models' memory requirements and profile your model’s memory usage during runtime using tools like the DCGM Exporter and set the minimum GPU memory required for the instance type in well-known labels like karpenter.k8s.aws/instance-gpu-memory. GPU VRAM varies across instance types (e.g., NVIDIA T4 has 16GB, A10G has 24GB, V100 has 16-32GB), and ML models (e.g., large language models) can exceed available memory, causing out-of-memory (OOM) errors or crashes. For Spot Instances in EKS, this may limit diversification. For instance, you can’t include lower-VRAM types if your model doesn’t fit, which may limit access to capacity pools and increase interruption risk. Note that for single GPU, single node inference (e.g., multiple pods scheduled on the same node to utilize its GPU resources), this might limit diversification, as you can only include instance types with sufficient VRAM in your Spot configuration.
Floating-point precision and performance. Not all Nvidia GPU architectures have the same floating point precision (e.g., FP16/INT8). Evaluate core types (CUDA/Tensor/RT) performance and floating point precision required for your workloads. Running on a lower priced, less performant GPU does not mean it’s better, so consider evaluating performance in terms of work completed within a specific time frame to understand impact of diversification.

Scenario: Diversification for real time inference workloads

For a real-time online inference workload on Spot Instances, you can configure a Karpenter NodePool to diversify across compatible GPU instance families and generations. This approach ensures high availability by drawing from multiple Spot pools, while maintaining performance through constraints on GPU capabilities, memory, and architecture. It supports using alternatives when instance capacity is constrained, minimizing interruptions and optimizing for inference latency. This example NodePool states, use g and p series instances greater than 3, which have more than 20GB GPU memory.

Example


apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-spot
spec:
  template:
    metadata:
      labels:
        role: gpu-spot-worker
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"] # Use Spot Instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"] # Diversifies across G-series and P-series
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"] # Selects instance generations greater than 3
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"] # Specifies AMD64 architecture, compatible with NVIDIA GPUs
        - key: karpenter.k8s.aws/instance-gpu-memory
          operator: Gt
          values: ["20480"] # Ensures more than 20GB (20480 MiB) total GPU memory
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        name: gpu-inference-ec2
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      expireAfter: 720h
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m # Enables consolidation of underutilized nodes after 5 minutes

Implement Checkpointing for Long Running Training Jobs

Checkpointing is a fault-tolerance technique that involves periodically saving the state of a process, allowing it to resume from the last saved point in case of interruptions. In machine learning, it is commonly associated with training, where long-running jobs can save model weights and optimizer states to resume training after failures, such as hardware issues or Spot Instance interruptions.

You use checkpoints to save the state of machine learning (ML) models during training. Checkpoints are snapshots of the model and can be configured by the callback functions of ML frameworks. You can use the saved checkpoints to restart a training job from the last saved checkpoint. Using checkpoints, you save your model snapshots under training due to an unexpected interruption to the training job or instance. This allows you to resume training the model in the future from a checkpoint. In addition to implementing a node resiliency system, we recommend implementing checkpointing to mitigate the impact of interruptions, including those caused by hardware failures or Amazon EC2 Spot Instance interruptions.

Without checkpointing, interruptions can result in wasted compute time and lost progress, which is costly for long-running training jobs. Checkpointing allows jobs to save their state periodically (e.g., model weights and optimizer states) and resume from the last checkpoint (last processed batch) after an interruption. To implement checkpointing, design your application to process data in large batches and save intermediate results to persistent storage, such as an Amazon S3 bucket via the Mountpoint for Amazon S3 CSI Driver while the training job progresses.

Use cases

Checkpointing is particularly beneficial in specific scenarios to balance fault tolerance with performance overhead. Consider using checkpointing in the following cases:

Job duration exceeds a few hours: For long-running training jobs (e.g., >1-2 hours for small models, or days/weeks for large foundation models with billions of parameters), where progress loss from interruptions is costly. Shorter jobs may not justify the I/O overhead.
For Spot instances or hardware failures: In environments prone to interruptions, such as EC2 Spot (2-minute notice) or hardware failures (e.g., GPU memory errors), checkpointing enables quick resumption, making Spot viable for cost savings in fault-tolerant workloads.
Distributed training at scale: For setups with hundreds/thousands of accelerators (e.g., >100 GPUs), where mean time between failures decreases linearly with scale. Use for model/data parallelism to handle concurrent checkpoint access and avoid complete restarts.
Large-scale models with high resource demands: In petabyte-scale LLM training, where failures are inevitable due to cluster size; tiered approaches (fast local every 5-30 minutes for transients, durable hourly for major failures) optimize recovery time vs. efficiency.

Use ML Capacity Blocks for capacity assurance of P and Trainium instances

Capacity Blocks for ML allow you to reserve highly sought-after GPU instances, specifically P instances (e.g., p6-b200, p5, p5e, p5en, p4d, p4de) and Trainium instances (e.g., trn1, trn2), to start either almost immediately or on a future date to support your short duration machine learning (ML) workloads. These reservations are ideal for ensuring capacity for compute-intensive tasks like model training and fine-tuning. EC2 Capacity Blocks pricing consists of a reservation fee and an operating system fee. To learn more about pricing, see EC2 Capacity Blocks for ML pricing.

To reserve GPUs for AI/ML workloads on Amazon EKS for predicable capacity assurance we recommend leveraging ML Capacity Blocks for short-term or On-Demand Capacity Reservations (ODCRs) for general-purpose capacity assurance.

ODCRs allow you to reserve EC2 instance capacity (e.g., GPU instances like g5 or p5) in a specific Availability Zone for a duration, ensuring availability, even during high demand. ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. In EKS, ODCRs are supported by node types like Karpenter and managed node groups. To prioritize ODCRs in Karpenter, configure the NodeClass to use the capacityReservationSelectorTerms field. See the Karpenter NodePools Documentation.
Capacity Blocks are a specialized reservation mechanism for GPU (e.g., p5, p4d) or Trainium (trn1, trn2) instances, designed for short-term ML workloads like model training, fine-tuning, or experimentation. You reserve capacity for a defined period (typically 24 hours to 182 days) starting on a future date, paying only for the reserved time. They are pre-paid, require pre-planning for capacity needs and do not support autoscaling, but they are colocated in EC2 UltraClusters for low-latency networking. They charge only for the reserved period. To learn more, refer to Find and purchase Capacity Blocks, or get started by setting up managed node groups with Capacity Blocks using the instructions in Create a managed node group with Capacity Blocks for ML.

Reserve capacity via the AWS Management Console and configure your nodes to use ML capacity blocks. Plan reservations based on workload schedules and test in a staging cluster. Refer to the Capacity Blocks Documentation for more information.

Consider On-Demand, Amazon EC2 Spot or On-Demand Capacity Reservations (ODCRs) for G Amazon EC2 instances

For G Amazon EC2 Instances consider the different purchase options from On-Demand, Amazon EC2 Spot Instances and On-Demand Capacity Reservations. ODCRs allow you to reserve EC2 instance capacity in a specific Availability Zone for a certain duration, ensuring availability even during high demand. Unlike ML Capacity Blocks, which are only available to P and Trainium instances, ODCRs can be used for a wider range of instance types, including G instances, making them suitable for workloads that require different GPU capabilities, such as inference or graphics. When using Amazon EC2 Spot Instances, being able to diverse across different instance types, sizes, and availability zones is key to being able to stay on Spot for longer.

ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. ODCRs can be created for immediate use or scheduled for a future date, providing flexibility in capacity planning. In Amazon EKS, ODCRs are supported by node types like Karpenter and managed node groups. To prioritize ODCRs in Karpenter, configure the NodeClass to use the capacityReservationSelectorTerms field. See the Karpenter NodePools Documentation. For more information on creating ODCRs, including CLI commands, refer to the On-Demand Capacity Reservation Getting Started.

Consider other accelerated instance types and sizes

Selecting the appropriate accelerated instance and size is essential for optimizing both performance and cost in your ML workloads on Amazon EKS. For example, different GPU instance families have different performance and capabilities such as GPU memory. To help you choose the most price-performant option, review the available GPU instances in the EC2 Instance Types page under Accelerated Computing. Evaluate multiple instance types and sizes to find the best fit for your specific workload requirements. Consider factors such as the number of GPUs, memory, and network performance. By carefully selecting the right GPU instance type and size, you can achieve better resource utilization and cost efficiency in your EKS clusters.

If you use a GPU instance in an EKS node then it will have the nvidia-device-plugin-daemonset pod in the kube-system namespace by default. To get a quick sense of whether you are fully utilizing the GPU(s) in your instance, you can use nvidia-smi as shown here:


kubectl exec nvidia-device-plugin-daemonset-xxxxx \
  -n kube-system -- nvidia-smi \
  --query-gpu=index,power.draw,power.limit,temperature.gpu,utilization.gpu,utilization.memory,memory.free,memory.used \
  --format=csv -l 5

If utilization.memory is close to 100%, then your code(s) are likely memory bound. This means that the GPU (memory) is fully utilized but could suggest that further performance optimization should be investigated.
If the utilization.gpu is close to 100%, this does not necessarily mean the GPU is fully utilized. A better metric to look at is the ratio of power.draw to power.limit. If this ratio is 100% or more, then your code(s) are fully utilizing the compute capacity of the GPU.
The -l 5 flag says to output the metrics every 5 seconds. In the case of a single GPU instance type, the index query flag is not needed.

To learn more, see GPU instances in AWS documentation.

Optimize GPU Resource Allocation with Time-Slicing, MIG, and Fractional GPU Allocation

Static resource limits in Kubernetes (e.g., CPU, memory, GPU counts) can lead to over-provisioning or underutilization, particularly for dynamic AI/ML workloads like inference. Selecting the right GPU is important. For low-volume or spiky workloads, time-slicing allows multiple workloads to share a single GPU by sharing its compute resources, potentially improving efficiency and reducing waste. GPU sharing can be achieved through different options:

Leverage Node Selectors / Node affinity to influence scheduling: Ensure the nodes provisioned and pods are scheduled on the appropriate GPUs for the workload (e.g., karpenter.k8s.aws/instance-gpu-name: "a100")
Time-Slicing: Schedules workloads to share a GPU’s compute resources over time, allowing concurrent execution without physical partitioning. This is ideal for workloads with variable compute demands, but may lack memory isolation.
Multi-Instance GPU (MIG): MIG allows a single NVIDIA GPU to be partitioned into multiple, isolated instances and is supported with NVIDIA Ampere (e.g., A100 GPU), NVIDIA Hopper (e.g., H100 GPU), and NVIDIA Blackwell (e.g., Blackwell GPUs) GPUs. Each MIG instance receives dedicated compute and memory resources, enabling resource sharing in multi-tenant environments or workloads requiring resource guarantees, which allows you to optimize GPU resource utilization, including scenarios like serving multiple models with different batch sizes through time-slicing.
Fractional GPU Allocation: Uses software-based scheduling to allocate portions of a GPU’s compute or memory to workloads, offering flexibility for dynamic workloads. The NVIDIA KAI Scheduler, part of the Run:ai platform, enables this by allowing pods to request fractional GPU resources.

To enable these features in EKS, you can deploy the NVIDIA Device Plugin, which exposes GPUs as schedulable resources and supports time-slicing and MIG. To learn more, see Time-Slicing GPUs in Kubernetes and GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances.

Example

For example, to enable time-slicing with the NVIDIA Device Plugin:


apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 pods to share each GPU

Example

For example, to use KAI Scheduler for fractional GPU allocation, deploy it alongside the NVIDIA GPU Operator and specify fractional GPU resources in the pod spec:


apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod-example
  annotations:
    gpu-fraction: "0.5"  # Annotation for 50% GPU
  labels:
    runai/queue: "default"  # Required queue assignment
spec:
  containers:
  - name: ml-workload
    image: nvcr.io/nvidia/pytorch:25.04-py3
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  schedulerName: kai-scheduler

Node Resiliency and Training Job Management

Implement Node Health Checks with Automated Recovery

For distributed training jobs on Amazon EKS that require frequent inter-node communication, such as multi-GPU model training across multiple nodes, hardware issues like GPU or EFA failures can cause disruptions to training jobs. These disruptions can lead to loss of training progress and increased costs, particularly for long-running AI/ML workloads that rely on stable hardware.

To help add resilience against hardware failures, such as GPU failures in EKS clusters running GPU workloads, we recommend leveraging either the EKS Node Monitoring Agent with Auto Repair or Amazon SageMaker HyperPod. While the EKS Node Monitoring Agent with Auto Repair provides features like node health monitoring and auto-repair using standard Kubernetes mechanisms, SageMaker HyperPod offers targeted resilience and additional features specifically designed for large-scale ML training, such as deep health checks and automatic job resumption.

The EKS Node Monitoring Agent with Node Auto Repair continuously monitors node health by reading logs and applying NodeConditions, including standard conditions like Ready and conditions specific to accelerated hardware to identify issues like GPU or networking failures. When a node is deemed unhealthy, Node Auto Repair cordons it and replaces it with a new node. The rescheduling of pods and restarting of jobs rely on standard Kubernetes mechanisms and the job’s restart policy.
The SageMaker HyperPod deep health checks and health-monitoring agent continuously monitors the health status of GPU and Trainium-based instances. It is tailored for AI/ML workloads, using labels (e.g., node-health-status) to manage node health. When a node is deemed unhealthy, HyperPod triggers automatic replacement of the faulty hardware, such as GPUs. It detects networking-related failures for EFA through its basic health checks by default and supports auto-resume for interrupted training jobs, allowing jobs to continue from the last checkpoint, minimizing disruptions for large-scale ML tasks.

For both EKS Node Monitoring Agent with Auto Repair and SageMaker HyperPod clusters using EFA, to monitor EFA-specific metrics such as Remote Direct Memory Access (RDMA) errors and packet drops, make sure the AWS EFA driver is installed. In addition, we recommend deploying the CloudWatch Observability Add-on or using tools like DCGM Exporter with Prometheus and Grafana to monitor EFA, GPU, and, for SageMaker HyperPod, specific metrics related to its features.

Disable Karpenter Consolidation for interruption sensitive Workloads

For workload sensitive to interruptions, such as processing, large-scale AI/ML prediction tasks or training, we recommend tuning Karpenter consolidation policies to prevent disruptions during job execution. Karpenter’s consolidation feature automatically optimizes cluster costs by terminating underutilized nodes or replacing them with lower-priced alternatives. However, even when a workload fully utilizes a GPU, Karpenter may consolidate nodes if it identifies a lower-priced right-sized instance type that meets the pod’s requirements, leading to job interruptions.

The WhenEmptyOrUnderutilized consolidation policy may terminate nodes prematurely, leading to longer execution times. For example, interruptions may delay job resumption due to pod rescheduling, data reloading, which could be costly for long-running batch inference jobs. To mitigate this, you can set the consolidationPolicy to WhenEmpty and configure a consolidateAfter duration, such as 1 hour, to retain nodes during workload spikes. For example:


disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 60m

This approach improves pod startup latency for spiky batch inference workloads and other interruption-sensitive jobs, such as real-time online inference data processing or model training, where the cost of interruption outweighs compute cost savings. Karpenter NodePool Disruption Budgets is another feature for managing Karpenter disruptions. With budgets, you can make sure that no more than a certain number of nodes nodes will be disrupted in the chosen NodePool at a point in time. You can also use disruption budgets to prevent all nodes from being disrupted at a certain time (e.g. peak hours). To learn more, see Karpenter Consolidation documentation.

Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs

We recommend setting ttlSecondsAfterFinished for Kubernetes jobs in Amazon EKS to automatically delete completed job objects. Lingering job objects consume cluster resources, such as API server memory, and complicate monitoring by cluttering dashboards (e.g., Grafana, Amazon CloudWatch). For example, setting a TTL of 1 hour ensures jobs are removed shortly after completion, keeping your cluster tidy. For more details, refer to Automatic Cleanup for Finished Jobs.

Configure Low-Priority Job Preemption for Higher-Priority Jobs/workloads

For mixed-priority AI/ML workloads on Amazon EKS, you may configure low-priority job preemption to ensure higher-priority tasks (e.g., real-time inference) receive resources promptly. Without preemption, low-priority workloads such as batch processes (e.g., batch inference, data processing), non-batch services (e.g., background tasks, cron jobs), or CPU/memory-intensive jobs (e.g., web services) can delay critical pods by occupying nodes. Preemption allows Kubernetes to evict low-priority pods when high-priority pods need resources, ensuring efficient resource allocation on nodes with GPUs, CPUs, or memory. We recommend using Kubernetes PriorityClass to assign priorities and PodDisruptionBudget to control eviction behavior.


apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
---
spec:
  priorityClassName: low-priority

See the Kubernetes Priority and Preemption Documentation for more information.

Application Scaling and Performance

Tailor Compute Capacity for ML workloads with Karpenter or Static Nodes

To ensure cost-efficient and responsive compute capacity for machine learning (ML) workflows on Amazon EKS, we recommend tailoring your node provisioning strategy to your workload’s characteristics and cost commitments. Below are two approaches to consider: just-in-time scaling with Karpenter and static node groups for reserved capacity.

Just-in-time data plane scalers like Karpenter: For dynamic ML workflows with variable compute demands (e.g., GPU-based inference followed by CPU-based plotting), we recommend using just-in-time data plane scalers like Karpenter.
Use static node groups for predictable workloads: For predictable, steady-state ML workloads or when using Reserved instances, EKS managed node groups can help ensure reserved capacity is fully provisioned and utilized, maximizing savings. This approach is ideal for specific instance types committed via RIs or ODCRs.

Example

This is an example of a diverse Karpenter NodePool that enables launching of g Amazon EC2 instances where instance generation is greater than three.


apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "1000"
    memory: "4000Gi"
    nvidia.com/gpu: "10"  *# Limit the total number of GPUs to 10 for the NodePool*
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60m
    expireAfter: 720h

Example

Example using static node groups for a training workload:


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster
  region: us-west-2
managedNodeGroups:
  - name: gpu-node-group
    instanceType: p4d.24xlarge
    minSize: 2
    maxSize: 2
    desiredCapacity: 2
    taints:
      - key: nvidia.com/gpu
        effect: NoSchedule

Use taints and tolerations to prevent non-accelerated workloads from being scheduled on accelerated instances

Scheduling non accelerated workloads on GPU resources is not compute-efficient, we recommend using taints and toleration to ensure non accelerated workloads pods are not scheduled on inappropriate nodes. See the Kubernetes documentation for more information.

Scale Based on Model Performance

For inference workloads, we recommend using Kubernetes Event-Driven Autoscaling (KEDA) to scale based on model performance metrics like inference requests or token throughput, with appropriate cooldown periods. Static scaling policies may over- or under-provision resources, impacting cost and latency. Learn more in the KEDA Documentation.

Dynamic resource allocation for advanced GPU management

Dynamic resource allocation (DRA) represents a fundamental advancement in Kubernetes GPU resource management. DRA moves beyond traditional device plugin limitations to enable sophisticated GPU sharing, topology awareness, and cross-node resource coordination. Available in Amazon EKS version 1.33, DRA addresses critical challenges in AI/ML workloads by providing the following:

Fine-grained GPU allocation
Advanced sharing mechanisms, such as Multi-Process service (MPS) and Multi-Instance GPU (MIG)
Support for next-generation hardware architectures, including NVIDIA GB200 UltraServers

Traditional GPU allocation treats GPUs as opaque integer resources, creating significant under-utilization (often 30-40% in production clusters). This occurs because workloads receive exclusive access to entire GPUs even when requiring only fractional resources. DRA transforms this model by introducing structured, declarative allocation that provides the Kubernetes scheduler with complete visibility into hardware characteristics and workload requirements. This enables intelligent placement decisions and efficient resource sharing.

Advantages of using DRA instead of NVIDIA device plugin

The NVIDIA device plugin (starting from version 0.12.0) supports GPU sharing mechanisms including time-slicing, MPS, and MIG. However, architectural limitations exist that DRA addresses.

NVIDIA device plugin limitations

Static configuration: GPU sharing configurations (time-slicing replicas and MPS settings) require pre-configuration cluster-wide through ConfigMaps. This makes providing different sharing strategies for different workloads difficult.
Limited granular selection: While the device plugin exposes GPU characteristics through node labels, workloads cannot dynamically request specific GPU configurations (memory size and compute capabilities) as part of the scheduling decision.
No cross-node resource coordination: Cannot manage distributed GPU resources across multiple nodes or express complex topology requirements like NVLink domains for systems like NVIDIA GB200.
Scheduler constraints: The Kubernetes scheduler treats GPU resources as opaque integers, limiting its ability to make topology-aware decisions or handle complex resource dependencies.
Configuration complexity: Setting up different sharing strategies requires multiple ConfigMaps and careful node labeling, creating operational complexity.

Solutions with DRA

Dynamic resource selection: DRA allows workloads to specify detailed requirements (GPU memory, driver versions, and specific attributes) at request time through resourceclaims. This enables more flexible resource matching.
Topology awareness: Through structured parameters and device selectors, DRA handles complex requirements like cross-node GPU communication and memory-coherent interconnects.
Cross-node resource management: computeDomains enable coordination of distributed GPU resources across multiple nodes, critical for systems like GB200 with IMEX channels.
Workload-specific configuration: Each ResourceClaim specifies different sharing strategies and configurations, allowing fine-grained control per workload rather than cluster-wide settings.
Enhanced scheduler integration: DRA provides the scheduler with detailed device information and enables more intelligent placement decisions based on hardware topology and resource characteristics.

Important: DRA does not replace the NVIDIA device plugin entirely. The NVIDIA DRA driver works alongside the device plugin to provide enhanced capabilities. The device plugin continues to handle basic GPU discovery and management, while DRA adds advanced allocation and scheduling features.

Instances supported by DRA and their features

DRA support varies by Amazon EC2 instance family and GPU architecture, as shown in the following table.

Instance family	GPU type	Time-slicing	MIG support	MPS support	IMEX support	Use cases
G5	NVIDIA A10G	Yes	No	Yes	No	Inference and graphics workloads
G6	NVIDIA L4	Yes	No	Yes	No	AI inference and video processing
G6e	NVIDIA L40S	Yes	No	Yes	No	Training, inference, and graphics
P4d/P4de	NVIDIA A100	Yes	Yes	Yes	No	Large-scale training and HPC
P5	NVIDIA H100	Yes	Yes	Yes	No	Foundation model training
P6	NVIDIA B200	Yes	Yes	Yes	No	Billion or trillion-parameter models, distributed training, and inference
P6e	NVIDIA GB200	Yes	Yes	Yes	Yes	Billion or trillion-parameter models, distributed training, and inference

The following are descriptions of each feature in the table:

Time-slicing: Allows multiple workloads to share GPU compute resources over time.
Multi-Instance GPU (MIG): Hardware-level partitioning that creates isolated GPU instances.
Multi-Process service (MPS): Enables concurrent execution of multiple CUDA processes on a single GPU.
Internode Memory Exchange (IMEX): Memory-coherent communication across nodes for GB200 UltraServers.

Additional resources

For more information about Kubernetes DRA and NVIDIA DRA drivers, see the following resources on GitHub:

Set up dynamic resource allocation for advanced GPU management

The following topic shows you how to setup dynamic resource allocation (DRA) for advanced GPU management.

Prerequisites

Before implementing DRA on Amazon EKS, ensure your environment meets the following requirements.

Cluster configuration

Amazon EKS cluster running version 1.33 or later
Amazon EKS managed node groups (DRA is currently supported only by managed node groups with AL2023 and Bottlerocket NVIDIA optimized AMIs, not with Karpenter)
NVIDIA GPU-enabled worker nodes with appropriate instance types

Required components

NVIDIA device plugin version 0.17.1 or later
NVIDIA DRA driver version 25.3.0 or later

Step 1: Create cluster with DRA-enabled node group using eksctl

Create a cluster configuration file named dra-eks-cluster.yaml:


---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dra-eks-cluster
  region: us-west-2
  version: '1.33'

managedNodeGroups:
- name: gpu-dra-nodes
  amiFamily: AmazonLinux2023
  instanceType: g6.12xlarge
  desiredCapacity: 2
  minSize: 1
  maxSize: 3

  labels:
    node-type: "gpu-dra"
    nvidia.com/gpu.present: "true"

  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

Create the cluster:


eksctl create cluster -f dra-eks-cluster.yaml

Step 2: Deploy the NVIDIA device plugin

Deploy the NVIDIA device plugin to enable basic GPU discovery:

Add the NVIDIA device plugin Helm repository:


helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
helm repo update

Create custom values for the device plugin:


cat <<EOF > nvidia-device-plugin-values.yaml
gfd:
  enabled: true
nfd:
  enabled: true
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

Install the NVIDIA device plug-in:


helm install nvidia-device-plugin nvidia/nvidia-device-plugin \
 --namespace nvidia-device-plugin \
 --create-namespace \
 --version v0.17.1 \
 --values nvidia-device-plugin-values.yaml

Step 3: Deploy NVIDIA DRA driver Helm chart

Create a dra-driver-values.yaml values file for the DRA driver:


---
nvidiaDriverRoot: /

gpuResourcesEnabledOverride: true

resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: true  # Enable for GB200 IMEX support

controller:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: In
            values: ["true"]
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Add the NVIDIA NGC Helm repository:


helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install the NVIDIA DRA driver:


helm install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
 --version="25.3.0-rc.2" \
 --namespace nvidia-dra-driver \
 --create-namespace \
 --values dra-driver-values.yaml

Step 4: Verify the DRA installation

Verify that the DRA API resources are available:


kubectl api-resources | grep resource.k8s.io/v1beta1

The following is the expected output:


deviceclasses resource.k8s.io/v1beta1 false DeviceClass
resourceclaims resource.k8s.io/v1beta1 true ResourceClaim
resourceclaimtemplates resource.k8s.io/v1beta1 true ResourceClaimTemplate
resourceslices resource.k8s.io/v1beta1 false ResourceSlice

Check the available device classes:
```
kubectl get deviceclasses
```
The following is an example of expected output:
```
NAME                                        AGE
compute-domain-daemon.nvidia.com            4h39m
compute-domain-default-channel.nvidia.com   4h39m
gpu.nvidia.com                              4h39m
mig.nvidia.com                              4h39m
```
When a newly created G6 GPU instance joins your Amazon EKS cluster with DRA enabled, the following actions occur:
- The NVIDIA DRA driver automatically discovers the A10G GPU and creates two resourceslices on that node.
- The gpu.nvidia.com slice registers the physical A10G GPU device with its specifications (memory, compute capability, and more).
- Since A10G doesn’t support MIG partitioning, the compute-domain.nvidia.com slice creates a single compute domain representing the entire compute context of the GPU.
- These resourceslices are then published to the Kubernetes API server, making the GPU resources available for scheduling through resourceclaims.
  
  The DRA scheduler can now intelligently allocate this GPU to Pods that request GPU resources through resourceclaimtemplates, providing more flexible resource management compared to traditional device plugin approaches. This happens automatically without manual intervention. The node simply becomes available for GPU workloads once the DRA driver completes the resource discovery and registration process.
  
  When you run the following command:
```
kubectl get resourceslices
```
  The following is an example of expected output:
```
NAME                                                          NODE                             DRIVER                       POOL                             AGE
ip-100-64-129-47.ec2.internal-compute-domain.nvidia.com-rwsts ip-100-64-129-47.ec2.internal    compute-domain.nvidia.com    ip-100-64-129-47.ec2.internal    35m
ip-100-64-129-47.ec2.internal-gpu.nvidia.com-6kndg            ip-100-64-129-47.ec2.internal    gpu.nvidia.com               ip-100-64-129-47.ec2.internal    35m
```

Continue to Schedule a simple GPU workload using dynamic resource allocation.

Schedule a simple GPU workload using dynamic resource allocation

To schedule a simple GPU workload using dynamic resource allocation (DRA), do the following steps. Before proceeding, make sure you have followed Set up dynamic resource allocation for advanced GPU management.

Create a basic ResourceClaimTemplate for GPU allocation with a file named basic-gpu-claim-template.yaml:


---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com

Apply the template:


kubectl apply -f basic-gpu-claim-template.yaml

Verify the status:


kubectl get resourceclaimtemplates -n gpu-test1

The following is example output:


NAME         AGE
single-gpu   9m16s

Create a Pod that uses the ResourceClaimTemplate with a file named basic-gpu-pod.yaml:


---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: gpu-pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu0
  resourceClaims:
  - name: gpu0
    resourceClaimTemplateName: single-gpu
  nodeSelector:
    NodeGroupType: gpu-dra
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

Apply and monitor the Pod:
```
kubectl apply -f basic-gpu-pod.yaml
```

Check the Pod status:


kubectl get pod -n gpu-test1

The following is example expected output:


NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   1/1     Running   0          13m

Check the ResourceClaim status:


kubectl get resourceclaims -n gpu-test1

The following is example expected output:


NAME                 STATE                AGE
gpu-pod-gpu0-l76cg   allocated,reserved   9m6s

View Pod logs to see GPU information:


kubectl logs gpu-pod -n gpu-test1

The following is example expected output:


GPU 0: NVIDIA L4 (UUID: GPU-da7c24d7-c7e3-ed3b-418c-bcecc32af7c5)

Continue to GPU optimization techniques with dynamic resource allocation for more advanced GPU optimization techniques using DRA.

GPU optimization techniques with dynamic resource allocation

Modern GPU workloads require sophisticated resource management to achieve optimal utilization and cost efficiency. DRA enables several advanced optimization techniques that address different use cases and hardware capabilities:

Time-slicing allows multiple workloads to share GPU compute resources over time, making it ideal for inference workloads with sporadic GPU usage. For an example, see Optimize GPU workloads with time-slicing.
Multi-Process service (MPS) enables concurrent execution of multiple CUDA processes on a single GPU with better isolation than time-slicing. For an example, see Optimize GPU workloads with MPS.
Multi-Instance GPU (MIG) provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources. For an example, see Optimize GPU workloads with Multi-Instance GPU.
Internode Memory Exchange (IMEX) enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 systems. For an example, see Optimize GPU workloads with IMEX using GB200 P6e instances.

These techniques can significantly improve resource utilization. Organizations report GPU utilization increases from 30-40% with traditional allocation to 80-90% with optimized sharing strategies. The choice of technique depends on workload characteristics, isolation requirements, and hardware capabilities.

Optimize GPU workloads with time-slicing

Time-slicing enables multiple workloads to share GPU compute resources by scheduling them to run sequentially on the same physical GPU. It is ideal for inference workloads with sporadic GPU usage.

Do the following steps.

Define a ResourceClaimTemplate for time-slicing with a file named timeslicing-claim-template.yaml:


---
apiVersion: v1
kind: Namespace
metadata:
  name: timeslicing-gpu

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: timeslicing-gpu-template
  namespace: timeslicing-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing

Define a Pod using time-slicing with a file named timeslicing-pod.yaml:


---
# Pod 1 - Inference workload
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-1
  namespace: timeslicing-gpu
  labels:
    app: gpu-inference
spec:
  restartPolicy: Never
  containers:
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "-c"]
    args:
    - |
      import torch
      import time
      import os
      print(f"=== POD 1 STARTING ===")
      print(f"GPU available: {torch.cuda.is_available()}")
      print(f"GPU count: {torch.cuda.device_count()}")
      if torch.cuda.is_available():
          device = torch.cuda.current_device()
          print(f"Current GPU: {torch.cuda.get_device_name(device)}")
          print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
          # Simulate inference workload
          for i in range(20):
              x = torch.randn(1000, 1000).cuda()
              y = torch.mm(x, x.t())
              print(f"Pod 1 - Iteration {i+1} completed at {time.strftime('%H:%M:%S')}")
              time.sleep(60)
      else:
          print("No GPU available!")
          time.sleep(5)
    resources:
      claims:
      - name: shared-gpu-claim
  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: "gpu-dra"
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule


---
# Pod 2 - Training workload
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-2
  namespace: timeslicing-gpu
  labels:
    app: gpu-training
spec:
  restartPolicy: Never
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "-c"]
    args:
    - |
      import torch
      import time
      import os
      print(f"=== POD 2 STARTING ===")
      print(f"GPU available: {torch.cuda.is_available()}")
      print(f"GPU count: {torch.cuda.device_count()}")
      if torch.cuda.is_available():
          device = torch.cuda.current_device()
          print(f"Current GPU: {torch.cuda.get_device_name(device)}")
          print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")
          # Simulate training workload with heavier compute
          for i in range(15):
              x = torch.randn(2000, 2000).cuda()
              y = torch.mm(x, x.t())
              loss = torch.sum(y)
              print(f"Pod 2 - Training step {i+1}, Loss: {loss.item():.2f} at {time.strftime('%H:%M:%S')}")
              time.sleep(5)
      else:
          print("No GPU available!")
          time.sleep(60)
    resources:
      claims:
      - name: shared-gpu-claim-2
  resourceClaims:
  - name: shared-gpu-claim-2
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: "gpu-dra"
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Apply the template and Pod:


kubectl apply -f timeslicing-claim-template.yaml
kubectl apply -f timeslicing-pod.yaml

Monitor resource claims:


kubectl get resourceclaims -n timeslicing-gpu -w

The following is example output:


NAME                                      STATE                AGE
inference-pod-1-shared-gpu-claim-9p97x    allocated,reserved   21s
training-pod-2-shared-gpu-claim-2-qghnb   pending              21s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s
training-pod-2-shared-gpu-claim-2-qghnb   pending              105s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s
training-pod-2-shared-gpu-claim-2-qghnb   allocated,reserved   105s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s

First Pod (inference-pod-1)

State: allocated,reserved
Meaning: DRA found an available GPU and reserved it for this Pod
Pod status: Starts running immediately

Second Pod (training-pod-2)

State: pending
Meaning: Waiting for DRA to configure time-slicing on the same GPU
Pod status: Waiting to be scheduled
The state will go from pending to allocated,reserved to running

Optimize GPU workloads with MPS

Multi-Process Service (MPS) enables concurrent execution of multiple CUDA contexts on a single GPU with better isolation than time-slicing.

Do the following steps.

Define a ResourceClaimTemplate for MPS with a file named mps-claim-template.yaml:


---
apiVersion: v1
kind: Namespace
metadata:
  name: mps-gpu

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mps-gpu-template
  namespace: mps-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS

Define a Pod using MPS with a file named mps-pod.yaml:


---
# Single Pod with Multiple Containers sharing GPU via MPS
apiVersion: v1
kind: Pod
metadata:
  name: mps-multi-container-pod
  namespace: mps-gpu
  labels:
    app: mps-demo
spec:
  restartPolicy: Never
  containers:
  # Container 1 - Inference workload
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "-c"]
    args:
    - |
      import torch
      import torch.nn as nn
      import time
      import os

      print(f"=== INFERENCE CONTAINER STARTING ===")
      print(f"Process ID: {os.getpid()}")
      print(f"GPU available: {torch.cuda.is_available()}")
      print(f"GPU count: {torch.cuda.device_count()}")

      if torch.cuda.is_available():
          device = torch.cuda.current_device()
          print(f"Current GPU: {torch.cuda.get_device_name(device)}")
          print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

          # Create inference model
          model = nn.Sequential(
              nn.Linear(1000, 500),
              nn.ReLU(),
              nn.Linear(500, 100)
          ).cuda()

          # Run inference
          for i in range(1, 999999):
              with torch.no_grad():
                  x = torch.randn(128, 1000).cuda()
                  output = model(x)
                  result = torch.sum(output)
                  print(f"Inference Container PID {os.getpid()}: Batch {i}, Result: {result.item():.2f} at {time.strftime('%H:%M:%S')}")
              time.sleep(2)
      else:
          print("No GPU available!")
          time.sleep(60)
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu

  # Container 2 - Training workload
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "-c"]
    args:
    - |
      import torch
      import torch.nn as nn
      import time
      import os

      print(f"=== TRAINING CONTAINER STARTING ===")
      print(f"Process ID: {os.getpid()}")
      print(f"GPU available: {torch.cuda.is_available()}")
      print(f"GPU count: {torch.cuda.device_count()}")

      if torch.cuda.is_available():
          device = torch.cuda.current_device()
          print(f"Current GPU: {torch.cuda.get_device_name(device)}")
          print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

          # Create training model
          model = nn.Sequential(
              nn.Linear(2000, 1000),
              nn.ReLU(),
              nn.Linear(1000, 500),
              nn.ReLU(),
              nn.Linear(500, 10)
          ).cuda()

          criterion = nn.MSELoss()
          optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

          # Run training
          for epoch in range(1, 999999):
              x = torch.randn(64, 2000).cuda()
              target = torch.randn(64, 10).cuda()

              optimizer.zero_grad()
              output = model(x)
              loss = criterion(output, target)
              loss.backward()
              optimizer.step()

              print(f"Training Container PID {os.getpid()}: Epoch {epoch}, Loss: {loss.item():.4f} at {time.strftime('%H:%M:%S')}")
              time.sleep(3)
      else:
          print("No GPU available!")
          time.sleep(60)
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu

  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: mps-gpu-template

  nodeSelector:
    NodeGroupType: "gpu-dra"
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Apply the template and create multiple MPS Pods:


kubectl apply -f mps-claim-template.yaml
kubectl apply -f mps-pod.yaml

Monitor the resource claims:


kubectl get resourceclaims -n mps-gpu -w

The following is example output:


NAME                                             STATE                AGE
mps-multi-container-pod-shared-gpu-claim-2p9kx   allocated,reserved   86s

This configuration demonstrates true GPU sharing using NVIDIA Multi-Process Service (MPS) through dynamic resource allocation (DRA). Unlike time-slicing where workloads take turns using the GPU sequentially, MPS enables both containers to run simultaneously on the same physical GPU. The key insight is that DRA MPS sharing requires multiple containers within a single Pod, not multiple separate Pods. When deployed, the DRA driver allocates one ResourceClaim to the Pod and automatically configures MPS to allow both the inference and training containers to execute concurrently.

Each container gets its own isolated GPU memory space and compute resources, with the MPS daemon coordinating access to the underlying hardware. You can verify this is working by doing the following:

Checking nvidia-smi, which will show both containers as M+C (MPS + Compute) processes sharing the same GPU device.
Monitoring the logs from both containers, which will display interleaved timestamps proving simultaneous execution.

This approach maximizes GPU utilization by allowing complementary workloads to share the expensive GPU hardware efficiently, rather than leaving it underutilized by a single process.

Container1: `inference-container`


root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:09:30 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   48C    P0             28W /   72W |     597MiB /  23034MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1    M+C   python                                  246MiB |
+-----------------------------------------------------------------------------------------+

Container2: `training-container`


root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:16:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:35:00.0 Off |                    0 |
| N/A   51C    P0             28W /   72W |     597MiB /  23034MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1    M+C   python                                  314MiB |
+-----------------------------------------------------------------------------------------+

Optimize GPU workloads with Multi-Instance GPU

Multi-instance GPU (MIG) provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources.

Using dynamic MIG partitioning with various profiles requires the NVIDIA GPU Operator. The NVIDIA GPU Operator uses MIG Manager to create MIG profiles and reboots the GPU instances like P4D, P4De, P5, P6, and more to apply the configuration changes. The GPU Operator includes comprehensive MIG management capabilities through the MIG Manager component, which watches for node label changes and automatically applies the appropriate MIG configuration. When a MIG profile change is requested, the operator gracefully shuts down all GPU clients, applies the new partition geometry, and restarts the affected services. This process requires a node reboot for GPU instances to ensure clean GPU state transitions. This is why enabling WITH–0—REBOOT=true in the MIG Manager configuration is essential for successful MIG deployments.

You need both NVIDIA DRA Driver and NVIDIA GPU Operator to work with MIG in Amazon EKS. You don’t need NVIDIA Device Plugin and DCGM Exporter in addition to this as these are part of the NVIDIA GPU Operator. Since the EKS NVIDIA AMIs come with the NVIDIA Drivers pre-installed, we disabled the deployment of drivers by the GPU Operator to avoid conflicts and leverage the optimized drivers already present on the instances. The NVIDIA DRA Driver handles dynamic resource allocation for MIG instances, while the GPU Operator manages the entire GPU lifecycle. This includes MIG configuration, device plugin functionality, monitoring through DCGM, and node feature discovery. This integrated approach provides a complete solution for enterprise GPU management, with hardware-level isolation and dynamic resource allocation capabilities.

Step 1: Deploy NVIDIA GPU Operator

Add the NVIDIA GPU Operator repository:


helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

Create a gpu-operator-values.yaml file:


driver:
  enabled: false

mig:
  strategy: mixed

migManager:
  enabled: true
  env:
    - name: WITH_REBOOT
      value: "true"
  config:
    create: true
    name: custom-mig-parted-configs
    default: "all-disabled"
    data:
      config.yaml: |-
        version: v1
        mig-configs:
          all-disabled:
            - devices: all
              mig-enabled: false

          # P4D profiles (A100 40GB)
          p4d-half-balanced:
            - devices: [0, 1, 2, 3]
              mig-enabled: true
              mig-devices:
                "1g.5gb": 2
                "2g.10gb": 1
                "3g.20gb": 1
            - devices: [4, 5, 6, 7]
              mig-enabled: false

          # P4DE profiles (A100 80GB)
          p4de-half-balanced:
            - devices: [0, 1, 2, 3]
              mig-enabled: true
              mig-devices:
                "1g.10gb": 2
                "2g.20gb": 1
                "3g.40gb": 1
            - devices: [4, 5, 6, 7]
              mig-enabled: false

devicePlugin:
  enabled: true
  config:
    name: ""
    create: false
    default: ""

toolkit:
  enabled: true

nfd:
  enabled: true

gfd:
  enabled: true

dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s
    honorLabels: false
    additionalLabels:
      release: kube-prometheus-stack

nodeStatusExporter:
  enabled: false

operator:
  defaultRuntime: containerd
  runtimeClass: nvidia
  resources:
    limits:
      cpu: 500m
      memory: 350Mi
    requests:
      cpu: 200m
      memory: 100Mi

daemonsets:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  nodeSelector:
    accelerator: nvidia
  priorityClassName: system-node-critical

Install GPU Operator using the gpu-operator-values.yaml file:
```
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v25.3.1 \
  --values gpu-operator-values.yaml
```
This Helm chart deploys the following components and multiple MIG profiles:
- Device Plugin (GPU resource scheduling)
- DCGM Exporter (GPU metrics and monitoring)
- Node Feature Discovery (NFD - hardware labeling)
- GPU Feature Discovery (GFD - GPU-specific labeling)
- MIG Manager (Multi-instance GPU partitioning)
- Container Toolkit (GPU container runtime)
- Operator Controller (lifecycle management)

Verify the deployment Pods:


kubectl get pods -n gpu-operator

The following is example output:


NAME                                                              READY   STATUS      RESTARTS        AGE
gpu-feature-discovery-27rdq                                       1/1     Running     0               3h31m
gpu-operator-555774698d-48brn                                     1/1     Running     0               4h8m
nvidia-container-toolkit-daemonset-sxmh9                          1/1     Running     1 (3h32m ago)   4h1m
nvidia-cuda-validator-qb77g                                       0/1     Completed   0               3h31m
nvidia-dcgm-exporter-cvzd7                                        1/1     Running     0               3h31m
nvidia-device-plugin-daemonset-5ljm5                              1/1     Running     0               3h31m
nvidia-gpu-operator-node-feature-discovery-gc-67f66fc557-q5wkt    1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-master-5d8ffddcsl6s6   1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-worker-6t4w7           1/1     Running     1 (3h32m ago)   4h1m
nvidia-gpu-operator-node-feature-discovery-worker-9w7g8           1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-worker-k5fgs           1/1     Running     0               4h8m
nvidia-mig-manager-zvf54                                          1/1     Running     1 (3h32m ago)   3h35m

Create an Amazon EKS cluster with a p4De managed node group for testing the MIG examples:


apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dra-eks-cluster
  region: us-east-1
  version: '1.33'

managedNodeGroups:
# P4DE MIG Node Group with Capacity Block Reservation
- name: p4de-mig-nodes
  amiFamily: AmazonLinux2023
  instanceType: p4de.24xlarge

  # Capacity settings
  desiredCapacity: 0
  minSize: 0
  maxSize: 1

  # Use specific subnet in us-east-1b for capacity reservation
  subnets:
    - us-east-1b

  # AL2023 NodeConfig for RAID0 local storage only
  nodeadmConfig:
    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      instance:
        localStorage:
          strategy: RAID0

  # Node labels for MIG configuration
  labels:
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: "A100-SXM4-80GB"
    nvidia.com/mig.config: "p4de-half-balanced"
    node-type: "p4de"
    vpc.amazonaws.com/efa.present: "true"
    accelerator: "nvidia"

  # Node taints
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule

  # EFA support
  efaEnabled: true

  # Placement group for high-performance networking
  placementGroup:
    groupName: p4de-placement-group
    strategy: cluster

  # Capacity Block Reservation (CBR)
  # Ensure CBR ID matches the subnet AZ with the Nodegroup subnet
  spot: false
  capacityReservation:
    capacityReservationTarget:
      capacityReservationId: "cr-abcdefghij"  # Replace with your capacity reservation ID

NVIDIA GPU Operator uses the label added to nodes nvidia.com/mig.config: "p4de-half-balanced" and partitions the GPU with the given profile.

Run the following command:


nvidia-smi -L

You should see the following example output:


[root@ip-100-64-173-145 bin]# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-ab52e33c-be48-38f2-119e-b62b9935925a)
  MIG 3g.40gb     Device  0: (UUID: MIG-da972af8-a20a-5f51-849f-bc0439f7970e)
  MIG 2g.20gb     Device  1: (UUID: MIG-7f9768b7-11a6-5de9-a8aa-e9c424400da4)
  MIG 1g.10gb     Device  2: (UUID: MIG-498adad6-6cf7-53af-9d1a-10cfd1fa53b2)
  MIG 1g.10gb     Device  3: (UUID: MIG-3f55ef65-1991-571a-ac50-0dbf50d80c5a)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-0eabeccc-7498-c282-0ac7-d3c09f6af0c8)
  MIG 3g.40gb     Device  0: (UUID: MIG-80543849-ea3b-595b-b162-847568fe6e0e)
  MIG 2g.20gb     Device  1: (UUID: MIG-3af1958f-fac4-59f1-8477-9f8d08c55029)
  MIG 1g.10gb     Device  2: (UUID: MIG-401088d2-716f-527b-a970-b1fc7a4ac6b2)
  MIG 1g.10gb     Device  3: (UUID: MIG-8c56c75e-5141-501c-8f43-8cf22f422569)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-1c7a1289-243f-7872-a35c-1d2d8af22dd0)
  MIG 3g.40gb     Device  0: (UUID: MIG-e9b44486-09fc-591a-b904-0d378caf2276)
  MIG 2g.20gb     Device  1: (UUID: MIG-ded93941-9f64-56a3-a9b1-a129c6edf6e4)
  MIG 1g.10gb     Device  2: (UUID: MIG-6c317d83-a078-5c25-9fa3-c8308b379aa1)
  MIG 1g.10gb     Device  3: (UUID: MIG-2b070d39-d4e9-5b11-bda6-e903372e3d08)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-9a6250e2-5c59-10b7-2da8-b61d8a937233)
  MIG 3g.40gb     Device  0: (UUID: MIG-20e3cd87-7a57-5f1b-82e7-97b14ab1a5aa)
  MIG 2g.20gb     Device  1: (UUID: MIG-04430354-1575-5b42-95f4-bda6901f1ace)
  MIG 1g.10gb     Device  2: (UUID: MIG-d62ec8b6-e097-5e99-a60c-abf8eb906f91)
  MIG 1g.10gb     Device  3: (UUID: MIG-fce20069-2baa-5dd4-988a-cead08348ada)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-5d09daf0-c2eb-75fd-3919-7ad8fafa5f86)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-99194e04-ab2a-b519-4793-81cb2e8e9179)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-c1a1910f-465a-e16f-5af1-c6aafe499cd6)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-c2cfafbc-fd6e-2679-e955-2a9e09377f78)

NVIDIA GPU Operator has successfully applied the p4de-half-balanced MIG profile to your P4DE instance, creating hardware-level GPU partitions as configured. Here’s how the partitioning works:

The GPU Operator applied this configuration from your embedded MIG profile:


p4de-half-balanced:
  - devices: [0, 1, 2, 3]        # First 4 GPUs: MIG enabled
    mig-enabled: true
    mig-devices:
      "1g.10gb": 2               # 2x small instances (10GB each)
      "2g.20gb": 1               # 1x medium instance (20GB)
      "3g.40gb": 1               # 1x large instance (40GB)
  - devices: [4, 5, 6, 7]        # Last 4 GPUs: Full GPUs
    mig-enabled: false

From your nvidia-smi -L output, here’s what the GPU Operator created:

MIG-enabled GPUs (0-3): hardware partitioned
- GPU 0: NVIDIA A100-SXM4-80GB
  - MIG 3g.40gb Device 0 – Large workloads (40GB memory, 42 SMs)
  - MIG 2g.20gb Device 1 – Medium workloads (20GB memory, 28 SMs)
  - MIG 1g.10gb Device 2 – Small workloads (10GB memory, 14 SMs)
  - MIG 1g.10gb Device 3 – Small workloads (10GB memory, 14 SMs)
- GPU 1: NVIDIA A100-SXM4-80GB
  - MIG 3g.40gb Device 0 – Identical partition layout
  - MIG 2g.20gb Device 1
  - MIG 1g.10gb Device 2
  - MIG 1g.10gb Device 3
- GPU 2 and GPU 3 – Same pattern as GPU 0 and GPU 1
Full GPUs (4-7): No MIG partitioning
- GPU 4: NVIDIA A100-SXM4-80GB – Full 80GB GPU
- GPU 5: NVIDIA A100-SXM4-80GB – Full 80GB GPU
- GPU 6: NVIDIA A100-SXM4-80GB – Full 80GB GPU
- GPU 7: NVIDIA A100-SXM4-80GB – Full 80GB GPU

Once the NVIDIA GPU Operator creates the MIG partitions, the NVIDIA DRA Driver automatically detects these hardware-isolated instances and makes them available for dynamic resource allocation in Kubernetes. The DRA driver discovers each MIG instance with its specific profile (1g.10gb, 2g.20gb, 3g.40gb) and exposes them as schedulable resources through the mig.nvidia.com device class.

The DRA driver continuously monitors the MIG topology and maintains an inventory of available instances across all GPUs. When a Pod requests a specific MIG profile through a ResourceClaimTemplate, the DRA driver intelligently selects an appropriate MIG instance from any available GPU, enabling true hardware-level multi-tenancy. This dynamic allocation allows multiple isolated workloads to run simultaneously on the same physical GPU while maintaining strict resource boundaries and performance guarantees.

Step 2: Test MIG resource allocation

Now let’s run some examples to demonstrate how DRA dynamically allocates MIG instances to different workloads. Deploy the resourceclaimtemplates and test pods to see how the DRA driver places workloads across the available MIG partitions, allowing multiple containers to share GPU resources with hardware-level isolation.

Create mig-claim-template.yaml to contain the MIG resourceclaimtemplates:


apiVersion: v1
kind: Namespace
metadata:
  name: mig-gpu

---
# Template for 3g.40gb MIG instance (Large training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-large-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-large
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '3g.40gb'

---
# Template for 2g.20gb MIG instance (Medium training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-medium-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-medium
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '2g.20gb'

---
# Template for 1g.10gb MIG instance (Small inference)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-small-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-small
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '1g.10gb'

Apply the three templates:


kubectl apply -f mig-claim-template.yaml

Run the following command:


kubectl get resourceclaimtemplates -n mig-gpu

The following is example output:


NAME                  AGE
mig-large-template    71m
mig-medium-template   71m
mig-small-template    71m

Create mig-pod.yaml to schedule multiple jobs to leverage this resourceclaimtemplates:


---
# ConfigMap containing Python scripts for MIG pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-scripts-configmap
  namespace: mig-gpu
data:
  large-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== LARGE TRAINING POD (3g.40gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Large model for 3g.40gb instance
        model = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Large batch for 3g.40gb
            x = torch.randn(256, 2048).cuda()
            y = torch.randint(0, 10, (256,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Large Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(3)

        print("Large training completed on 3g.40gb MIG instance")

  medium-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== MEDIUM TRAINING POD (2g.20gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Medium model for 2g.20gb instance
        model = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Medium batch for 2g.20gb
            x = torch.randn(128, 1024).cuda()
            y = torch.randint(0, 10, (128,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Medium Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(4)

        print("Medium training completed on 2g.20gb MIG instance")

  small-inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== SMALL INFERENCE POD (1g.10gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Small model for 1g.10gb instance
        model = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Inference loop
        for i in range(200):
            with torch.no_grad():
                # Small batch for 1g.10gb
                x = torch.randn(32, 512).cuda()
                output = model(x)
                prediction = torch.argmax(output, dim=1)

                if i % 20 == 0:
                    print(f"Small Inference - Batch {i}, Predictions: {prediction[:5].tolist()}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(2)

        print("Small inference completed on 1g.10gb MIG instance")

---
# Pod 1: Large training workload (3g.40gb)
apiVersion: v1
kind: Pod
metadata:
  name: mig-large-training-pod
  namespace: mig-gpu
  labels:
    app: mig-large-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: large-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/large-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-large-claim
  resourceClaims:
  - name: mig-large-claim
    resourceClaimTemplateName: mig-large-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755

---
# Pod 2: Medium training workload (2g.20gb) - can run on SAME GPU as Pod 1
apiVersion: v1
kind: Pod
metadata:
  name: mig-medium-training-pod
  namespace: mig-gpu
  labels:
    app: mig-medium-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: medium-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/medium-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-medium-claim
  resourceClaims:
  - name: mig-medium-claim
    resourceClaimTemplateName: mig-medium-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755

---
# Pod 3: Small inference workload (1g.10gb) - can run on SAME GPU as Pod 1 & 2
apiVersion: v1
kind: Pod
metadata:
  name: mig-small-inference-pod
  namespace: mig-gpu
  labels:
    app: mig-small-inference
    workload-type: inference
spec:
  restartPolicy: Never
  containers:
  - name: small-inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/small-inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-small-claim
  resourceClaims:
  - name: mig-small-claim
    resourceClaimTemplateName: mig-small-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755

Apply this spec, which should deploy three Pods:
```
kubctl apply -f mig-pod.yaml
```
These Pods should be scheduled by the DRA driver.

Check DRA driver Pod logs and you will see output similar to this:


I0717 21:50:22.925811 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:22.932499 1 driver.go:129] Returning newly prepared devices for claim '933e9c72-6fd6-49c5-933c-a896407dc6d1': [&Device{RequestNames:[mig-large],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-0-mig-9-4-4,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-0-mig-9-4-4**],}]
I0717 21:50:23.186472 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:23.191226 1 driver.go:129] Returning newly prepared devices for claim '61e5ddd2-8c2e-4c19-93ae-d317fecb44a4': [&Device{RequestNames:[mig-medium],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-2-mig-14-0-2,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-2-mig-14-0-2**],}]
I0717 21:50:23.450024 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:23.455991 1 driver.go:129] Returning newly prepared devices for claim '1eda9b2c-2ea6-401e-96d0-90e9b3c111b5': [&Device{RequestNames:[mig-small],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-1-mig-19-2-1,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-1-mig-19-2-1**],}]

Verify the resourceclaims to see the Pod status:


kubectl get resourceclaims -n mig-gpu -w

The following is example output:


NAME                                             STATE                AGE
mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
mig-large-training-pod-mig-large-claim-6dpn8     allocated,reserved   0s
mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
mig-medium-training-pod-mig-medium-claim-bk596   allocated,reserved   0s
mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
mig-small-inference-pod-mig-small-claim-d2t58    allocated,reserved   0s

As you can see, all the Pods moved from pending to allocated,reserved by the DRA driver.

Run nvidia-smi from the node. You will notice three Python processors are running:


root@ip-100-64-173-145 bin]# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:10:1C.0 Off | On |
| N/A 63C P0 127W / 400W | 569MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:10:1D.0 Off | On |
| N/A 56C P0 121W / 400W | 374MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:20:1C.0 Off | On |
| N/A 63C P0 128W / 400W | 467MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:20:1D.0 Off | On |
| N/A 57C P0 118W / 400W | 249MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:90:1C.0 Off | 0 |
| N/A 51C P0 77W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:90:1D.0 Off | 0 |
| N/A 46C P0 69W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:A0:1C.0 Off | 0 |
| N/A 52C P0 74W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 47C P0 72W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+


+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 2 0 0 | 428MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 3 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 10 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 5 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 13 0 2 | 161MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 2MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 2 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 2 5 0 1 | 289MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 2 13 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 2 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 3 1 0 0 | 107MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 3 5 0 1 | 71MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 3 13 0 2 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 3 14 0 3 | 36MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+


+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
**| 0 2 0 64080 C python 312MiB |
| 1 13 0 64085 C python 118MiB |
| 2 5 0 64073 C python 210MiB |**
+-----------------------------------------------------------------------------------------+

Optimize GPU workloads with IMEX using GB200 P6e instances

IMEX (Internode Memory Exchange) enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 UltraServers.

Do the following steps.

Define a ComputeDomain for multi-node training with a file named imex-compute-domain.yaml:


apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: distributed-training-domain
  namespace: default
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: imex-channel-template

Define a Pod using IMEX channels with a file named imex-pod.yaml:


apiVersion: v1
kind: Pod
metadata:
  name: imex-distributed-training
  namespace: default
  labels:
    app: imex-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: distributed-training
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["bash", "-c"]
    args:
    - |
      echo "=== IMEX Channel Verification ==="
      ls -la /dev/nvidia-caps-imex-channels/
      echo ""

      echo "=== GPU Information ==="
      nvidia-smi
      echo ""

      echo "=== NCCL Test (if available) ==="
      python -c "
      import torch
      import torch.distributed as dist
      import os

      print(f'CUDA available: {torch.cuda.is_available()}')
      print(f'CUDA device count: {torch.cuda.device_count()}')

      if torch.cuda.is_available():
          for i in range(torch.cuda.device_count()):
              print(f'GPU {i}: {torch.cuda.get_device_name(i)}')

      # Check for IMEX environment variables
      imex_vars = [k for k in os.environ.keys() if 'IMEX' in k or 'NVLINK' in k]
      if imex_vars:
          print('IMEX Environment Variables:')
          for var in imex_vars:
              print(f'  {var}={os.environ[var]}')

      print('IMEX channel verification completed')
      "

      # Keep container running for inspection
      sleep 3600
    resources:
      claims:
      - name: imex-channel-0
      - name: imex-channel-1
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-template
  - name: imex-channel-1
    resourceClaimTemplateName: imex-channel-template
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Note

This requires P6e GB200 instances.

Deploy IMEX by applying the ComputeDomain and templates:


kubectl apply -f imex-claim-template.yaml
kubectl apply -f imex-compute-domain.yaml
kubectl apply -f imex-pod.yaml

Check the ComputeDomain status.


kubectl get computedomain distributed-training-domain

Monitor the IMEX daemon deployment.


kubectl get pods -n nvidia-dra-driver -l resource.nvidia.com/computeDomain

Check the IMEX channels in the Pod:


kubectl exec imex-distributed-training -- ls -la /dev/nvidia-caps-imex-channels/

View the Pod logs:


kubectl logs imex-distributed-training

The following is an example of expected output:


=== IMEX Channel Verification ===
total 0
drwxr-xr-x. 2 root root 80 Jul 8 10:45 .
drwxr-xr-x. 6 root root 380 Jul 8 10:45 ..
crw-rw-rw-. 1 root root 241, 0 Jul 8 10:45 channel0
crw-rw-rw-. 1 root root 241, 1 Jul 8 10:45 channel1

For more information, see the NVIDIA example on GitHub.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

AI/ML

Networking

Compute and Autoscaling

GPU Resource Optimization and Cost Management

Schedule workloads with GPU requirements using Well-Known labels

Use Kubernetes Device Plugin for exposing GPUs

Use many different EC2 instance types

Consider using Amazon EC2 Spot Instances for GPUs with Karpenter

Implement Checkpointing for Long Running Training Jobs

Use ML Capacity Blocks for capacity assurance of P and Trainium instances

Consider On-Demand, Amazon EC2 Spot or On-Demand Capacity Reservations (ODCRs) for G Amazon EC2 instances

Consider other accelerated instance types and sizes

Optimize GPU Resource Allocation with Time-Slicing, MIG, and Fractional GPU Allocation

Node Resiliency and Training Job Management

Implement Node Health Checks with Automated Recovery

Disable Karpenter Consolidation for interruption sensitive Workloads

Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs

Configure Low-Priority Job Preemption for Higher-Priority Jobs/workloads

Application Scaling and Performance

Tailor Compute Capacity for ML workloads with Karpenter or Static Nodes

Use taints and tolerations to prevent non-accelerated workloads from being scheduled on accelerated instances

Scale Based on Model Performance

Dynamic resource allocation for advanced GPU management

Advantages of using DRA instead of NVIDIA device plugin

Instances supported by DRA and their features

Additional resources

Set up dynamic resource allocation for advanced GPU management

Prerequisites

Cluster configuration

Required components

Step 1: Create cluster with DRA-enabled node group using eksctl

Step 2: Deploy the NVIDIA device plugin

Step 3: Deploy NVIDIA DRA driver Helm chart

Step 4: Verify the DRA installation

Schedule a simple GPU workload using dynamic resource allocation

GPU optimization techniques with dynamic resource allocation

Optimize GPU workloads with time-slicing

Optimize GPU workloads with MPS

Container1: inference-container

Container2: training-container

Optimize GPU workloads with Multi-Instance GPU

Step 1: Deploy NVIDIA GPU Operator

Step 2: Test MIG resource allocation

Optimize GPU workloads with IMEX using GB200 P6e instances

Note

Container1: `inference-container`

Container2: `training-container`