Compute and Autoscaling
GPU Resource Optimization and Cost Management
Schedule workloads with GPU requirements using well-known labels
For AI/ML workloads that are sensitive to different GPU characteristics (e.g., GPU model, GPU memory), we recommend specifying GPU requirements using well-known scheduling labels.
Example
For example, using GPU name node selector when using Karpenter:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
    - name: ml-workload
      image: <image>
      resources:
        limits:
          nvidia.com/gpu: 1 # Request one NVIDIA GPU
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: "l40s" # Run on nodes with NVIDIA L40S GPUs
Use Kubernetes Device Plugin for exposing GPUs
To expose GPUs on nodes, the NVIDIA GPU driver must be installed on the node’s operating system and container runtime configured to allow the Kubernetes scheduler to assign pods to nodes with available GPUs. The setup process for the NVIDIA Kubernetes Device Plugin depends on the EKS Accelerated AMI you are using:
- Bottlerocket Accelerated AMI: This AMI includes the NVIDIA GPU driver, and the NVIDIA Kubernetes Device Plugin is pre-installed and ready to use, enabling GPU support out of the box. No additional configuration is required to expose GPUs to the Kubernetes scheduler.
- AL2023 Accelerated AMI: This AMI includes the NVIDIA GPU driver, but the NVIDIA Kubernetes Device Plugin is not pre-installed. You must install and configure the device plugin separately, typically via a DaemonSet. Note that if you use eksctl to create your cluster and specify a GPU instance type (e.g., g5.xlarge) in your ClusterConfig, eksctl will automatically select the accelerated AMI and install the NVIDIA Kubernetes Device Plugin on each instance in the node group. To learn more, see GPU support in the eksctl documentation.
To verify that the NVIDIA Device Plugin is active and GPUs are correctly exposed, run:
kubectl describe node | grep nvidia.com/gpu
This command checks whether the nvidia.com/gpu resource appears in the node's capacity and allocatable resources. For example, a node with one GPU should show nvidia.com/gpu: 1. See the Kubernetes GPU Scheduling Guide for more information.
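Alternatively, to check the allocatable GPU count across all nodes at once, you can use a custom-columns query; this is a minimal sketch (the column names are arbitrary, and the argument is quoted so the escaped dots in the resource name survive the shell):

kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"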
Use ML Capacity Blocks for capacity assurance of P and Trainium instances
Capacity Blocks for ML allow you to reserve highly sought-after GPU instances, specifically P instances (e.g., p6-b200, p5, p5e, p5en, p4d, p4de) and Trainium instances (e.g., trn1, trn2), to start either almost immediately or on a future date to support your short-duration machine learning (ML) workloads. These reservations are ideal for ensuring capacity for compute-intensive tasks like model training and fine-tuning. EC2 Capacity Blocks pricing consists of a reservation fee and an operating system fee. To learn more about pricing, see EC2 Capacity Blocks for ML pricing.
To reserve GPUs for AI/ML workloads on Amazon EKS with predictable capacity assurance, we recommend leveraging ML Capacity Blocks for short-term needs or On-Demand Capacity Reservations (ODCRs) for general-purpose capacity assurance.
- ODCRs allow you to reserve EC2 instance capacity (e.g., GPU instances like g5 or p5) in a specific Availability Zone for a duration, ensuring availability even during high demand. ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. In EKS, ODCRs are supported by node types like Karpenter and managed node groups. To prioritize ODCRs in Karpenter, configure the NodeClass to use the capacityReservationSelectorTerms field (see the sketch below). See the Karpenter NodePools Documentation.
- Capacity Blocks are a specialized reservation mechanism for GPU (e.g., p5, p4d) or Trainium (trn1, trn2) instances, designed for short-term ML workloads like model training, fine-tuning, or experimentation. You reserve capacity for a defined period (typically 24 hours to 182 days) starting on a future date, paying only for the reserved time. They are pre-paid, require pre-planning for capacity needs, and do not support autoscaling, but they are colocated in EC2 UltraClusters for low-latency networking. To learn more, refer to Find and purchase Capacity Blocks, or get started by setting up managed node groups with Capacity Blocks using the instructions in Create a managed node group with Capacity Blocks for ML.
Reserve capacity via the AWS Management Console and configure your nodes to use ML capacity blocks. Plan reservations based on workload schedules and test in a staging cluster. Refer to the Capacity Blocks Documentation for more information.
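The following is a minimal sketch of targeting an ODCR from Karpenter through an EC2NodeClass. The NodeClass name, cluster tags, role, and reservation ID are placeholders, and the exact fields can vary by Karpenter version, so verify against the Karpenter documentation for your release:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-reserved
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster          # replace with your node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # replace with your cluster discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  capacityReservationSelectorTerms:
    - id: cr-abcdefghij                       # replace with your capacity reservation ID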
Consider On-Demand, Amazon EC2 Spot or On-Demand Capacity Reservations (ODCRs) for G Amazon EC2 instances
For G-family Amazon EC2 instances, consider the different purchase options: On-Demand, Amazon EC2 Spot Instances, and On-Demand Capacity Reservations (ODCRs). ODCRs allow you to reserve EC2 instance capacity in a specific Availability Zone for a certain duration, ensuring availability even during high demand. Unlike ML Capacity Blocks, which are only available for P and Trainium instances, ODCRs can be used for a wider range of instance types, including G instances, making them suitable for workloads that require different GPU capabilities, such as inference or graphics. When using Amazon EC2 Spot Instances, diversifying across different instance types, sizes, and Availability Zones is key to staying on Spot for longer.
ODCRs have no long-term commitment, but you pay the On-Demand rate for the reserved capacity, whether used or idle. ODCRs can be created for immediate use or scheduled for a future date, providing flexibility in capacity planning. In Amazon EKS, ODCRs are supported by node types like Karpenter and managed node groups; in Karpenter, use the capacityReservationSelectorTerms field. See the Karpenter NodePools Documentation.
Consider other accelerated instance types and sizes
Selecting the appropriate accelerated instance type and size is essential for optimizing both performance and cost in your ML workloads on Amazon EKS. For example, different GPU instance families have different performance characteristics and capabilities, such as GPU memory. To help you choose the most price-performant option, review the available GPU instances on the EC2 Instance Types page.
If you use a GPU instance in an EKS node, it will have the nvidia-device-plugin-daemonset pod in the kube-system namespace by default. To get a quick sense of whether you are fully utilizing the GPU(s) in your instance, you can use nvidia-smi:
kubectl exec nvidia-device-plugin-daemonset-xxxxx \
  -n kube-system -- nvidia-smi \
  --query-gpu=index,power.draw,power.limit,temperature.gpu,utilization.gpu,utilization.memory,memory.free,memory.used \
  --format=csv -l 5
- If utilization.memory is close to 100%, then your code is likely memory bound. This means that the GPU memory is fully utilized, and it could suggest that further performance optimization should be investigated.
- If utilization.gpu is close to 100%, this does not necessarily mean the GPU is fully utilized. A better metric to look at is the ratio of power.draw to power.limit. If this ratio is 100% or more, then your code is fully utilizing the compute capacity of the GPU.
- The -l 5 flag outputs the metrics every 5 seconds. In the case of a single-GPU instance type, the index query flag is not needed.
To learn more, see GPU instances in AWS documentation.
Optimize GPU Resource Allocation with Time-Slicing, MIG, and Fractional GPU Allocation
Static resource limits in Kubernetes (e.g., CPU, memory, GPU counts) can lead to over-provisioning or underutilization, particularly for dynamic AI/ML workloads like inference. Selecting the right GPU is important. For low-volume or spiky workloads, time-slicing allows multiple workloads to share a single GPU by sharing its compute resources, potentially improving efficiency and reducing waste. GPU sharing can be achieved through different options:
- Leverage node selectors / node affinity to influence scheduling: Ensure the nodes provisioned and the pods scheduled are matched to the appropriate GPUs for the workload (e.g., karpenter.k8s.aws/instance-gpu-name: "a100").
- Time-slicing: Schedules workloads to share a GPU's compute resources over time, allowing concurrent execution without physical partitioning. This is ideal for workloads with variable compute demands, but may lack memory isolation.
- Multi-Instance GPU (MIG): MIG allows a single NVIDIA GPU to be partitioned into multiple, isolated instances and is supported on NVIDIA Ampere (e.g., A100), NVIDIA Hopper (e.g., H100), and NVIDIA Blackwell GPUs. Each MIG instance receives dedicated compute and memory resources, enabling resource sharing in multi-tenant environments or for workloads requiring resource guarantees, which allows you to optimize GPU resource utilization, including scenarios like serving multiple models with different batch sizes through time-slicing.
- Fractional GPU allocation: Uses software-based scheduling to allocate portions of a GPU's compute or memory to workloads, offering flexibility for dynamic workloads. The NVIDIA KAI Scheduler, part of the Run:ai platform, enables this by allowing pods to request fractional GPU resources.
To enable these features in EKS, you can deploy the NVIDIA Device Plugin, which exposes GPUs as schedulable resources and supports time-slicing and MIG. To learn more, see Time-Slicing GPUs in Kubernetes.
Example
For example, to enable time-slicing with the NVIDIA Device Plugin:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # Allow 4 pods to share each GPU
Example
For example, to use KAI Scheduler for fractional GPU allocation, deploy it alongside the NVIDIA GPU Operator and specify fractional GPU resources in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod-example
  annotations:
    gpu-fraction: "0.5" # Annotation for 50% GPU
  labels:
    runai/queue: "default" # Required queue assignment
spec:
  containers:
    - name: ml-workload
      image: nvcr.io/nvidia/pytorch:25.04-py3
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  schedulerName: kai-scheduler
Node Resiliency and Training Job Management
Implement Node Health Checks with Automated Recovery
For distributed training jobs on Amazon EKS that require frequent inter-node communication, such as multi-GPU model training across multiple nodes, hardware issues like GPU or EFA failures can cause disruptions to training jobs. These disruptions can lead to loss of training progress and increased costs, particularly for long-running AI/ML workloads that rely on stable hardware.
To help add resilience against hardware failures, such as GPU failures in EKS clusters running GPU workloads, we recommend leveraging either the EKS Node Monitoring Agent with Auto Repair or Amazon SageMaker HyperPod. While the EKS Node Monitoring Agent with Auto Repair provides features like node health monitoring and auto-repair using standard Kubernetes mechanisms, SageMaker HyperPod offers targeted resilience and additional features specifically designed for large-scale ML training, such as deep health checks and automatic job resumption.
- The EKS Node Monitoring Agent with Node Auto Repair continuously monitors node health by reading logs and applying NodeConditions, including standard conditions like Ready and conditions specific to accelerated hardware, to identify issues like GPU or networking failures. When a node is deemed unhealthy, Node Auto Repair cordons it and replaces it with a new node. The rescheduling of pods and restarting of jobs rely on standard Kubernetes mechanisms and the job's restart policy.
- The SageMaker HyperPod deep health checks and health-monitoring agent continuously monitor the health status of GPU and Trainium-based instances. They are tailored for AI/ML workloads, using labels (e.g., node-health-status) to manage node health. When a node is deemed unhealthy, HyperPod triggers automatic replacement of the faulty hardware, such as GPUs. It detects networking-related failures for EFA through its basic health checks by default and supports auto-resume for interrupted training jobs, allowing jobs to continue from the last checkpoint and minimizing disruptions for large-scale ML tasks.
For both EKS Node Monitoring Agent with Auto Repair and SageMaker HyperPod clusters using EFA, to monitor EFA-specific metrics such as Remote Direct Memory Access (RDMA) errors and packet drops, make sure the AWS EFA driver is installed. In addition, we recommend deploying the CloudWatch Observability Add-on or using tools like DCGM Exporter with Prometheus and Grafana to monitor EFA, GPU, and, for SageMaker HyperPod, specific metrics related to its features.
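As a starting point, both the node monitoring agent and the CloudWatch Observability agent can be installed as EKS add-ons. The following CLI sketch assumes a cluster named my-cluster; confirm current add-on names and options in the EKS documentation before relying on them:

aws eks create-addon --cluster-name my-cluster --addon-name eks-node-monitoring-agent
aws eks create-addon --cluster-name my-cluster --addon-name amazon-cloudwatch-observability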
Disable Karpenter Consolidation for interruption-sensitive workloads
For workloads sensitive to interruptions, such as data processing, large-scale AI/ML prediction tasks, or training, we recommend tuning Karpenter consolidation policies. The WhenEmptyOrUnderutilized consolidation policy may terminate nodes prematurely, leading to longer execution times. For example, interruptions may delay job resumption due to pod rescheduling and data reloading, which could be costly for long-running batch inference jobs. To mitigate this, you can set the consolidationPolicy to WhenEmpty and configure a consolidateAfter duration, such as 1 hour, to retain nodes during workload spikes. For example:
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 60m
This approach improves pod startup latency for spiky batch inference workloads and other interruption-sensitive jobs, such as real-time online inference, data processing, or model training, where the cost of interruption outweighs compute cost savings. Karpenter NodePool Disruption Budgets can additionally be used to limit how quickly nodes are voluntarily disrupted.
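As a hedged sketch of such a budget (the percentages and schedule are illustrative), the NodePool disruption block could cap concurrent disruptions and block them entirely during a daily window:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 60m
  budgets:
    - nodes: "10%"              # at most 10% of nodes may be disrupted at a time
    - schedule: "0 9 * * 1-5"   # starting 09:00 UTC on weekdays...
      duration: 8h              # ...block voluntary disruptions for 8 hours
      nodes: "0"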
Use ttlSecondsAfterFinished to Auto Clean-Up Kubernetes Jobs
We recommend setting ttlSecondsAfterFinished for Kubernetes Jobs in Amazon EKS to automatically delete completed Job objects. Lingering Job objects consume cluster resources, such as API server memory, and complicate monitoring by cluttering dashboards (e.g., Grafana, Amazon CloudWatch). For example, setting a TTL of 1 hour ensures Jobs are removed shortly after completion, keeping your cluster tidy. For more details, refer to Automatic Cleanup for Finished Jobs.
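The following minimal Job sketch shows where the field goes; the Job name and image are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference-job
spec:
  ttlSecondsAfterFinished: 3600  # delete the Job object 1 hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: inference
          image: <image>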
Configure Low-Priority Job Preemption for Higher-Priority Jobs/workloads
For mixed-priority AI/ML workloads on Amazon EKS, you may configure low-priority job preemption to ensure higher-priority tasks (e.g., real-time inference) receive resources promptly. Without preemption, low-priority workloads such as batch processes (e.g., batch inference, data processing), non-batch services (e.g., background tasks, cron jobs), or CPU/memory-intensive jobs (e.g., web services) can delay critical pods by occupying nodes. Preemption allows Kubernetes to evict low-priority pods when high-priority pods need resources, ensuring efficient resource allocation on nodes with GPUs, CPUs, or memory. We recommend using Kubernetes PriorityClass
to assign priorities and PodDisruptionBudget
to control eviction behavior.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
---
# In the low-priority workload's Pod template, reference the class:
spec:
  priorityClassName: low-priority
See the Kubernetes Priority and Preemption Documentation for more details.
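To pair preemption with controlled eviction, a PodDisruptionBudget can protect the high-priority service; this is a minimal sketch, and the label selector and replica count are placeholders:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: realtime-inference-pdb
spec:
  minAvailable: 2               # keep at least 2 replicas available during voluntary disruptions
  selector:
    matchLabels:
      app: realtime-inference   # placeholder label for the high-priority deployment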
Application Scaling and Performance
Tailor Compute Capacity for ML workloads with Karpenter or Static Nodes
To ensure cost-efficient and responsive compute capacity for machine learning (ML) workflows on Amazon EKS, we recommend tailoring your node provisioning strategy to your workload's characteristics and cost commitments. Below are two approaches to consider: just-in-time scaling with Karpenter and static node groups for reserved capacity.
- Just-in-time data plane scalers like Karpenter: For dynamic ML workflows with variable compute demands (e.g., GPU-based inference followed by CPU-based plotting), we recommend using just-in-time data plane scalers like Karpenter.
- Use static node groups for predictable workloads: For predictable, steady-state ML workloads or when using Reserved Instances, EKS managed node groups can help ensure reserved capacity is fully provisioned and utilized, maximizing savings. This approach is ideal for specific instance types committed via RIs or ODCRs.
Example
This is an example of a diverse Karpenter NodePool for G-family Amazon EC2 instances where the instance generation is greater than three.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "1000"
    memory: 4000Gi
    nvidia.com/gpu: "10" # Limit the total number of GPUs to 10 for the NodePool
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60m
    expireAfter: 720h
Example
Example using static node groups for a training workload:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster
  region: us-west-2
managedNodeGroups:
  - name: gpu-node-group
    instanceType: p4d.24xlarge
    minSize: 2
    maxSize: 2
    desiredCapacity: 2
    taints:
      - key: nvidia.com/gpu
        effect: NoSchedule
Use taints and tolerations to prevent non-accelerated workloads from being scheduled on accelerated instances
Scheduling non-accelerated workloads on GPU resources is not compute-efficient. We recommend using taints and tolerations to ensure pods from non-accelerated workloads are not scheduled on accelerated nodes, as shown in the sketch below. See the Kubernetes documentation on taints and tolerations for details.
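For example, if GPU nodes carry the nvidia.com/gpu taint used elsewhere in this guide, only pods that add the matching toleration can land on them; this is a minimal sketch of the pod-side toleration:

# Pod spec snippet for accelerated workloads only; pods without this
# toleration are repelled by nodes tainted with nvidia.com/gpu:NoSchedule.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule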
Scale Based on Model Performance
For inference workloads, we recommend using Kubernetes Event-Driven Autoscaling (KEDA) to scale based on model performance metrics like inference requests or token throughput, with appropriate cooldown periods (see the sketch below). Static scaling policies may over- or under-provision resources, impacting cost and latency. Learn more in the KEDA Documentation.
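The following is a minimal ScaledObject sketch that scales a hypothetical inference Deployment on a Prometheus request-rate metric; the Deployment name, Prometheus address, metric name, and threshold are all placeholders to adapt to your environment:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment        # placeholder Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300                 # seconds to wait before scaling back in
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder Prometheus endpoint
        query: sum(rate(inference_requests_total[2m]))          # placeholder request-rate metric
        threshold: "100"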
Dynamic resource allocation for advanced GPU management
Dynamic resource allocation (DRA) is a Kubernetes capability that provides:
- Fine-grained GPU allocation
- Advanced sharing mechanisms, such as Multi-Process Service (MPS) and Multi-Instance GPU (MIG)
- Support for next-generation hardware architectures, including NVIDIA GB200 UltraServers
Traditional GPU allocation treats GPUs as opaque integer resources, creating significant under-utilization (often 30-40% in production clusters). This occurs because workloads receive exclusive access to entire GPUs even when requiring only fractional resources. DRA transforms this model by introducing structured, declarative allocation that provides the Kubernetes scheduler with complete visibility into hardware characteristics and workload requirements. This enables intelligent placement decisions and efficient resource sharing.
Advantages of using DRA instead of NVIDIA device plugin
The NVIDIA device plugin (starting from version 0.12.0) supports GPU sharing mechanisms including time-slicing, MPS, and MIG. However, architectural limitations exist that DRA addresses.
NVIDIA device plugin limitations
- Static configuration: GPU sharing configurations (time-slicing replicas and MPS settings) require pre-configuration cluster-wide through ConfigMaps. This makes providing different sharing strategies for different workloads difficult.
- Limited granular selection: While the device plugin exposes GPU characteristics through node labels, workloads cannot dynamically request specific GPU configurations (memory size and compute capabilities) as part of the scheduling decision.
- No cross-node resource coordination: Cannot manage distributed GPU resources across multiple nodes or express complex topology requirements like NVLink domains for systems like NVIDIA GB200.
- Scheduler constraints: The Kubernetes scheduler treats GPU resources as opaque integers, limiting its ability to make topology-aware decisions or handle complex resource dependencies.
- Configuration complexity: Setting up different sharing strategies requires multiple ConfigMaps and careful node labeling, creating operational complexity.
Solutions with DRA
- Dynamic resource selection: DRA allows workloads to specify detailed requirements (GPU memory, driver versions, and specific attributes) at request time through resourceclaims. This enables more flexible resource matching.
- Topology awareness: Through structured parameters and device selectors, DRA handles complex requirements like cross-node GPU communication and memory-coherent interconnects.
- Cross-node resource management: computeDomains enable coordination of distributed GPU resources across multiple nodes, critical for systems like GB200 with IMEX channels.
- Workload-specific configuration: Each ResourceClaim specifies different sharing strategies and configurations, allowing fine-grained control per workload rather than cluster-wide settings.
- Enhanced scheduler integration: DRA provides the scheduler with detailed device information and enables more intelligent placement decisions based on hardware topology and resource characteristics.
Important: DRA does not replace the NVIDIA device plugin entirely. The NVIDIA DRA driver works alongside the device plugin to provide enhanced capabilities. The device plugin continues to handle basic GPU discovery and management, while DRA adds advanced allocation and scheduling features.
Instances supported by DRA and their features
DRA support varies by Amazon EC2 instance family and GPU architecture, as shown in the following table.
| Instance family | GPU type | Time-slicing | MIG support | MPS support | IMEX support | Use cases |
|---|---|---|---|---|---|---|
| G5 | NVIDIA A10G | Yes | No | Yes | No | Inference and graphics workloads |
| G6 | NVIDIA L4 | Yes | No | Yes | No | AI inference and video processing |
| G6e | NVIDIA L40S | Yes | No | Yes | No | Training, inference, and graphics |
| P4d/P4de | NVIDIA A100 | Yes | Yes | Yes | No | Large-scale training and HPC |
| P5 | NVIDIA H100 | Yes | Yes | Yes | No | Foundation model training |
| P6 | NVIDIA B200 | Yes | Yes | Yes | No | Billion or trillion-parameter models, distributed training, and inference |
| P6e | NVIDIA GB200 | Yes | Yes | Yes | Yes | Billion or trillion-parameter models, distributed training, and inference |
The following are descriptions of each feature in the table:
- Time-slicing: Allows multiple workloads to share GPU compute resources over time.
- Multi-Instance GPU (MIG): Hardware-level partitioning that creates isolated GPU instances.
- Multi-Process Service (MPS): Enables concurrent execution of multiple CUDA processes on a single GPU.
- Internode Memory Exchange (IMEX): Memory-coherent communication across nodes for GB200 UltraServers.
Additional resources
For more information about Kubernetes DRA and the NVIDIA DRA driver, see the Kubernetes dynamic resource allocation documentation and the NVIDIA DRA driver for GPUs repository on GitHub.
Set up dynamic resource allocation for advanced GPU management
The following topic shows you how to set up dynamic resource allocation (DRA) for advanced GPU management.
Prerequisites
Before implementing DRA on Amazon EKS, ensure your environment meets the following requirements.
Cluster configuration
- Amazon EKS cluster running version 1.33 or later
- Amazon EKS managed node groups (DRA is currently supported only by managed node groups with AL2023 and Bottlerocket NVIDIA optimized AMIs, not with Karpenter)
- NVIDIA GPU-enabled worker nodes with appropriate instance types
Required components
- NVIDIA device plugin version 0.17.1 or later
- NVIDIA DRA driver version 25.3.0 or later
Step 1: Create cluster with DRA-enabled node group using eksctl
- Create a cluster configuration file named dra-eks-cluster.yaml:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dra-eks-cluster
  region: us-west-2
  version: '1.33'
managedNodeGroups:
  - name: gpu-dra-nodes
    amiFamily: AmazonLinux2023
    instanceType: g6.12xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
    labels:
      node-type: "gpu-dra"
      nvidia.com/gpu.present: "true"
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
- Create the cluster:
eksctl create cluster -f dra-eks-cluster.yaml
Step 2: Deploy the NVIDIA device plugin
Deploy the NVIDIA device plugin to enable basic GPU discovery:
- Add the NVIDIA device plugin Helm repository:

helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
helm repo update
- Create custom values for the device plugin:

cat <<EOF > nvidia-device-plugin-values.yaml
gfd:
  enabled: true
nfd:
  enabled: true
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
- Install the NVIDIA device plugin:

helm install nvidia-device-plugin nvidia/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version v0.17.1 \
  --values nvidia-device-plugin-values.yaml
Step 3: Deploy NVIDIA DRA driver Helm chart
- Create a dra-driver-values.yaml values file for the DRA driver:

---
nvidiaDriverRoot: /
gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: true # Enable for GB200 IMEX support
controller:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: In
                values: ["true"]
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
- Add the NVIDIA NGC Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
- Install the NVIDIA DRA driver:

helm install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
  --version="25.3.0-rc.2" \
  --namespace nvidia-dra-driver \
  --create-namespace \
  --values dra-driver-values.yaml
Step 4: Verify the DRA installation
- Verify that the DRA API resources are available:
kubectl api-resources | grep resource.k8s.io/v1beta1
The following is the expected output:
deviceclasses            resource.k8s.io/v1beta1   false   DeviceClass
resourceclaims           resource.k8s.io/v1beta1   true    ResourceClaim
resourceclaimtemplates   resource.k8s.io/v1beta1   true    ResourceClaimTemplate
resourceslices           resource.k8s.io/v1beta1   false   ResourceSlice
- Check the available device classes:
kubectl get deviceclasses
The following is an example of expected output:
NAME                                        AGE
compute-domain-daemon.nvidia.com            4h39m
compute-domain-default-channel.nvidia.com   4h39m
gpu.nvidia.com                              4h39m
mig.nvidia.com                              4h39m
When a newly created G6 GPU instance joins your Amazon EKS cluster with DRA enabled, the following actions occur:
- The NVIDIA DRA driver automatically discovers the L4 GPU and creates two resourceslices on that node.
- The gpu.nvidia.com slice registers the physical L4 GPU device with its specifications (memory, compute capability, and more).
- Because the L4 doesn't support MIG partitioning, the compute-domain.nvidia.com slice creates a single compute domain representing the entire compute context of the GPU.
- These resourceslices are then published to the Kubernetes API server, making the GPU resources available for scheduling through resourceclaims.

The DRA scheduler can now intelligently allocate this GPU to Pods that request GPU resources through resourceclaimtemplates, providing more flexible resource management compared to traditional device plugin approaches. This happens automatically without manual intervention. The node simply becomes available for GPU workloads once the DRA driver completes the resource discovery and registration process.

When you run the following command:
kubectl get resourceslices
The following is an example of expected output:
NAME                                                            NODE                            DRIVER                      POOL                            AGE
ip-100-64-129-47.ec2.internal-compute-domain.nvidia.com-rwsts   ip-100-64-129-47.ec2.internal   compute-domain.nvidia.com   ip-100-64-129-47.ec2.internal   35m
ip-100-64-129-47.ec2.internal-gpu.nvidia.com-6kndg              ip-100-64-129-47.ec2.internal   gpu.nvidia.com              ip-100-64-129-47.ec2.internal   35m
- Continue to Schedule a simple GPU workload using dynamic resource allocation.
Schedule a simple GPU workload using dynamic resource allocation
To schedule a simple GPU workload using dynamic resource allocation (DRA), do the following steps. Before proceeding, make sure you have followed Set up dynamic resource allocation for advanced GPU management.
- Create a basic ResourceClaimTemplate for GPU allocation with a file named basic-gpu-claim-template.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com
- Apply the template:
kubectl apply -f basic-gpu-claim-template.yaml
- Verify the status:
kubectl get resourceclaimtemplates -n gpu-test1
The following is example output:
NAME         AGE
single-gpu   9m16s
- Create a Pod that uses the ResourceClaimTemplate with a file named basic-gpu-pod.yaml:

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: gpu-pod
  labels:
    app: pod
spec:
  containers:
    - name: ctr0
      image: ubuntu:22.04
      command: ["bash", "-c"]
      args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
      resources:
        claims:
          - name: gpu0
  resourceClaims:
    - name: gpu0
      resourceClaimTemplateName: single-gpu
  nodeSelector:
    NodeGroupType: gpu-dra
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
- Apply and monitor the Pod:
kubectl apply -f basic-gpu-pod.yaml
- Check the Pod status:
kubectl get pod -n gpu-test1
The following is example expected output:
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   1/1     Running   0          13m
- Check the ResourceClaim status:

kubectl get resourceclaims -n gpu-test1
The following is example expected output:
NAME                 STATE                AGE
gpu-pod-gpu0-l76cg   allocated,reserved   9m6s
- View Pod logs to see GPU information:
kubectl logs gpu-pod -n gpu-test1
The following is example expected output:
GPU 0: NVIDIA L4 (UUID: GPU-da7c24d7-c7e3-ed3b-418c-bcecc32af7c5)
Continue to GPU optimization techniques with dynamic resource allocation for more advanced GPU optimization techniques using DRA.
GPU optimization techniques with dynamic resource allocation
Modern GPU workloads require sophisticated resource management to achieve optimal utilization and cost efficiency. DRA enables several advanced optimization techniques that address different use cases and hardware capabilities:
- Time-slicing allows multiple workloads to share GPU compute resources over time, making it ideal for inference workloads with sporadic GPU usage. For an example, see Optimize GPU workloads with time-slicing.
- Multi-Process Service (MPS) enables concurrent execution of multiple CUDA processes on a single GPU with better isolation than time-slicing. For an example, see Optimize GPU workloads with MPS.
- Multi-Instance GPU (MIG) provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources. For an example, see Optimize GPU workloads with Multi-Instance GPU.
- Internode Memory Exchange (IMEX) enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 systems. For an example, see Optimize GPU workloads with IMEX using GB200 P6e instances.
These techniques can significantly improve resource utilization. Organizations report GPU utilization increases from 30-40% with traditional allocation to 80-90% with optimized sharing strategies. The choice of technique depends on workload characteristics, isolation requirements, and hardware capabilities.
Optimize GPU workloads with time-slicing
Time-slicing enables multiple workloads to share GPU compute resources by scheduling them to run sequentially on the same physical GPU. It is ideal for inference workloads with sporadic GPU usage.
Do the following steps.
- Define a ResourceClaimTemplate for time-slicing with a file named timeslicing-claim-template.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: timeslicing-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: timeslicing-gpu-template
  namespace: timeslicing-gpu
spec:
  spec:
    devices:
      requests:
        - name: shared-gpu
          deviceClassName: gpu.nvidia.com
      config:
        - requests: ["shared-gpu"]
          opaque:
            driver: gpu.nvidia.com
            parameters:
              apiVersion: resource.nvidia.com/v1beta1
              kind: GpuConfig
              sharing:
                strategy: TimeSlicing
- Define a Pod using time-slicing with a file named timeslicing-pod.yaml
:--- # Pod 1 - Inference workload apiVersion: v1 kind: Pod metadata: name: inference-pod-1 namespace: timeslicing-gpu labels: app: gpu-inference spec: restartPolicy: Never containers: - name: inference-container image: nvcr.io/nvidia/pytorch:25.04-py3 command: ["python", "-c"] args: - | import torch import time import os print(f"=== POD 1 STARTING ===") print(f"GPU available: {torch.cuda.is_available()}") print(f"GPU count: {torch.cuda.device_count()}") if torch.cuda.is_available(): device = torch.cuda.current_device() print(f"Current GPU: {torch.cuda.get_device_name(device)}") print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB") # Simulate inference workload for i in range(20): x = torch.randn(1000, 1000).cuda() y = torch.mm(x, x.t()) print(f"Pod 1 - Iteration {i+1} completed at {time.strftime('%H:%M:%S')}") time.sleep(60) else: print("No GPU available!") time.sleep(5) resources: claims: - name: shared-gpu-claim resourceClaims: - name: shared-gpu-claim resourceClaimTemplateName: timeslicing-gpu-template nodeSelector: NodeGroupType: "gpu-dra" nvidia.com/gpu.present: "true" tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule --- # Pod 2 - Training workload apiVersion: v1 kind: Pod metadata: name: training-pod-2 namespace: timeslicing-gpu labels: app: gpu-training spec: restartPolicy: Never containers: - name: training-container image: nvcr.io/nvidia/pytorch:25.04-py3 command: ["python", "-c"] args: - | import torch import time import os print(f"=== POD 2 STARTING ===") print(f"GPU available: {torch.cuda.is_available()}") print(f"GPU count: {torch.cuda.device_count()}") if torch.cuda.is_available(): device = torch.cuda.current_device() print(f"Current GPU: {torch.cuda.get_device_name(device)}") print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB") # Simulate training workload with heavier compute for i in range(15): x = torch.randn(2000, 2000).cuda() y = torch.mm(x, x.t()) loss = torch.sum(y) print(f"Pod 2 - Training step {i+1}, Loss: {loss.item():.2f} at {time.strftime('%H:%M:%S')}") time.sleep(5) else: print("No GPU available!") time.sleep(60) resources: claims: - name: shared-gpu-claim-2 resourceClaims: - name: shared-gpu-claim-2 resourceClaimTemplateName: timeslicing-gpu-template nodeSelector: NodeGroupType: "gpu-dra" nvidia.com/gpu.present: "true" tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule
- Apply the template and Pod:

kubectl apply -f timeslicing-claim-template.yaml
kubectl apply -f timeslicing-pod.yaml
- Monitor resource claims:
kubectl get resourceclaims -n timeslicing-gpu -w
The following is example output:
NAME                                      STATE                AGE
inference-pod-1-shared-gpu-claim-9p97x    allocated,reserved   21s
training-pod-2-shared-gpu-claim-2-qghnb   pending              21s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s
training-pod-2-shared-gpu-claim-2-qghnb   pending              105s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s
training-pod-2-shared-gpu-claim-2-qghnb   allocated,reserved   105s
inference-pod-1-shared-gpu-claim-9p97x    pending              105s
First Pod (inference-pod-1)
- State: allocated,reserved
- Meaning: DRA found an available GPU and reserved it for this Pod
- Pod status: Starts running immediately

Second Pod (training-pod-2)
- State: pending
- Meaning: Waiting for DRA to configure time-slicing on the same GPU
- Pod status: Waiting to be scheduled
- The state will go from pending to allocated,reserved to running
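To confirm that both workloads end up running on the shared GPU, you can watch the Pods and tail their logs (the namespace and Pod names match the manifests above):

kubectl get pods -n timeslicing-gpu -w
kubectl logs -n timeslicing-gpu inference-pod-1 --tail=5
kubectl logs -n timeslicing-gpu training-pod-2 --tail=5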
Optimize GPU workloads with MPS
Multi-Process Service (MPS) enables concurrent execution of multiple CUDA contexts on a single GPU with better isolation than time-slicing.
Do the following steps.
- Define a ResourceClaimTemplate for MPS with a file named mps-claim-template.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: mps-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mps-gpu-template
  namespace: mps-gpu
spec:
  spec:
    devices:
      requests:
        - name: shared-gpu
          deviceClassName: gpu.nvidia.com
      config:
        - requests: ["shared-gpu"]
          opaque:
            driver: gpu.nvidia.com
            parameters:
              apiVersion: resource.nvidia.com/v1beta1
              kind: GpuConfig
              sharing:
                strategy: MPS
- Define a Pod using MPS with a file named mps-pod.yaml
:--- # Single Pod with Multiple Containers sharing GPU via MPS apiVersion: v1 kind: Pod metadata: name: mps-multi-container-pod namespace: mps-gpu labels: app: mps-demo spec: restartPolicy: Never containers: # Container 1 - Inference workload - name: inference-container image: nvcr.io/nvidia/pytorch:25.04-py3 command: ["python", "-c"] args: - | import torch import torch.nn as nn import time import os print(f"=== INFERENCE CONTAINER STARTING ===") print(f"Process ID: {os.getpid()}") print(f"GPU available: {torch.cuda.is_available()}") print(f"GPU count: {torch.cuda.device_count()}") if torch.cuda.is_available(): device = torch.cuda.current_device() print(f"Current GPU: {torch.cuda.get_device_name(device)}") print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB") # Create inference model model = nn.Sequential( nn.Linear(1000, 500), nn.ReLU(), nn.Linear(500, 100) ).cuda() # Run inference for i in range(1, 999999): with torch.no_grad(): x = torch.randn(128, 1000).cuda() output = model(x) result = torch.sum(output) print(f"Inference Container PID {os.getpid()}: Batch {i}, Result: {result.item():.2f} at {time.strftime('%H:%M:%S')}") time.sleep(2) else: print("No GPU available!") time.sleep(60) resources: claims: - name: shared-gpu-claim request: shared-gpu # Container 2 - Training workload - name: training-container image: nvcr.io/nvidia/pytorch:25.04-py3 command: ["python", "-c"] args: - | import torch import torch.nn as nn import time import os print(f"=== TRAINING CONTAINER STARTING ===") print(f"Process ID: {os.getpid()}") print(f"GPU available: {torch.cuda.is_available()}") print(f"GPU count: {torch.cuda.device_count()}") if torch.cuda.is_available(): device = torch.cuda.current_device() print(f"Current GPU: {torch.cuda.get_device_name(device)}") print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB") # Create training model model = nn.Sequential( nn.Linear(2000, 1000), nn.ReLU(), nn.Linear(1000, 500), nn.ReLU(), nn.Linear(500, 10) ).cuda() criterion = nn.MSELoss() optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Run training for epoch in range(1, 999999): x = torch.randn(64, 2000).cuda() target = torch.randn(64, 10).cuda() optimizer.zero_grad() output = model(x) loss = criterion(output, target) loss.backward() optimizer.step() print(f"Training Container PID {os.getpid()}: Epoch {epoch}, Loss: {loss.item():.4f} at {time.strftime('%H:%M:%S')}") time.sleep(3) else: print("No GPU available!") time.sleep(60) resources: claims: - name: shared-gpu-claim request: shared-gpu resourceClaims: - name: shared-gpu-claim resourceClaimTemplateName: mps-gpu-template nodeSelector: NodeGroupType: "gpu-dra" nvidia.com/gpu.present: "true" tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule
- Apply the template and create the MPS Pod:

kubectl apply -f mps-claim-template.yaml
kubectl apply -f mps-pod.yaml
- Monitor the resource claims:
kubectl get resourceclaims -n mps-gpu -w
The following is example output:
NAME                                             STATE                AGE
mps-multi-container-pod-shared-gpu-claim-2p9kx   allocated,reserved   86s
This configuration demonstrates true GPU sharing using NVIDIA Multi-Process Service (MPS) through dynamic resource allocation (DRA). Unlike time-slicing, where workloads take turns using the GPU sequentially, MPS enables both containers to run simultaneously on the same physical GPU. The key insight is that DRA MPS sharing requires multiple containers within a single Pod, not multiple separate Pods. When deployed, the DRA driver allocates one ResourceClaim to the Pod and automatically configures MPS to allow both the inference and training containers to execute concurrently.
Each container gets its own isolated GPU memory space and compute resources, with the MPS daemon coordinating access to the underlying hardware. You can verify this is working by doing the following:
- Checking nvidia-smi, which will show both containers as M+C (MPS + Compute) processes sharing the same GPU device.
- Monitoring the logs from both containers, which will display interleaved timestamps proving simultaneous execution.
This approach maximizes GPU utilization by allowing complementary workloads to share the expensive GPU hardware efficiently, rather than leaving it underutilized by a single process.
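For example, you can run nvidia-smi from inside either container of the Pod defined above (the Pod, namespace, and container names come from the manifest):

kubectl exec -it mps-multi-container-pod -n mps-gpu -c inference-container -- nvidia-smi
kubectl exec -it mps-multi-container-pod -n mps-gpu -c training-container -- nvidia-smi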
Container1: inference-container
root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:09:30 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 |
| N/A 48C P0 28W / 72W | 597MiB / 23034MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1 M+C python 246MiB |
+-----------------------------------------------------------------------------------------+
Container2: training-container
root@mps-multi-container-pod:/workspace# nvidia-smi
Wed Jul 16 21:16:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 |
| N/A 51C P0 28W / 72W | 597MiB / 23034MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1 M+C python 314MiB |
+-----------------------------------------------------------------------------------------+
Optimize GPU workloads with Multi-Instance GPU
Multi-instance GPU (MIG) provides hardware-level partitioning, creating isolated GPU instances with dedicated compute and memory resources.
Using dynamic MIG partitioning with various profiles requires the NVIDIA GPU Operator. Setting WITH_REBOOT=true in the MIG Manager configuration is essential for successful MIG deployments. You need both the NVIDIA DRA driver and the NVIDIA GPU Operator for this setup.
Step 1: Deploy NVIDIA GPU Operator
- Add the NVIDIA GPU Operator repository:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
- Create a gpu-operator-values.yaml
file:driver: enabled: false mig: strategy: mixed migManager: enabled: true env: - name: WITH_REBOOT value: "true" config: create: true name: custom-mig-parted-configs default: "all-disabled" data: config.yaml: |- version: v1 mig-configs: all-disabled: - devices: all mig-enabled: false # P4D profiles (A100 40GB) p4d-half-balanced: - devices: [0, 1, 2, 3] mig-enabled: true mig-devices: "1g.5gb": 2 "2g.10gb": 1 "3g.20gb": 1 - devices: [4, 5, 6, 7] mig-enabled: false # P4DE profiles (A100 80GB) p4de-half-balanced: - devices: [0, 1, 2, 3] mig-enabled: true mig-devices: "1g.10gb": 2 "2g.20gb": 1 "3g.40gb": 1 - devices: [4, 5, 6, 7] mig-enabled: false devicePlugin: enabled: true config: name: "" create: false default: "" toolkit: enabled: true nfd: enabled: true gfd: enabled: true dcgmExporter: enabled: true serviceMonitor: enabled: true interval: 15s honorLabels: false additionalLabels: release: kube-prometheus-stack nodeStatusExporter: enabled: false operator: defaultRuntime: containerd runtimeClass: nvidia resources: limits: cpu: 500m memory: 350Mi requests: cpu: 200m memory: 100Mi daemonsets: tolerations: - key: "nvidia.com/gpu" operator: "Exists" effect: "NoSchedule" nodeSelector: accelerator: nvidia priorityClassName: system-node-critical
- Install GPU Operator using the gpu-operator-values.yaml file:

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v25.3.1 \
  --values gpu-operator-values.yaml
This Helm chart deploys the following components and multiple MIG profiles:
- Device Plugin (GPU resource scheduling)
- DCGM Exporter (GPU metrics and monitoring)
- Node Feature Discovery (NFD - hardware labeling)
- GPU Feature Discovery (GFD - GPU-specific labeling)
- MIG Manager (Multi-Instance GPU partitioning)
- Container Toolkit (GPU container runtime)
- Operator Controller (lifecycle management)
- Verify the deployment Pods:
kubectl get pods -n gpu-operator
The following is example output:
NAME                                                              READY   STATUS      RESTARTS        AGE
gpu-feature-discovery-27rdq                                       1/1     Running     0               3h31m
gpu-operator-555774698d-48brn                                     1/1     Running     0               4h8m
nvidia-container-toolkit-daemonset-sxmh9                          1/1     Running     1 (3h32m ago)   4h1m
nvidia-cuda-validator-qb77g                                       0/1     Completed   0               3h31m
nvidia-dcgm-exporter-cvzd7                                        1/1     Running     0               3h31m
nvidia-device-plugin-daemonset-5ljm5                              1/1     Running     0               3h31m
nvidia-gpu-operator-node-feature-discovery-gc-67f66fc557-q5wkt    1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-master-5d8ffddcsl6s6   1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-worker-6t4w7           1/1     Running     1 (3h32m ago)   4h1m
nvidia-gpu-operator-node-feature-discovery-worker-9w7g8           1/1     Running     0               4h8m
nvidia-gpu-operator-node-feature-discovery-worker-k5fgs           1/1     Running     0               4h8m
nvidia-mig-manager-zvf54                                          1/1     Running     1 (3h32m ago)   3h35m
- Create an Amazon EKS cluster with a p4de managed node group for testing the MIG examples:
apiVersion: eksctl.io/v1alpha5 kind: ClusterConfig metadata: name: dra-eks-cluster region: us-east-1 version: '1.33' managedNodeGroups: # P4DE MIG Node Group with Capacity Block Reservation - name: p4de-mig-nodes amiFamily: AmazonLinux2023 instanceType: p4de.24xlarge # Capacity settings desiredCapacity: 0 minSize: 0 maxSize: 1 # Use specific subnet in us-east-1b for capacity reservation subnets: - us-east-1b # AL2023 NodeConfig for RAID0 local storage only nodeadmConfig: apiVersion: node.eks.aws/v1alpha1 kind: NodeConfig spec: instance: localStorage: strategy: RAID0 # Node labels for MIG configuration labels: nvidia.com/gpu.present: "true" nvidia.com/gpu.product: "A100-SXM4-80GB" nvidia.com/mig.config: "p4de-half-balanced" node-type: "p4de" vpc.amazonaws.com/efa.present: "true" accelerator: "nvidia" # Node taints taints: - key: nvidia.com/gpu value: "true" effect: NoSchedule # EFA support efaEnabled: true # Placement group for high-performance networking placementGroup: groupName: p4de-placement-group strategy: cluster # Capacity Block Reservation (CBR) # Ensure CBR ID matches the subnet AZ with the Nodegroup subnet spot: false capacityReservation: capacityReservationTarget: capacityReservationId: "cr-abcdefghij" # Replace with your capacity reservation ID
NVIDIA GPU Operator uses the nvidia.com/mig.config: "p4de-half-balanced" label added to the nodes and partitions the GPUs with the given profile.
- Log in to the p4de instance.
- Run the following command:
nvidia-smi -L
You should see the following example output:
[root@ip-100-64-173-145 bin]# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-ab52e33c-be48-38f2-119e-b62b9935925a)
  MIG 3g.40gb     Device  0: (UUID: MIG-da972af8-a20a-5f51-849f-bc0439f7970e)
  MIG 2g.20gb     Device  1: (UUID: MIG-7f9768b7-11a6-5de9-a8aa-e9c424400da4)
  MIG 1g.10gb     Device  2: (UUID: MIG-498adad6-6cf7-53af-9d1a-10cfd1fa53b2)
  MIG 1g.10gb     Device  3: (UUID: MIG-3f55ef65-1991-571a-ac50-0dbf50d80c5a)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-0eabeccc-7498-c282-0ac7-d3c09f6af0c8)
  MIG 3g.40gb     Device  0: (UUID: MIG-80543849-ea3b-595b-b162-847568fe6e0e)
  MIG 2g.20gb     Device  1: (UUID: MIG-3af1958f-fac4-59f1-8477-9f8d08c55029)
  MIG 1g.10gb     Device  2: (UUID: MIG-401088d2-716f-527b-a970-b1fc7a4ac6b2)
  MIG 1g.10gb     Device  3: (UUID: MIG-8c56c75e-5141-501c-8f43-8cf22f422569)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-1c7a1289-243f-7872-a35c-1d2d8af22dd0)
  MIG 3g.40gb     Device  0: (UUID: MIG-e9b44486-09fc-591a-b904-0d378caf2276)
  MIG 2g.20gb     Device  1: (UUID: MIG-ded93941-9f64-56a3-a9b1-a129c6edf6e4)
  MIG 1g.10gb     Device  2: (UUID: MIG-6c317d83-a078-5c25-9fa3-c8308b379aa1)
  MIG 1g.10gb     Device  3: (UUID: MIG-2b070d39-d4e9-5b11-bda6-e903372e3d08)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-9a6250e2-5c59-10b7-2da8-b61d8a937233)
  MIG 3g.40gb     Device  0: (UUID: MIG-20e3cd87-7a57-5f1b-82e7-97b14ab1a5aa)
  MIG 2g.20gb     Device  1: (UUID: MIG-04430354-1575-5b42-95f4-bda6901f1ace)
  MIG 1g.10gb     Device  2: (UUID: MIG-d62ec8b6-e097-5e99-a60c-abf8eb906f91)
  MIG 1g.10gb     Device  3: (UUID: MIG-fce20069-2baa-5dd4-988a-cead08348ada)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-5d09daf0-c2eb-75fd-3919-7ad8fafa5f86)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-99194e04-ab2a-b519-4793-81cb2e8e9179)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-c1a1910f-465a-e16f-5af1-c6aafe499cd6)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-c2cfafbc-fd6e-2679-e955-2a9e09377f78)
NVIDIA GPU Operator has successfully applied the p4de-half-balanced
MIG profile to your P4DE instance, creating hardware-level GPU
partitions as configured. Here’s how the partitioning works:
The GPU Operator applied this configuration from your embedded MIG profile:
p4de-half-balanced:
- devices: [0, 1, 2, 3] # First 4 GPUs: MIG enabled
mig-enabled: true
mig-devices:
"1g.10gb": 2 # 2x small instances (10GB each)
"2g.20gb": 1 # 1x medium instance (20GB)
"3g.40gb": 1 # 1x large instance (40GB)
- devices: [4, 5, 6, 7] # Last 4 GPUs: Full GPUs
mig-enabled: false
From your nvidia-smi -L
output, here’s what the GPU Operator created:
- MIG-enabled GPUs (0-3): hardware partitioned
  - GPU 0: NVIDIA A100-SXM4-80GB
    - MIG 3g.40gb Device 0 – Large workloads (40GB memory, 42 SMs)
    - MIG 2g.20gb Device 1 – Medium workloads (20GB memory, 28 SMs)
    - MIG 1g.10gb Device 2 – Small workloads (10GB memory, 14 SMs)
    - MIG 1g.10gb Device 3 – Small workloads (10GB memory, 14 SMs)
  - GPU 1: NVIDIA A100-SXM4-80GB
    - MIG 3g.40gb Device 0 – Identical partition layout
    - MIG 2g.20gb Device 1
    - MIG 1g.10gb Device 2
    - MIG 1g.10gb Device 3
  - GPU 2 and GPU 3 – Same pattern as GPU 0 and GPU 1
- Full GPUs (4-7): No MIG partitioning
  - GPU 4: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  - GPU 5: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  - GPU 6: NVIDIA A100-SXM4-80GB – Full 80GB GPU
  - GPU 7: NVIDIA A100-SXM4-80GB – Full 80GB GPU
Once the NVIDIA GPU Operator creates the MIG partitions, the NVIDIA DRA
Driver automatically detects these hardware-isolated instances and makes
them available for dynamic resource allocation in Kubernetes. The DRA
driver discovers each MIG instance with its specific profile (1g.10gb,
2g.20gb, 3g.40gb) and exposes them as schedulable resources through the
mig.nvidia.com
device class.
The DRA driver continuously monitors the MIG topology and maintains an
inventory of available instances across all GPUs. When a Pod requests a
specific MIG profile through a ResourceClaimTemplate
, the DRA driver
intelligently selects an appropriate MIG instance from any available
GPU, enabling true hardware-level multi-tenancy. This dynamic allocation
allows multiple isolated workloads to run simultaneously on the same
physical GPU while maintaining strict resource boundaries and
performance guarantees.
Step 2: Test MIG resource allocation
Now let’s run some examples to demonstrate how DRA dynamically allocates
MIG instances to different workloads. Deploy the
resourceclaimtemplates
and test pods to see how the DRA driver places
workloads across the available MIG partitions, allowing multiple
containers to share GPU resources with hardware-level isolation.
- Create mig-claim-template.yaml to contain the MIG resourceclaimtemplates
:apiVersion: v1 kind: Namespace metadata: name: mig-gpu --- # Template for 3g.40gb MIG instance (Large training) apiVersion: resource.k8s.io/v1beta1 kind: ResourceClaimTemplate metadata: name: mig-large-template namespace: mig-gpu spec: spec: devices: requests: - name: mig-large deviceClassName: mig.nvidia.com selectors: - cel: expression: | device.attributes['gpu.nvidia.com'].profile == '3g.40gb' --- # Template for 2g.20gb MIG instance (Medium training) apiVersion: resource.k8s.io/v1beta1 kind: ResourceClaimTemplate metadata: name: mig-medium-template namespace: mig-gpu spec: spec: devices: requests: - name: mig-medium deviceClassName: mig.nvidia.com selectors: - cel: expression: | device.attributes['gpu.nvidia.com'].profile == '2g.20gb' --- # Template for 1g.10gb MIG instance (Small inference) apiVersion: resource.k8s.io/v1beta1 kind: ResourceClaimTemplate metadata: name: mig-small-template namespace: mig-gpu spec: spec: devices: requests: - name: mig-small deviceClassName: mig.nvidia.com selectors: - cel: expression: | device.attributes['gpu.nvidia.com'].profile == '1g.10gb'
- Apply the three templates:
kubectl apply -f mig-claim-template.yaml
- Run the following command:
kubectl get resourceclaimtemplates -n mig-gpu
The following is example output:
NAME                  AGE
mig-large-template    71m
mig-medium-template   71m
mig-small-template    71m
- Create mig-pod.yaml to schedule multiple jobs that leverage these resourceclaimtemplates
---
# ConfigMap containing Python scripts for MIG pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-scripts-configmap
  namespace: mig-gpu
data:
  large-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== LARGE TRAINING POD (3g.40gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

    # Large model for 3g.40gb instance
    model = nn.Sequential(
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10)
    ).cuda()
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

    # Training loop
    for epoch in range(100):
        # Large batch for 3g.40gb
        x = torch.randn(256, 2048).cuda()
        y = torch.randint(0, 10, (256,)).cuda()

        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Large Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
        time.sleep(3)

    print("Large training completed on 3g.40gb MIG instance")
  medium-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== MEDIUM TRAINING POD (2g.20gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

    # Medium model for 2g.20gb instance
    model = nn.Sequential(
        nn.Linear(1024, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10)
    ).cuda()
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

    # Training loop
    for epoch in range(100):
        # Medium batch for 2g.20gb
        x = torch.randn(128, 1024).cuda()
        y = torch.randint(0, 10, (128,)).cuda()

        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Medium Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
        time.sleep(4)

    print("Medium training completed on 2g.20gb MIG instance")
  small-inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== SMALL INFERENCE POD (1g.10gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

    # Small model for 1g.10gb instance
    model = nn.Sequential(
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10)
    ).cuda()

    print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

    # Inference loop
    for i in range(200):
        with torch.no_grad():
            # Small batch for 1g.10gb
            x = torch.randn(32, 512).cuda()
            output = model(x)
            prediction = torch.argmax(output, dim=1)

        if i % 20 == 0:
            print(f"Small Inference - Batch {i}, Predictions: {prediction[:5].tolist()}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
        time.sleep(2)

    print("Small inference completed on 1g.10gb MIG instance")
---
# Pod 1: Large training workload (3g.40gb)
apiVersion: v1
kind: Pod
metadata:
  name: mig-large-training-pod
  namespace: mig-gpu
  labels:
    app: mig-large-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: large-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/large-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-large-claim
  resourceClaims:
  - name: mig-large-claim
    resourceClaimTemplateName: mig-large-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 2: Medium training workload (2g.20gb) - can run on SAME GPU as Pod 1
apiVersion: v1
kind: Pod
metadata:
  name: mig-medium-training-pod
  namespace: mig-gpu
  labels:
    app: mig-medium-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: medium-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/medium-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-medium-claim
  resourceClaims:
  - name: mig-medium-claim
    resourceClaimTemplateName: mig-medium-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 3: Small inference workload (1g.10gb) - can run on SAME GPU as Pod 1 & 2
apiVersion: v1
kind: Pod
metadata:
  name: mig-small-inference-pod
  namespace: mig-gpu
  labels:
    app: mig-small-inference
    workload-type: inference
spec:
  restartPolicy: Never
  containers:
  - name: small-inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/small-inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-small-claim
  resourceClaims:
  - name: mig-small-claim
    resourceClaimTemplateName: mig-small-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
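Before applying the spec, you can optionally confirm that the resourceClaimTemplates referenced by the Pods above (mig-large-template, mig-medium-template, and mig-small-template, defined earlier) exist in the mig-gpu namespace. This check is not part of the original walkthrough:
kubectl get resourceclaimtemplates -n mig-gpu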
-
Apply this spec, which should deploy three Pods:
kubectl apply -f mig-pod.yaml
These Pods are scheduled with the help of the DRA driver, which prepares the allocated MIG devices on the node. You can confirm the placement as shown below.
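As a quick optional check (not part of the original walkthrough), confirm that all three Pods are Running and, given the shared node selector, landed on the same p4de.24xlarge node:
kubectl get pods -n mig-gpu -o wide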
-
Check the DRA driver Pod logs; you will see output similar to the following:
I0717 21:50:22.925811 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:22.932499 1 driver.go:129] Returning newly prepared devices for claim '933e9c72-6fd6-49c5-933c-a896407dc6d1': [&Device{RequestNames:[mig-large],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-0-mig-9-4-4,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-0-mig-9-4-4**],}]
I0717 21:50:23.186472 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:23.191226 1 driver.go:129] Returning newly prepared devices for claim '61e5ddd2-8c2e-4c19-93ae-d317fecb44a4': [&Device{RequestNames:[mig-medium],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-2-mig-14-0-2,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-2-mig-14-0-2**],}]
I0717 21:50:23.450024 1 driver.go:87] NodePrepareResource is called: number of claims: 1
I0717 21:50:23.455991 1 driver.go:129] Returning newly prepared devices for claim '1eda9b2c-2ea6-401e-96d0-90e9b3c111b5': [&Device{RequestNames:[mig-small],PoolName:ip-100-64-173-145.ec2.internal,DeviceName:gpu-1-mig-19-2-1,CDIDeviceIDs:[k8s.gpu.nvidia.com/device=**gpu-1-mig-19-2-1**],}]
-
Verify the resourceclaims to see the Pod status:
kubectl get resourceclaims -n mig-gpu -w
The following is example output:
NAME                                             STATE                AGE
mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
mig-large-training-pod-mig-large-claim-6dpn8     pending              0s
mig-large-training-pod-mig-large-claim-6dpn8     allocated,reserved   0s
mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
mig-medium-training-pod-mig-medium-claim-bk596   pending              0s
mig-medium-training-pod-mig-medium-claim-bk596   allocated,reserved   0s
mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
mig-small-inference-pod-mig-small-claim-d2t58    pending              0s
mig-small-inference-pod-mig-small-claim-d2t58    allocated,reserved   0s
As you can see, all of the claims moved from pending to allocated,reserved by the DRA driver, at which point the Pods can start. To see which MIG device backs a given claim, describe it, as shown below.
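This optional check is not part of the original walkthrough; the claim name below is taken from the example output above, so substitute the name from your own cluster:
kubectl describe resourceclaim mig-large-training-pod-mig-large-claim-6dpn8 -n mig-gpu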
-
Run nvidia-smi from the node (if you do not already have a shell on the node, see the note after the output below). You will notice three Python processes running:
[root@ip-100-64-173-145 bin]# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:10:1C.0 Off |                   On |
| N/A   63C    P0            127W /  400W |     569MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:10:1D.0 Off |                   On |
| N/A   56C    P0            121W /  400W |     374MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:20:1C.0 Off |                   On |
| N/A   63C    P0            128W /  400W |     467MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:20:1D.0 Off |                   On |
| N/A   57C    P0            118W /  400W |     249MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:90:1C.0 Off |                    0 |
| N/A   51C    P0             77W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:90:1D.0 Off |                    0 |
| N/A   46C    P0             69W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:A0:1C.0 Off |                    0 |
| N/A   52C    P0             74W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:A0:1D.0 Off |                    0 |
| N/A   47C    P0             72W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                             |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    2   0   0  |             428MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    3   0   1  |              71MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 16383MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   10   0   3  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    1   0   0  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 32767MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   1  |              71MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 16383MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1   13   0   2  |             161MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               2MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1   14   0   3  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    1   0   0  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 32767MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    5   0   1  |             289MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |               2MiB / 16383MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   13   0   2  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2   14   0   3  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    1   0   0  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 32767MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    5   0   1  |              71MiB / 19968MiB    | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 16383MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3   13   0   2  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3   14   0   3  |              36MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |               0MiB /  8191MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
**|    0    2    0            64080      C   python                                  312MiB |
|    1   13    0            64085      C   python                                  118MiB |
|    2    5    0            64073      C   python                                  210MiB |**
+-----------------------------------------------------------------------------------------+
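If you do not have direct SSH access to the node, one way to get a node shell is kubectl debug. This is a sketch, not part of the original walkthrough: the node name below is the example node from the output above (substitute your own), the ubuntu image choice is arbitrary, and the chroot step assumes the host OS exposes nvidia-smi on its root filesystem (for example AL2023; Bottlerocket hosts differ):
kubectl debug node/ip-100-64-173-145.ec2.internal -it --image=ubuntu
# Inside the debug container, switch to the host filesystem and run nvidia-smi
chroot /host
nvidia-smi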
Optimize GPU workloads with IMEX using GB200 P6e instances
IMEX (Internode Memory Exchange) enables memory-coherent communication across nodes for distributed training on NVIDIA GB200 UltraServers.
Complete the following steps.
-
Define a
ComputeDomain
for multi-node training in a file named imex-compute-domain.yaml:
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: distributed-training-domain
  namespace: default
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: imex-channel-template
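Before applying it, you can optionally verify that the NVIDIA DRA driver providing the ComputeDomain API is installed. The CRD name below is an assumption derived from the resource.nvidia.com group used above, so adjust it if your installation differs:
kubectl get crd computedomains.resource.nvidia.com
kubectl get pods -n nvidia-dra-driver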
-
Define a Pod using IMEX channels with a file named
imex-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: imex-distributed-training
  namespace: default
  labels:
    app: imex-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: distributed-training
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["bash", "-c"]
    args:
    - |
      echo "=== IMEX Channel Verification ==="
      ls -la /dev/nvidia-caps-imex-channels/
      echo ""
      echo "=== GPU Information ==="
      nvidia-smi
      echo ""
      echo "=== NCCL Test (if available) ==="
      python -c "
      import torch
      import torch.distributed as dist
      import os
      print(f'CUDA available: {torch.cuda.is_available()}')
      print(f'CUDA device count: {torch.cuda.device_count()}')
      if torch.cuda.is_available():
          for i in range(torch.cuda.device_count()):
              print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
      # Check for IMEX environment variables
      imex_vars = [k for k in os.environ.keys() if 'IMEX' in k or 'NVLINK' in k]
      if imex_vars:
          print('IMEX Environment Variables:')
          for var in imex_vars:
              print(f'  {var}={os.environ[var]}')
      print('IMEX channel verification completed')
      "
      # Keep container running for inspection
      sleep 3600
    resources:
      claims:
      - name: imex-channel-0
      - name: imex-channel-1
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-template
  - name: imex-channel-1
    resourceClaimTemplateName: imex-channel-template
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Note
This requires P6e GB200 instances.
-
Deploy IMEX by applying the
ComputeDomain
and templates:
kubectl apply -f imex-claim-template.yaml
kubectl apply -f imex-compute-domain.yaml
kubectl apply -f imex-pod.yaml
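As an optional check (not part of the original walkthrough), confirm that the ResourceClaimTemplate referenced by the ComputeDomain spec above exists in the default namespace:
kubectl get resourceclaimtemplates -n default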
-
Check the
ComputeDomain
status:
kubectl get computedomain distributed-training-domain
-
Monitor the IMEX daemon deployment.
kubectl get pods -n nvidia-dra-driver -l resource.nvidia.com/computeDomain
-
Check the IMEX channels in the Pod:
kubectl exec imex-distributed-training -- ls -la /dev/nvidia-caps-imex-channels/
-
View the Pod logs:
kubectl logs imex-distributed-training
The following is an example of expected output:
=== IMEX Channel Verification ===
total 0
drwxr-xr-x. 2 root root     80 Jul  8 10:45 .
drwxr-xr-x. 6 root root    380 Jul  8 10:45 ..
crw-rw-rw-. 1 root root 241,  0 Jul  8 10:45 channel0
crw-rw-rw-. 1 root root 241,  1 Jul  8 10:45 channel1
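Once the channels are visible, a minimal way to exercise cross-node GPU communication is a small torch.distributed all-reduce. The following sketch is not part of the original example: the script name all_reduce_check.py is hypothetical, and it assumes the script runs in a Pod on each node (for example via a launcher such as torchrun) with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT set, and that NCCL is available in the container image. A matching result on every rank confirms basic NCCL connectivity between the Pods; it does not by itself prove that traffic takes the IMEX/NVLink path.
# all_reduce_check.py - minimal cross-node NCCL all-reduce sketch
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # NCCL backend; rendezvous settings come from the environment variables above
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own rank value; the all-reduce sums them
    tensor = torch.full((1,), float(rank), device="cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    expected = sum(range(world_size))
    print(f"rank {rank}: all_reduce result={tensor.item()} (expected {expected})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()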
For more information, see the NVIDIA example.