Using GPU partitions in Amazon SageMaker HyperPod
Cluster administrators can choose how to maximize GPU utilization across their organization. You can enable GPU partitioning with NVIDIA Multi-Instance GPU (MIG) technology to divide GPU resources into smaller, isolated instances for better resource utilization. This lets you run multiple smaller tasks concurrently on a single GPU instead of dedicating the entire device to a single, often underutilized task, reducing wasted compute and memory.
MIG technology lets you partition a single supported GPU into up to seven separate GPU partitions. Each GPU partition has dedicated memory, cache, and compute resources, providing predictable isolation.
Benefits
- Improved GPU utilization - Maximize compute efficiency by partitioning GPUs based on compute and memory requirements
- Task isolation - Each GPU partition operates independently with dedicated memory, cache, and compute resources
- Task flexibility - Run a mix of tasks on a single physical GPU, all in parallel
- Flexible setup management - Supports both do-it-yourself (DIY) Kubernetes configuration with the Kubernetes command-line client kubectl and a managed option that uses custom labels to configure and apply GPU partitions (see the sketch after this list)
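For example, on a DIY setup that runs the NVIDIA GPU Operator's MIG manager, you can request a partition layout by labeling a node. This is a minimal sketch, assuming the MIG manager is installed and watching the `nvidia.com/mig.config` label; the node name is a placeholder:

```bash
# Ask the MIG manager to reconfigure every GPU on the node into 1g.10gb partitions
kubectl label node <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite
```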
Supported Instance Types
GPU partitioning with MIG technology is supported on the following HyperPod instance types:
[A100 GPU Instances](https://aws.amazon.com/ec2/instance-types/p4/)
- ml.p4d.24xlarge - 8 NVIDIA A100 GPUs (40GB HBM2 per GPU)

[H100 GPU Instances](https://aws.amazon.com/ec2/instance-types/p5/)
- ml.p5.48xlarge - 8 NVIDIA H100 GPUs (80GB HBM3 per GPU)

[H200 GPU Instances](https://aws.amazon.com/ec2/instance-types/p5/)
- ml.p5e.48xlarge - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)
- ml.p5en.48xlarge - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)
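Once a supported instance joins the cluster and partitioning is in effect, you can check which MIG resources a node advertises. This sketch assumes the NVIDIA device plugin's mixed strategy, which exposes each partition as a `nvidia.com/mig-<profile>` resource; the node name is a placeholder:

```bash
# Show the MIG capacity and allocatable resources reported by the node
kubectl describe node <node-name> | grep nvidia.com/mig
```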
GPU partitions
NVIDIA MIG profiles define how GPUs are partitioned. Each profile specifies the compute and memory allocation per MIG instance. The following are the MIG profiles associated with each GPU type:
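On a DIY setup, you can inspect and create these profiles directly on a node with nvidia-smi. A minimal sketch; the 1g.10gb profile is just an example, and the profiles each GPU actually supports vary by GPU type, as the tables below show:

```bash
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
nvidia-smi mig -lgip

# Create a GPU instance for one profile, plus its default compute instance (-C)
sudo nvidia-smi mig -i 0 -cgi 1g.10gb -C
```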
A100 GPU (ml.p4d.24xlarge)
| Profile | Memory (GB) | Instances per GPU | Total per ml.p4d.24xlarge |
|---|---|---|---|
| 1g.5gb | 5 | 7 | 56 |
| 2g.10gb | 10 | 3 | 24 |
| 3g.20gb | 20 | 2 | 16 |
| 4g.20gb | 20 | 1 | 8 |
| 7g.40gb | 40 | 1 | 8 |
H100 GPU (ml.p5.48xlarge)
| Profile | Memory (GB) | Instances per GPU | Total per ml.p5.48xlarge |
|---|---|---|---|
| 1g.10gb | 10 | 7 | 56 |
| 1g.20gb | 20 | 4 | 32 |
| 2g.20gb | 20 | 3 | 24 |
| 3g.40gb | 40 | 2 | 16 |
| 4g.40gb | 40 | 1 | 8 |
| 7g.80gb | 80 | 1 | 8 |
H200 GPU (ml.p5e.48xlarge and ml.p5en.48xlarge)
| Profile | Memory (GB) | Instances per GPU | Total per instance |
|---|---|---|---|
| 1g.18gb | 18 | 7 | 56 |
| 1g.35gb | 35 | 4 | 32 |
| 2g.35gb | 35 | 3 | 24 |
| 3g.71gb | 71 | 2 | 16 |
| 4g.71gb | 71 | 1 | 8 |
| 7g.141gb | 141 | 1 | 8 |
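To schedule a task onto a partition, request the matching MIG resource in the pod spec. This is a minimal sketch, assuming the NVIDIA device plugin's mixed strategy (which names resources `nvidia.com/mig-<profile>`); the pod name, image, and command are illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-task
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.07-py3  # illustrative image
    command: ["python", "train.py"]          # placeholder workload
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1            # one 1g.10gb partition of an H100
```

Per the H100 table above, seven such pods can share a single GPU, up to 56 per ml.p5.48xlarge instance.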