Creating a HyperPod EKS cluster with a restricted instance group (RIG)
This topic covers the steps to create an Amazon SageMaker HyperPod EKS cluster with a restricted instance group (RIG). A RIG configuration in SageMaker HyperPod EKS clusters provides a specialized environment for training Amazon Nova models. A RIG has the following restrictions:
- RIG workloads run in a VPC with no internet access, and all ingress and egress traffic is strictly regulated.
- To keep the Nova model training environment secure, RIG restricts Kubernetes observability functions such as kubectl exec and kubectl logs.
- RIG allows only Nova customization images. Jobs that run with other images are denied.
You can create RIGs when setting up instance groups in your HyperPod EKS cluster. While you can control the size and scaling of these resources, you cannot directly access the worker nodes. This architecture ensures Nova components (model weights, checkpoints, training data, and code) are only accessible through regulated channels and a service-managed account system.
Nova model customization on SageMaker HyperPod relies on a service-managed FSx for Lustre file system to achieve optimal performance. When creating a RIG, you must specify the volume size and throughput for the FSx for Lustre file system, which will be mounted to all worker nodes in the instance group. FSx for Lustre is used to store intermediate checkpoints and internal model states during distributed training. Follow the guidance provided in the recipe to choose an appropriate volume size and throughput to ensure sufficient capacity and performance. FSx for Lustre usage costs will apply to your AWS account.
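As a rough illustration of where the FSx for Lustre volume size and throughput fit, the sketch below builds one restricted instance group entry as a plain Python dictionary. The field names (RestrictedInstanceGroups, EnvironmentConfig, FSxLustreConfig, SizeInGiB, PerUnitStorageThroughput) are assumptions based on the shape of the SageMaker CreateCluster API and should be checked against the current API reference. The helper build_restricted_instance_group, the group name, instance type, role ARN, and sizing values are hypothetical placeholders, not recommendations; take the actual size and throughput from the recipe guidance.

```python
import json


def build_restricted_instance_group(name, instance_type, instance_count,
                                    execution_role_arn, fsx_size_gib,
                                    fsx_throughput_mb_per_tib):
    """Return one RIG entry that carries its FSx for Lustre sizing.

    Field names are assumed from the CreateCluster request shape;
    verify them against the API reference before use.
    """
    if fsx_size_gib <= 0 or fsx_throughput_mb_per_tib <= 0:
        raise ValueError("FSx for Lustre size and throughput must be positive")
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        # RIG uses only the execution role for permissions.
        "ExecutionRole": execution_role_arn,
        "EnvironmentConfig": {
            "FSxLustreConfig": {
                "SizeInGiB": fsx_size_gib,
                "PerUnitStorageThroughput": fsx_throughput_mb_per_tib,
            }
        },
    }


# Placeholder values for illustration only.
rig = build_restricted_instance_group(
    name="example-nova-rig",
    instance_type="ml.p5.48xlarge",
    instance_count=2,
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleRigExecutionRole",
    fsx_size_gib=12000,
    fsx_throughput_mb_per_tib=125,
)
print(json.dumps({"RestrictedInstanceGroups": [rig]}, indent=2))
```

Keeping the file system sizing next to the instance group definition reflects that the file system is mounted to every worker node in that group, so the two are chosen together.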
Important notes for RIG in HyperPod EKS clusters
- RIG supports only the execution role for permissions. Ensure that the execution role includes the necessary IAM permissions, such as access to Amazon S3.
- When using service-managed Amazon FSx for Lustre and Amazon S3, ensure that your FSx for Lustre file system is appropriately sized for your workload. The training data manifest is uploaded to Amazon S3, which must be accessible by the execution role.
- A RIG must be created or updated on a new SageMaker HyperPod EKS cluster, specifically one created on or after July 16, 2025. Clusters created before this date might contain incompatible software versions or configurations that RIG does not support.
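The cluster creation date requirement can be checked before you attempt to add a RIG. The following is a minimal sketch, assuming the cutoff is interpreted as midnight UTC on July 16, 2025 (the time zone is not stated in this topic) and that you already have the cluster's creation timestamp, for example from a DescribeCluster response. The names RIG_CUTOFF and supports_rig are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: the July 16, 2025 cutoff is treated as 00:00 UTC.
RIG_CUTOFF = datetime(2025, 7, 16, tzinfo=timezone.utc)


def supports_rig(creation_time):
    """Return True if the cluster was created on or after the cutoff."""
    if creation_time.tzinfo is None:
        # Treat naive timestamps as UTC so the comparison is well defined.
        creation_time = creation_time.replace(tzinfo=timezone.utc)
    return creation_time >= RIG_CUTOFF


# Illustrative timestamps, not real cluster data.
older_cluster = datetime(2025, 6, 30, 12, 0, tzinfo=timezone.utc)
newer_cluster = datetime(2025, 8, 1, 9, 30, tzinfo=timezone.utc)
print(supports_rig(older_cluster), supports_rig(newer_cluster))
```

A date check like this only screens out clusters that are too old. It does not confirm that a newer cluster's software versions or configuration are otherwise compatible.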
Create a HyperPod EKS cluster with RIG (Console)
Follow these instructions to create a HyperPod EKS cluster with a RIG using the HyperPod console.
Create a HyperPod EKS cluster with RIG (CLI)
Follow these instructions to create a HyperPod EKS cluster with a RIG using the AWS CLI.