

# Deploying models on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-model-deployment"></a>

Amazon SageMaker HyperPod now extends beyond training to deliver a comprehensive inference platform that combines the flexibility of Kubernetes with the operational excellence of AWS managed services. Deploy, scale, and optimize your machine learning models with enterprise-grade reliability using the same HyperPod compute throughout the entire model lifecycle.

Amazon SageMaker HyperPod offers flexible deployment interfaces that allow you to deploy models through multiple methods including kubectl, Python SDK, Amazon SageMaker Studio UI, or HyperPod CLI. The service provides advanced autoscaling capabilities with dynamic resource allocation that automatically adjusts based on demand. Additionally, it includes comprehensive observability and monitoring features that track critical metrics such as time-to-first-token, latency, and GPU utilization to help you optimize performance.

**Note**  
When deploying on GPU-enabled instances, you can use GPU partitioning with Multi-Instance GPU (MIG) technology to run multiple inference workloads on a single GPU. This allows for better GPU utilization and cost optimization. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

**Unified infrastructure for training and inference**

Maximize your GPU utilization by seamlessly transitioning compute resources between training and inference workloads. This reduces the total cost of ownership while maintaining operational continuity.

**Enterprise-ready deployment options**

Deploy models from multiple sources, including open-weights and gated models from Amazon SageMaker JumpStart and custom models from Amazon S3 and Amazon FSx, with support for both single-node and multi-node inference architectures.

**Managed tiered key-value (KV) caching and intelligent routing**

KV caching saves the precomputed key-value vectors after processing previous tokens. When the next token is processed, the vectors don't need to be recalculated. Through a two-tier caching architecture, you can configure an L1 cache that uses CPU memory for low-latency local reuse, and an L2 cache that leverages Redis to enable scalable, node-level cache sharing.
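The lookup order across the two tiers can be sketched in a few lines of shell. This is a conceptual illustration only, not HyperPod code; the cache contents and key names are made up for the example.

```shell
# Conceptual sketch of the L1/L2 lookup order; keys and values are illustrative.
declare -A L1_CACHE L2_CACHE
L2_CACHE["prompt-prefix-abc"]="kv-tensors-for-abc"   # entry already offloaded to L2

lookup_kv() {
    local key=$1
    if [[ -n "${L1_CACHE[$key]}" ]]; then
        echo "L1 hit: ${L1_CACHE[$key]}"             # low-latency local reuse
    elif [[ -n "${L2_CACHE[$key]}" ]]; then
        L1_CACHE[$key]=${L2_CACHE[$key]}             # promote to L1 for later requests
        echo "L2 hit (promoted): ${L1_CACHE[$key]}"
    else
        echo "miss: recompute KV for $key"
    fi
}

lookup_kv "prompt-prefix-abc"   # first call hits L2 and promotes the entry to L1
```

A second lookup of the same key then hits L1 directly, which is the latency win the two-tier design is after.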

Intelligent routing analyzes incoming requests and directs them to the inference instance most likely to have relevant cached key-value pairs. The system examines the request and then routes it based on one of the following routing strategies:

1. `prefixaware` — Subsequent requests with the same prompt prefix are routed to the same instance.

1. `kvaware` — Incoming requests are routed to the instance with the highest KV cache hit rate.

1. `session` — Requests from the same user session are routed to the same instance.

1. `roundrobin` — Even distribution of requests without considering the state of the KV cache.
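As a rough sketch of how the `prefixaware` strategy keeps matching prefixes together, a router can hash a fixed-length prompt prefix and map the hash to an instance index. The prefix length and instance count below are arbitrary example values; this illustrates the idea rather than the actual routing implementation.

```shell
# Illustrative prefix-aware routing: requests that share the same prompt
# prefix hash to the same instance index. Example values throughout.
PROMPT="Summarize the following document: quarterly sales report..."
PREFIX=${PROMPT:0:32}          # requests sharing this prefix route together
NUM_INSTANCES=4                # example fleet size
HASH=$(printf '%s' "$PREFIX" | cksum | cut -d' ' -f1)
INSTANCE=$((HASH % NUM_INSTANCES))
echo "Routing to instance $INSTANCE"
```

Because the hash depends only on the prefix, any request beginning with the same 32 characters lands on the same instance, where its KV cache entries are most likely to live.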

For more information on how to enable this feature, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route).

**Built-in tiered storage support for the L2 KV cache**

Building on the existing KV cache infrastructure, HyperPod integrates SageMaker managed tiered storage as an additional L2 backend option alongside Redis. This gives you a more scalable and higher-performance option for cache offloading, which is particularly beneficial for high-throughput LLM inference workloads, while maintaining compatibility with existing vLLM model servers and routing capabilities.

**Note**  
**Data encryption:** KV cache data (attention keys and values) is stored unencrypted at rest to optimize inference latency and improve performance. For workloads with strict encryption-at-rest requirements, consider application-layer encryption of prompts and responses, or disable caching.  
**Data isolation:** When using managed tiered storage as the L2 cache backend, multiple inference deployments within a cluster share cache storage with no isolation. L2 KV cache data (attention keys and values) from different deployments is not separated. For workloads requiring data isolation (multi-tenant scenarios, different data classification levels), deploy to separate clusters or use dedicated Redis instances.

**Multi-instance type deployment with automatic failover**

HyperPod Inference supports multi-instance type deployment to improve deployment reliability and resource utilization. Specify a prioritized list of instance types in your deployment configuration, and the system automatically selects from available alternatives when your preferred instance type lacks capacity. The Kubernetes scheduler uses `preferredDuringSchedulingIgnoredDuringExecution` node affinity to evaluate instance types in priority order, placing workloads on the highest-priority available instance type while ensuring deployment even when preferred resources are unavailable. This capability prevents deployment failures due to capacity constraints while maintaining your cost and performance preferences, ensuring continuous service availability even during cluster capacity fluctuations.
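The prioritized list corresponds to weighted `preferredDuringSchedulingIgnoredDuringExecution` terms on the standard `node.kubernetes.io/instance-type` node label. The following fragment is an illustrative sketch of that affinity shape; the instance types, weights, and file path are example values, and the surrounding deployment fields are omitted.

```shell
# Write an illustrative affinity fragment; instance types and weights are
# example values, not a definitive HyperPod configuration.
cat <<'EOF' > /tmp/instance-priority-affinity.yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                     # highest-priority instance type
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["ml.g5.12xlarge"]
    - weight: 50                      # fallback when the preferred type lacks capacity
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["ml.g6.12xlarge"]
EOF
```

Because the terms are preferences rather than hard requirements, the pod still schedules onto whatever capacity exists when neither listed type is available.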

**Custom node affinity for granular scheduling control**

HyperPod Inference supports custom node affinity to control workload placement beyond instance type selection. Specify node selection criteria such as availability zone distribution, capacity type filtering (on-demand vs. spot), or custom node labels through the `nodeAffinity` field. The system supports mandatory placement constraints using `requiredDuringSchedulingIgnoredDuringExecution` and optional preferences through `preferredDuringSchedulingIgnoredDuringExecution`, providing full control over pod scheduling decisions while maintaining deployment flexibility.
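As an illustration, a `nodeAffinity` value that pins pods to specific Availability Zones while preferring on-demand capacity might look like the following sketch. The label keys are standard Kubernetes and EKS node labels; the zone names and file path are example values.

```shell
# Write an illustrative nodeAffinity fragment; zone names are example values.
cat <<'EOF' > /tmp/custom-node-affinity.yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:   # mandatory placement constraint
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b"]
  preferredDuringSchedulingIgnoredDuringExecution:  # optional preference
  - weight: 1
    preference:
      matchExpressions:
      - key: eks.amazonaws.com/capacityType
        operator: In
        values: ["ON_DEMAND"]
EOF
```

The required term rejects any node outside the listed zones, while the preferred term only tilts scheduling toward on-demand nodes when both capacity types are present.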

**Note**  
We collect certain routine operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model inference workload. These metrics relate to deployment operations, resource management, and endpoint registration.

**Topics**
+ [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md)
+ [Deploy foundation models and custom fine-tuned models](sagemaker-hyperpod-model-deployment-deploy.md)
+ [Autoscaling policies for your HyperPod inference model deployment](sagemaker-hyperpod-model-deployment-autoscaling.md)
+ [Implementing inference observability on HyperPod clusters](sagemaker-hyperpod-model-deployment-observability.md)
+ [Task governance for model deployment on HyperPod](sagemaker-hyperpod-model-deployment-task-gov.md)
+ [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md)
+ [Amazon SageMaker HyperPod Inference release notes](sagemaker-hyperpod-inference-release-notes.md)

# Setting up your HyperPod clusters for model deployment
<a name="sagemaker-hyperpod-model-deployment-setup"></a>

This guide shows you how to enable inference capabilities on Amazon SageMaker HyperPod clusters. You'll set up the infrastructure, permissions, and operators that machine learning engineers need to deploy and manage inference endpoints.

**Note**  
To create a cluster with the inference operator pre-installed, see [Create an EKS-orchestrated SageMaker HyperPod cluster](sagemaker-hyperpod-quickstart.md#sagemaker-hyperpod-quickstart-eks). To install the inference operator on an existing cluster, continue with the following procedures.

You can install the inference operator using the SageMaker AI console for a streamlined experience, or use the AWS CLI for more control. This guide covers both installation methods.

## Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)
<a name="sagemaker-hyperpod-model-deployment-setup-ui"></a>

The SageMaker AI console provides the most streamlined experience with two installation options:
+ **Quick Install:** Automatically creates all required resources with optimized defaults, including IAM roles, Amazon S3 buckets, and dependency add-ons. A new Studio domain will be created with required permissions to deploy a JumpStart model to the relevant cluster. This option is ideal for getting started quickly with minimal configuration decisions.
+ **Custom Install:** Provides the flexibility to specify existing resources or customize configurations while maintaining the one-click experience. You can reuse existing IAM roles, Amazon S3 buckets, or dependency add-ons based on your organizational requirements.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-ui-prereqs"></a>
+ An existing HyperPod cluster with Amazon EKS orchestration
+ IAM permissions for Amazon EKS cluster administration
+ kubectl configured for cluster access

### Installation steps
<a name="sagemaker-hyperpod-model-deployment-setup-ui-steps"></a>

1. Navigate to the SageMaker AI console and go to **HyperPod Clusters** → **Cluster Management**.

1. Select the cluster where you want to install the inference operator.

1. Navigate to the **Inference** tab. Select **Quick Install** for automated setup or **Custom Install** for configuration flexibility.

1. If choosing Custom Install, specify existing resources or customize settings as needed.

1. Click **Install** to begin the automated installation process.

1. Verify the installation status through the console, or by running the following commands:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   ```
   aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION
   ```

After the add-on is successfully installed, you can deploy models using the model deployment documentation or navigate to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).

## Method 2: Installing the Inference Operator using the AWS CLI
<a name="sagemaker-hyperpod-model-deployment-setup-addon"></a>

The AWS CLI installation method provides more control over the installation process and is suitable for automation and advanced configurations.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq-addon"></a>

The inference operator enables deployment and management of machine learning inference endpoints on your Amazon EKS cluster. Before installation, ensure your cluster has the required security configurations and supporting infrastructure. Complete these steps to configure IAM roles, install the AWS Load Balancer Controller, set up Amazon S3 and Amazon FSx CSI drivers, and deploy KEDA and cert-manager:

1. [Connect to your cluster and set up environment variables](#sagemaker-hyperpod-model-deployment-setup-connect-addon)

1. [Configure IAM roles for inference operator](#sagemaker-hyperpod-model-deployment-setup-prepare-addon)

1. [Create the ALB Controller role](#sagemaker-hyperpod-model-deployment-setup-alb-addon)

1. [Create the KEDA operator role](#sagemaker-hyperpod-model-deployment-setup-keda-addon)

1. [Install the dependency EKS Add-Ons](#sagemaker-hyperpod-model-deployment-setup-install-dependencies)

**Note**  
Alternatively, you can use CloudFormation templates to automate the prerequisite setup. For more information, see [Using CloudFormation templates to create the prerequisite stack](#sagemaker-hyperpod-model-deployment-setup-cfn).

### Connect to your cluster and set up environment variables
<a name="sagemaker-hyperpod-model-deployment-setup-connect-addon"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. Perform the following steps as an IAM principal with administrator privileges and cluster admin access to an Amazon EKS cluster. Ensure that you've created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md). Install the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, open the Amazon EKS console and select your cluster. In the **Access** tab, select **IAM Access Entries**. If no entry exists for your IAM principal, select **Create Access Entry**. Select the desired IAM principal and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by Amazon EKS. Specify the Region and the HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>" # Bucket should have prefix: hyperpod-tls-*
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
**Note**  
If using a custom bucket name that doesn't start with `hyperpod-tls-`, attach the following policy to your execution role:  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "TLSBucketDeleteObjectsPermission",
               "Effect": "Allow",
               "Action": ["s3:DeleteObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Sid": "TLSBucketGetObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:GetObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"]
           },
           {
               "Sid": "TLSBucketPutObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:PutObject", "s3:PutObjectTagging"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           }
       ]
   }
   ```

1. Set the default environment variables.

   ```
   HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system"
   ```

1. Verify connectivity to the cluster by listing all pods across all namespaces.

   ```
   kubectl get pods --all-namespaces
   ```

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   # Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```

### Configure IAM roles for inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-prepare-addon"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```

1. Associate an IAM OIDC identity provider with your EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. These policies enable secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Service": [
                   "sagemaker.amazonaws.com"
               ]
           },
           "Action": "sts:AssumeRole"
       },
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
               }
           }
       }
   ]
   }
   EOF
   ```

1. Create the execution role for the inference operator.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Create a namespace for inference operator resources.

   ```
   kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE
   ```

### Create the ALB Controller role
<a name="sagemaker-hyperpod-model-deployment-setup-alb-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/alb-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   
   # Create the role
   aws iam create-role \
       --role-name alb-role \
       --assume-role-policy-document file:///tmp/alb-trust-policy.json 
   
   # Create the policy
   ALB_POLICY_ARN=$(aws iam create-policy \
       --policy-name $ALBController_IAM_POLICY_NAME \
       --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name alb-role \
       --policy-arn $ALB_POLICY_ARN
   ```

1. Apply the `kubernetes.io/role/elb` tag to all subnets in the Amazon EKS cluster (both public and private).

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
       --region ${REGION} \
       --vpc-id ${VPC_ID} \
       --vpc-endpoint-type Gateway \
       --service-name "com.amazonaws.${REGION}.s3" \
       --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   
   # Create the role
   aws iam create-role \
       --role-name keda-operator-role \
       --assume-role-policy-document file:///tmp/keda-trust-policy.json
   
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
       --policy-name KedaOperatorPolicy \
       --policy-document file:///tmp/keda-policy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name keda-operator-role \
       --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

1. Create the trust policy.

      ```
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      ```

   1. Create an IAM role.

      ```
      # Create the role using existing trust policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```

### Install the dependency EKS Add-Ons
<a name="sagemaker-hyperpod-model-deployment-setup-install-dependencies"></a>

Before installing the inference operator, you must install the following required EKS add-ons on your cluster. The inference operator will fail to install if any of these dependencies are missing. Each add-on has a minimum version requirement for compatibility with the Inference add-on.

**Important**  
Install all dependency add-ons before attempting to install the inference operator. Missing dependencies will cause installation failures with specific error messages.

#### Required Add-ons
<a name="sagemaker-hyperpod-model-deployment-setup-required-addons"></a>

1. **Amazon S3 Mountpoint CSI Driver** (minimum version: v1.14.1-eksbuild.1)

   Required for mounting S3 buckets as persistent volumes in inference workloads.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --region $REGION \
       --service-account-role-arn $S3_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Mountpoint for Amazon S3 CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#mountpoint-for-s3-add-on).

1. **Amazon FSx CSI Driver** (minimum version: v1.6.0-eksbuild.1)

   Required for mounting FSx file systems for high-performance model storage.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --region $REGION \
       --service-account-role-arn $FSX_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Amazon FSx for Lustre CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-aws-fsx-csi-driver).

1. **Metrics Server** (minimum version: v0.7.2-eksbuild.4)

   Required for autoscaling functionality and resource metrics collection.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name metrics-server \
       --region $REGION
   ```

   For detailed installation instructions, see [Metrics Server](https://docs.aws.amazon.com/eks/latest/userguide/metrics-server.html).

1. **Cert Manager** (minimum version: v1.18.2-eksbuild.2)

   Required for TLS certificate management for secure inference endpoints.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

   For detailed installation instructions, see [cert-manager](https://docs.aws.amazon.com/eks/latest/userguide/community-addons.html#addon-cert-manager).

#### Verify Add-on Installation
<a name="sagemaker-hyperpod-model-deployment-setup-verify-dependencies"></a>

After installing the required add-ons, verify they are running correctly:

```
# Check add-on status
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

All add-ons should show status "ACTIVE" and all pods should be in "Running" state before proceeding with inference operator installation.

**Note**  
If you created your HyperPod cluster using the quick setup or custom setup options, the FSx CSI Driver and Cert Manager may already be installed. Verify their presence using the commands above.

### Installing the Inference Operator as an EKS add-on
<a name="sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon"></a>

The EKS add-on installation method provides a managed experience with automatic updates and integrated dependency validation. This is the recommended approach for installing the inference operator.

**Install the inference operator add-on**

1. Prepare the add-on configuration by gathering all required ARNs and creating the configuration file:

   ```
   # Gather required ARNs
   export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text)
   export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text)
   export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text)
   
   # Verify all ARNs are set correctly
   echo "Execution Role ARN: $EXECUTION_ROLE_ARN"
   echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN"
   echo "KEDA Role ARN: $KEDA_ROLE_ARN"
   echo "ALB Role ARN: $ALB_ROLE_ARN"
   echo "TLS S3 Bucket: $BUCKET_NAME"
   ```

1. Create the add-on configuration file with all required settings:

   ```
   cat > addon-config.json << EOF
   {
     "executionRoleArn": "$EXECUTION_ROLE_ARN",
     "tlsCertificateS3Bucket": "$BUCKET_NAME",
     "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
     "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN",
     "alb": {
       "serviceAccount": {
         "create": true,
         "roleArn": "$ALB_ROLE_ARN"
       }
     },
     "keda": {
       "auth": {
         "aws": {
           "irsa": {
             "roleArn": "$KEDA_ROLE_ARN"
           }
         }
       }
     }
   }
   EOF
   
   # Verify the configuration file
   cat addon-config.json
   ```

1. Install the inference operator add-on (minimum version: v1.0.0-eksbuild.1):

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation progress and verify successful completion:

   ```
   # Check installation status (repeat until status shows "ACTIVE")
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify pods are running
   kubectl get pods -n hyperpod-inference-system
   
   # Check operator logs for any issues
   kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
   ```

For detailed troubleshooting of installation issues, see [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md).

To verify the inference operator is working correctly, continue to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).
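
The `describe-addon` status check above can be wrapped in a polling loop rather than repeated by hand. The following sketch stubs out the `aws` command with a shell function so the loop logic is self-contained; remove the stub to poll the real CLI (it assumes `$EKS_CLUSTER_NAME` and `$REGION` are exported as in the earlier steps):

```
# Stub for illustration only -- delete this function to call the real AWS CLI
aws() { echo "ACTIVE"; }

check_addon_status() {
  # Returns the add-on status string, for example CREATING, ACTIVE, or DEGRADED
  aws eks describe-addon \
      --cluster-name "$EKS_CLUSTER_NAME" \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region "$REGION" \
      --query 'addon.status' --output text
}

until [ "$(check_addon_status)" = "ACTIVE" ]; do
  echo "Add-on not ready yet; waiting..."
  sleep 15
done
echo "Add-on is ACTIVE"
```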

### Using CloudFormation templates to create the prerequisite stack
<a name="sagemaker-hyperpod-model-deployment-setup-cfn"></a>

As an alternative to manually configuring the prerequisites, you can use CloudFormation templates to automate the creation of required IAM roles and policies for the inference operator.

1. Set up input variables. Replace the placeholder values with your own:

   ```
   #!/bin/bash
   set -e
   
   # ===== INPUT VARIABLES =====
   HP_CLUSTER_NAME="my-hyperpod-cluster"  # Replace with your HyperPod cluster name
   REGION="us-east-1"  # Replace with your AWS region
   PREFIX="my-prefix"  # Replace with your resource prefix
   SHORT_PREFIX="12a34d56"  # Replace with your short prefix (maximum 8 characters)
   CREATE_DOMAIN="true"  # Set to "false" if you don't need a SageMaker Studio domain
   STACK_NAME="hyperpod-inference-prerequisites"  # Replace with your stack name
   TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml"
   ```

1. Derive cluster and network information:

   ```
   # ===== DERIVE EKS CLUSTER NAME =====
   EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}')
   echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME"
   
   # ===== GET VPC AND OIDC =====
   VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   echo "VPC_ID=$VPC_ID"
   
   OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||')
   echo "OIDC_PROVIDER=$OIDC_PROVIDER"
   
   # ===== GET PRIVATE ROUTE TABLES =====
   ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text)
   EKS_PRIVATE_ROUTE_TABLES=""
   for rtb in $ALL_ROUTE_TABLES; do
       HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null)
       if [ -z "$HAS_IGW" ]; then
           EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb"
       fi
   done
   echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES"
   
   # ===== CHECK S3 VPC ENDPOINT =====
   S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text)
   CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false")
   echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK"
   
   # ===== GET HYPERPOD DETAILS =====
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text)
   echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN"
   
   # ===== GET DEFAULT VPC FOR DOMAIN =====
   DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text)
   echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID"
   
   DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text)
   echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS"
   
   # ===== GET INSTANCE GROUPS =====
   INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')")
   echo "INSTANCE_GROUPS=$INSTANCE_GROUPS"
   ```
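
   The `${VAR:+$VAR,}` expansion in the route table loop appends a comma separator only when the variable already holds a value, so the resulting list has no leading or trailing commas. A standalone illustration with sample route table IDs:

   ```
   ROUTE_TABLES=""
   for rtb in rtb-0aaa rtb-0bbb rtb-0ccc; do
       # Append a comma only if ROUTE_TABLES is already non-empty
       ROUTE_TABLES="${ROUTE_TABLES:+$ROUTE_TABLES,}$rtb"
   done
   echo "$ROUTE_TABLES"
   ```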

1. Create parameters file and deploy stack:

   ```
   # ===== CREATE PARAMETERS JSON =====
   cat > /tmp/cfn-params.json << EOF
   [
     {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"},
     {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"},
     {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"},
     {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"},
     {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"},
     {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"},
     {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"},
     {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"},
     {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"},
     {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"},
     {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"},
     {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"},
     {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"},
     {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"}
   ]
   EOF
   
   echo -e "\n===== CREATING CLOUDFORMATION STACK ====="
   aws cloudformation create-stack \
       --region $REGION \
       --stack-name $STACK_NAME \
       --template-url $TEMPLATE_URL \
       --parameters file:///tmp/cfn-params.json \
       --capabilities CAPABILITY_NAMED_IAM
   ```
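
   The `TieredStorageConfig` and `TieredKVCacheConfig` parameter values embed a JSON object inside a JSON string, where an escaping mistake is easy to make. Before creating the stack, you can confirm that the inner value parses; a sketch with a sample value (the instance group name here is illustrative):

   ```
   TIERED_KV='{"KVCacheMode":"Enable","InstanceGroup":["worker-group-1"],"NVMeMode":"Enable"}'
   # Parse the embedded JSON and read one field back out
   KV_MODE=$(echo "$TIERED_KV" | python3 -c 'import sys, json; print(json.load(sys.stdin)["KVCacheMode"])')
   echo "$KV_MODE"
   ```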

1. Monitor the stack creation status:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].StackStatus'
   ```

1. Once the stack is created successfully, retrieve the output values for use in the inference operator installation:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].Outputs'
   ```

After the CloudFormation stack is created, continue with [Installing the Inference Operator with EKS add-on](#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon) to install the inference operator.

## Method 3: Helm chart installation
<a name="sagemaker-hyperpod-model-deployment-setup-helm"></a>

**Note**  
For the simplest installation experience, we recommend using [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](#sagemaker-hyperpod-model-deployment-setup-ui) or [Method 2: Installing the Inference Operator using the AWS CLI](#sagemaker-hyperpod-model-deployment-setup-addon). Helm chart installation may be deprecated in a future release.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. The following steps must be run by an IAM principal with administrator privileges and cluster admin access to the Amazon EKS cluster. Verify that you have created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md), and that you have installed the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, open the Amazon EKS console and select the cluster you are using. On the **Access** tab, review the IAM access entries. If there isn't an entry for your IAM principal, choose **Create access entry**, select the desired IAM principal, and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the Amazon EKS cluster that orchestrates your HyperPod cluster. Specify the Region and HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
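
   The `cut` pipeline above splits the orchestrator ARN on `/` and keeps the second field, which is the EKS cluster name. The same extraction shown with a sample ARN (the real value comes from `describe-cluster`):

   ```
   SAMPLE_ARN="arn:aws:eks:us-east-1:111122223333:cluster/my-eks-cluster"
   # Field 2 when splitting on '/' is the cluster name
   CLUSTER_NAME=$(echo "$SAMPLE_ARN" | cut -d'/' -f2)
   echo "$CLUSTER_NAME"
   ```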

1. Set default env variables.

   ```
   LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME"
   LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME"
   S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME"
   S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller"
   HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system"
   JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME"
   FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
   ```

1. Verify connectivity to the cluster by listing all pods across namespaces. (The Amazon EKS cluster name and local kubeconfig were already set up in the first step.)

   ```
   kubectl get pods --all-namespaces
   ```

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   #Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```

### Prepare your environment for inference operator installation
<a name="sagemaker-hyperpod-model-deployment-setup-prepare"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   %%bash -x
   
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```

1. Associate an IAM OIDC identity provider with your Amazon EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. This policy enables secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   %%bash -x
   
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
   {
       "Effect": "Allow",
       "Principal": {
           "Service": [
               "sagemaker.amazonaws.com"
           ]
       },
       "Action": "sts:AssumeRole"
   },
   {
       "Effect": "Allow",
       "Principal": {
           "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
       },
       "Action": "sts:AssumeRoleWithWebIdentity",
       "Condition": {
           "StringLike": {
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
           }
       }
   }
   ]
   }
   EOF
   ```
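
   The heredoc above uses an unquoted `EOF` delimiter so that `${ACCOUNT_ID}`, `${REGION}`, and `${OIDC_ID}` are expanded into the file as it is written. An unquoted delimiter behaves like double quotes, and a quoted delimiter (`'EOF'`) behaves like single quotes, as this small sketch shows:

   ```
   NAME="demo"
   EXPANDED="expanded: $NAME"    # double quotes expand, like << EOF
   LITERAL='literal: $NAME'      # single quotes do not, like << 'EOF'
   echo "$EXPANDED"
   echo "$LITERAL"
   ```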

1. Create execution Role for the inference operator and attach the managed policy.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Download and create the IAM policy required for the AWS Load Balancer Controller to manage Application Load Balancers and Network Load Balancers in your EKS cluster.

   ```
   %%bash -x 
   
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
   ```

1. Create an IAM service account that links the Kubernetes service account with the IAM policy, enabling the AWS Load Balancer Controller to assume the necessary AWS permissions through IRSA (IAM Roles for Service Accounts).

   ```
   %%bash -x 
   
   export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME"
   
   # Create IAM service account with gathered values
   eksctl create iamserviceaccount \
   --approve \
   --override-existing-serviceaccounts \
   --name=aws-load-balancer-controller \
   --namespace=kube-system \
   --cluster=$EKS_CLUSTER_NAME \
   --attach-policy-arn=$ALB_POLICY_ARN \
   --region=$REGION
   
   # Print the values for verification
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Region: $REGION"
   echo "Policy ARN: $ALB_POLICY_ARN"
   ```

1. Apply the `kubernetes.io/role/elb` tag to the public subnets in the Amazon EKS cluster's VPC so the AWS Load Balancer Controller can discover them.

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```
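
   `describe-subnets` returns the subnet IDs tab-separated on a single line; the `tr '\t' '\n'` step converts them to one ID per line so `xargs` can tag each subnet in its own call. The transformation on sample data:

   ```
   SUBNET_LINES=$(printf 'subnet-0aaa\tsubnet-0bbb\tsubnet-0ccc' | tr '\t' '\n')
   echo "$SUBNET_LINES"
   # Count the resulting lines (one per subnet)
   SUBNET_COUNT=$(printf '%s\n' "$SUBNET_LINES" | wc -l)
   echo "$SUBNET_COUNT"
   ```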

1. Create a Namespace for KEDA and the Cert Manager.

   ```
   kubectl create namespace keda
   kubectl create namespace cert-manager
   ```

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
   --vpc-id ${VPC_ID} \
   --vpc-endpoint-type Gateway \
   --service-name "com.amazonaws.${REGION}.s3" \
   --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```

1. Configure S3 storage access:

   1. Create an IAM policy that grants the necessary S3 permissions for using Mountpoint for Amazon S3, which enables file system access to S3 buckets from within the cluster.

      ```
      %%bash -x
      
       export S3_CSI_BUCKET_NAME="<bucketname_for_mounting_through_filesystem>"
      
      cat <<EOF> s3accesspolicy.json
      {
       "Version": "2012-10-17",
      "Statement": [
          
          {
              "Sid": "MountpointAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:AbortMultipartUpload",
                  "s3:DeleteObject"
              ],
              "Resource": [
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}",
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*"
              ]
          }
      ]
      }
      EOF
      
      aws iam create-policy \
      --policy-name S3MountpointAccessPolicy \
      --policy-document file://s3accesspolicy.json
      
      cat <<EOF> s3accesstrustpolicy.json
      {
       "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringEquals": {
                      "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                       "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:s3-csi-driver-sa"
                  }
              }
          }
      ]
      }
      EOF
      
      aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json
      
      aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
      ```
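
      The policy's `Resource` entries require a bare bucket name (for example `my-model-bucket`), not an `s3://` URI. A small guard, using a hypothetical bucket name, that builds the ARN only when a bare name is supplied:

      ```
      BUCKET="my-model-bucket"   # hypothetical bucket name
      case "$BUCKET" in
        s3://*) echo "error: pass the bucket name, not a URI" >&2 ;;
        *)      BUCKET_ARN="arn:aws:s3:::$BUCKET" ;;
      esac
      echo "$BUCKET_ARN"
      ```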

   1. (Optional) Create an IAM service account for the Amazon S3 CSI driver. The Amazon S3 CSI driver requires an IAM service account with appropriate permissions to mount S3 buckets as persistent volumes in your Amazon EKS cluster. This step creates the necessary IAM role and Kubernetes service account with the required S3 access policy.

      ```
      %%bash -x 
      
      export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION"
      export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' |  tr -d '"')
      
      eksctl create iamserviceaccount \
      --name s3-csi-driver-sa \
      --namespace kube-system \
      --cluster $EKS_CLUSTER_NAME \
      --attach-policy-arn $S3_CSI_POLICY_ARN \
      --approve \
      --role-name $S3_CSI_ROLE_NAME \
      --region $REGION 
      
      kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite
      ```

   1. (Optional) Install the Amazon S3 CSI driver add-on. This driver enables your pods to mount S3 buckets as persistent volumes, providing direct access to S3 storage from within your Kubernetes workloads.

      ```
      %%bash -x
      
      export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME  --query 'Role.Arn' --output text)
      eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force
      ```

   1. (Optional) Create a Persistent Volume Claim (PVC) for S3 storage. This PVC enables your pods to request and use S3 storage as if it were a traditional file system.

      ```
      %%bash -x 
      
       cat <<EOF > pvc_s3.yaml
       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
         name: s3-claim
       spec:
         accessModes:
           - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
         storageClassName: "" # required for static provisioning
         resources:
           requests:
             storage: 1200Gi # ignored, required
         volumeName: s3-pv
       EOF
      
      kubectl apply -f pvc_s3.yaml
      ```

1. (Optional) Configure FSx storage access. Create an IAM service account for the Amazon FSx CSI driver. This service account will be used by the FSx CSI driver to interact with the Amazon FSx service on behalf of your cluster.

   ```
   %%bash -x 
   
   
   eksctl create iamserviceaccount \
   --name fsx-csi-controller-sa \
   --namespace kube-system \
   --cluster $EKS_CLUSTER_NAME \
   --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
   --approve \
   --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \
   --region $REGION
   ```

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   # Create the role
   aws iam create-role \
   --role-name keda-operator-role \
   --assume-role-policy-document file:///tmp/keda-trust-policy.json
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
   --policy-name KedaOperatorPolicy \
   --policy-document file:///tmp/keda-policy.json \
   --query 'Policy.Arn' \
   --output text)
   # Attach the policy to the role
   aws iam attach-role-policy \
   --role-name keda-operator-role \
   --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

   1. Create the trust policy and IAM role for gated model access.

      ```
      %%bash -s $REGION
      
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
       "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      
      # Create the role and attach the managed policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```

### Install the inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-install"></a>

1. Install the HyperPod inference operator. This step gathers the required AWS resource identifiers and generates the Helm installation command with the appropriate configuration parameters.

   Access the helm chart from [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm\_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli
   cd sagemaker-hyperpod-cli
   cd helm_chart/HyperPodHelmChart
   helm dependencies update charts/inference-operator
   ```

   ```
   %%bash -x
   
   HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   echo $HYPERPOD_INFERENCE_ROLE_ARN
   
   S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text)
   echo $S3_CSI_ROLE_ARN
   
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn" --output text)
   
   # Verify values
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN"
   echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN"
   # Run the HyperPod inference operator installation.
   
   helm install hyperpod-inference-operator charts/inference-operator \
   -n kube-system \
   --set region=$REGION \
   --set eksClusterName=$EKS_CLUSTER_NAME \
   --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \
   --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \
   --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \
   --set s3.node.serviceAccount.create=false \
   --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \
   --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \
   --set alb.region=$REGION \
   --set alb.clusterName=$EKS_CLUSTER_NAME \
   --set alb.vpcId=$VPC_ID
   
   # For JumpStart Gated Model usage, Add
   # --set jumpstartGatedModelDownloadRoleArn=$JUMPSTART_GATED_ROLE_ARN
   ```

1. Configure the service account annotations for IAM integration. This annotation enables the operator's service account to assume the necessary IAM permissions for managing inference endpoints and interacting with AWS services.

   ```
   %%bash -x 
   
   EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///')
   
   # Annotate service account
   kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \
   -n hyperpod-inference-system \
   eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \
   --overwrite
   ```

## Verify the inference operator is working
<a name="sagemaker-hyperpod-model-deployment-setup-verify"></a>

Follow these steps to verify that your inference operator installation is working correctly by deploying and testing a simple model.

**Deploy a test model to verify the operator**

1. Create a model deployment configuration file. This creates a Kubernetes manifest file that defines a JumpStart model deployment for the HyperPod inference operator.

   ```
   cat <<EOF > simple_model_install.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: testing-deployment-bert
     namespace: default
   spec:
     model:
       modelId: "huggingface-eqa-bert-base-cased"
     sageMakerEndpoint:
       name: "hp-inf-ep-for-testing"
     server:
       instanceType: "ml.c5.2xlarge"
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
   EOF
   ```

1. Deploy the model and clean up the configuration file.

   ```
   kubectl create -f simple_model_install.yaml
   rm -f simple_model_install.yaml
   ```

1. Verify the service account configuration to ensure the operator can assume AWS permissions.

   ```
   # Get the service account details
   kubectl get serviceaccount -n hyperpod-inference-system
   
   # Check if the service account has the AWS annotations
   kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
   ```

**Configure deployment settings (if using Studio UI)**

1. Review the recommended instance type under **Deployment settings**.

1. If modifying the **Instance type**, ensure compatibility with your HyperPod cluster. Contact your admin if compatible instances aren't available.

1. For GPU-partitioned instances with MIG enabled, select an appropriate **GPU partition** from available MIG profiles to optimize GPU utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

1. If using task governance, configure priority settings for model deployment preemption capabilities.

1. Enter the namespace provided by your admin. Contact your admin for the correct namespace if needed.

## (Optional) Set up user access through the JumpStart UI in SageMaker AI Studio Classic
<a name="sagemaker-hyperpod-model-deployment-setup-optional-js"></a>

For more background on setting up SageMaker HyperPod access for Studio Classic users and configuring fine-grained Kubernetes RBAC permissions for data scientist users, read [Setting up an Amazon EKS cluster in Studio](sagemaker-hyperpod-studio-setup-eks.md) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

1. Identify the IAM role that Data Scientist users will use to manage and deploy models to SageMaker HyperPod from SageMaker AI Studio Classic. This is typically the User Profile Execution Role or Domain Execution Role for the Studio Classic user.

   ```
   %%bash -x
   
   export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>"
   
   export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy"
   export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text)
   
   export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
   ```

1. Attach an Identity Policy enabling Model Deployment access.

   ```
   %%bash -x
   
   # Create access policy
   cat << EOF > hyperpod-deployment-ui-access-policy.json
   {
        "Version": "2012-10-17",
       "Statement": [
           {
                "Sid": "DescribeHyperPodClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster"
               ],
               "Resource": "$HYPERPOD_CLUSTER_ARN"
           },
           {
               "Sid": "UseEksClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "eks:DescribeCluster",
                   "eks:AccessKubernetesApi",
                   "eks:MutateViaKubernetesApi",
                   "eks:DescribeAddon"
               ],
               "Resource": "$EKS_CLUSTER_ARN"
           },
           {
               "Sid": "ListPermission",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:ListClusters",
                   "sagemaker:ListEndpoints"
               ],
               "Resource": "*"
           },
           {
               "Sid": "SageMakerEndpointAccess",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeEndpoint",
                   "sagemaker:InvokeEndpoint"
               ],
               "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*"
           }
       ]
   }
   EOF
   
   aws iam put-role-policy --role-name $DATASCIENTIST_ROLE_NAME --policy-name $DATASCIENTIST_POLICY_NAME --policy-document file://hyperpod-deployment-ui-access-policy.json
   ```

1. Create an EKS access entry for the user, mapping them to Kubernetes groups.

   ```
   %%bash -x
   
   export EKS_CLUSTER_NAME=$(echo $EKS_CLUSTER_ARN | cut -d'/' -f2)
   
   aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \
       --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \
       --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
   ```

1. Create Kubernetes RBAC policies for the user.

   ```
   %%bash -x
   
   cat << EOF > cluster_level_config.yaml
   kind: ClusterRole
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: hyperpod-scientist-user-cluster-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["namespaces"]
     verbs: ["list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: hyperpod-scientist-user-cluster-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-cluster-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: hyperpod-scientist-user-cluster-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f cluster_level_config.yaml
   
   
   cat << EOF > namespace_level_role.yaml
   kind: Role
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["create", "get"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/log"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/exec"]
     verbs: ["get", "create"]
   - apiGroups: ["kubeflow.org"]
     resources: ["pytorchjobs", "pytorchjobs/status"]
     verbs: ["get", "list", "create", "delete", "update", "describe"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["create", "update", "get", "list", "delete"]
   - apiGroups: [""]
     resources: ["secrets"]
     verbs: ["create", "get", "list", "delete"]
   - apiGroups: [ "inference.sagemaker.aws.amazon.com" ]
     resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ]
     verbs: [ "get", "list", "create", "delete", "update", "describe" ]
   - apiGroups: [ "autoscaling" ]
     resources: [ "horizontalpodautoscalers" ]
     verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-namespace-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: Role
     name: hyperpod-scientist-user-namespace-level-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f namespace_level_role.yaml
   ```
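As a quick sanity check on the Role above, each rule can be read as an allow-list over (apiGroup, resource, verb) triples. The following toy Python sketch is illustrative only (it is not how the Kubernetes API server evaluates RBAC internally) and shows how such rules gate an action:

```python
def rbac_allows(rules, api_group, resource, verb):
    """Toy evaluator: an action is allowed if any rule matches its apiGroup, resource, and verb."""
    return any(
        api_group in r["apiGroups"] and resource in r["resources"] and verb in r["verbs"]
        for r in rules
    )

# Two of the rules from the namespace-level Role above
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["create", "get"]},
    {"apiGroups": ["inference.sagemaker.aws.amazon.com"],
     "resources": ["inferenceendpoint"],
     "verbs": ["get", "list", "create", "delete", "update"]},
]
print(rbac_allows(rules, "inference.sagemaker.aws.amazon.com", "inferenceendpoint", "create"))  # → True
print(rbac_allows(rules, "", "pods", "delete"))  # → False
```

If a data scientist reports a permissions error, compare the failing verb and resource against the rules in the applied Role in this way before changing anything else.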

# Deploy foundation models and custom fine-tuned models
<a name="sagemaker-hyperpod-model-deployment-deploy"></a>

Whether you're deploying pre-trained foundation open-weights or gated models from Amazon SageMaker JumpStart or your own custom or fine-tuned models stored in Amazon S3 or Amazon FSx, SageMaker HyperPod provides the flexible, scalable infrastructure you need for production inference workloads.





|  | Deploy open-weights and gated foundation models from JumpStart | Deploy custom and fine-tuned models from Amazon S3 and Amazon FSx | 
| --- | --- | --- | 
| Description |  Deploy from a comprehensive catalog of pre-trained foundation models with automatic optimization and scaling policies tailored to each model family.  | Bring your own custom and fine-tuned models and leverage SageMaker HyperPod's enterprise infrastructure for production-scale inference. Choose between cost-effective storage with Amazon S3 or a high-performance file system with Amazon FSx. | 
| Key benefits | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 
| Deployment options |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 

The following sections step you through deploying models from Amazon SageMaker JumpStart and from Amazon S3 and Amazon FSx.

**Topics**
+ [Deploy models from JumpStart using Amazon SageMaker Studio](sagemaker-hyperpod-model-deployment-deploy-js-ui.md)
+ [Deploy models from JumpStart using kubectl](sagemaker-hyperpod-model-deployment-deploy-js-kubectl.md)
+ [Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl](sagemaker-hyperpod-model-deployment-deploy-ftm.md)
+ [Deploy custom fine-tuned models using the Python SDK and HPCLI](deploy-trained-model.md) 
+ [Deploy models from Amazon SageMaker JumpStart using the Python SDK and HPCLI](deploy-jumpstart-model.md) 

# Deploy models from JumpStart using Amazon SageMaker Studio
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui"></a>

The following steps show you how to deploy models from JumpStart using Amazon SageMaker Studio.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-prereqs"></a>

Verify that you've set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md). 

## Create a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-create"></a>

1. In Amazon SageMaker Studio, open the **JumpStart** landing page from the left navigation pane. 

1. Under **All public models**, choose a model you want to deploy.
**Note**  
If you’ve selected a gated model, you’ll have to accept the End User License Agreement (EULA).

1. Choose **SageMaker HyperPod**.

1. Under **Deployment settings**, JumpStart will recommend an instance for deployment. You can modify these settings if necessary.

   1. If you modify **Instance type**, ensure it’s compatible with the chosen **HyperPod cluster**. If there aren’t any compatible instances, you’ll need to select a new **HyperPod cluster** or contact your admin to add compatible instances to the cluster.

   1. To prioritize the model deployment, install the task governance add-on, create compute allocations, and set up task rankings for the cluster policy. Once this is done, you should see an option to select a priority for the model deployment, which can be used to preempt other deployments and tasks on the cluster.

   1. Enter the namespace to which your admin has provided you access. You may have to reach out directly to your admin to get the exact namespace. Once a valid namespace is provided, the **Deploy** button is enabled so you can deploy the model.

   1. If your instance type is partitioned (MIG enabled), select a **GPU partition type**.

   1. To speed up LLM inference, you can enable the L2 KV cache and intelligent routing. By default, only the L1 KV cache is enabled. For more details on KV caching and intelligent routing, see [SageMaker HyperPod model deployment](sagemaker-hyperpod-model-deployment.md).

1. Choose **Deploy** and wait for the **Endpoint** to be created.

1. After the **Endpoint** has been created, select **Test inference**.

## Edit a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-edit"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Edit**.

1. Under **Deployment settings**, you can enable or disable **Auto-scaling**, and change the number of **Max replicas**.

1. Select **Save**.

1. The **Status** will change to **Updating**. Once it changes back to **In service**, your changes are complete and you’ll see a message confirming it.

## Delete a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-delete"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Delete**.

1. In the **Delete HyperPod deployment window**, select the checkbox.

1. Choose **Delete**.

1. The **Status** will change to **Deleting**. Once the HyperPod deployment has been deleted, you’ll see a message confirming it.

# Deploy models from JumpStart using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-js-kubectl"></a>

The following steps show you how to deploy a JumpStart model to a HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands. 

## Prerequisites
<a name="kubectl-prerequisites"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="kubectl-prerequisites-setup-and-configuration"></a>

1. Select your Region.

   ```
   export REGION=<region>
   ```

1. View all SageMaker public hub models and HyperPod clusters, as shown in the following steps.

1. Select a `JumpStartModel` from the SageMaker public hub. The public hub has a large number of models available, so use `NextToken` to iteratively list all available models.

   ```
   aws sagemaker list-hub-contents --hub-name SageMakerPublicHub --hub-content-type Model --query '{Models: HubContentSummaries[].{ModelId:HubContentName,Version:HubContentVersion}, NextToken: NextToken}' --output json
   ```

   ```
   export MODEL_ID="deepseek-llm-r1-distill-qwen-1-5b"
   export MODEL_VERSION="2.0.4"
   ```

1. Configure the model ID and cluster name you’ve selected into the variables below.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   aws sagemaker list-clusters --output table
   
   # Select the cluster name where you want to deploy the model.
   export HYPERPOD_CLUSTER_NAME="<insert cluster name here>"
   
   # Select the instance that is relevant for your model deployment and exists within the selected cluster.
   # List available instances in your HyperPod cluster
   aws sagemaker describe-cluster --cluster-name=$HYPERPOD_CLUSTER_NAME --query "InstanceGroups[].{InstanceType:InstanceType,Count:CurrentCount}" --output table
   
   # List supported instance types for the selected model
   aws sagemaker describe-hub-content --hub-name SageMakerPublicHub --hub-content-type Model --hub-content-name "$MODEL_ID" --output json | jq -r '.HubContentDocument | fromjson | {Default: .DefaultInferenceInstanceType, Supported: .SupportedInferenceInstanceTypes}'
   
   
   # Select an instance type from the cluster that is compatible with the model.
   # Make sure that the selected instance type is either the default or a supported instance type for the JumpStart model.
   export INSTANCE_TYPE="<Instance_type_in_cluster>"
   ```

1. Confirm with the cluster admin which namespace you are permitted to use. The admin should have created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="default"
   ```

1. Set a name for the endpoint and the custom object to be created.

   ```
   export SAGEMAKER_ENDPOINT_NAME="deepseek-qwen-1-5b-test"
   ```

1. The following is an example of a `deepseek-llm-r1-distill-qwen-1-5b` model deployment from JumpStart. Create a similar deployment YAML file based on the model you selected in the previous step.
**Note**  
If your cluster uses GPU partitioning with MIG, you can request specific MIG profiles by adding the `acceleratorPartitionType` field to the server specification. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

   ```
   cat << EOF > jumpstart_model.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: $SAGEMAKER_ENDPOINT_NAME
     namespace: $CLUSTER_NAMESPACE 
   spec:
     sageMakerEndpoint:
       name: $SAGEMAKER_ENDPOINT_NAME
     model:
       modelHubName: SageMakerPublicHub
       modelId: $MODEL_ID
       modelVersion: $MODEL_VERSION
     server:
       instanceType: $INSTANCE_TYPE
       # Optional: Specify GPU partition profile for MIG-enabled instances
       # acceleratorPartitionType: "1g.10gb"
     metrics:
       enabled: true
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
     autoScalingSpec:
       cloudWatchTrigger:
         name: "SageMaker-Invocations"
         namespace: "AWS/SageMaker"
         useCachedMetrics: false
         metricName: "Invocations"
         targetValue: 10
         minValue: 0.0
         metricCollectionPeriod: 30
         metricStat: "Sum"
         metricType: "Average"
         dimensions:
           - name: "EndpointName"
             value: "$SAGEMAKER_ENDPOINT_NAME"
           - name: "VariantName"
             value: "AllTraffic"
   EOF
   ```
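The `NextToken` pagination used above when listing hub models follows a standard loop: request a page, collect the summaries, and repeat while a token is returned. A minimal Python sketch with a stubbed page fetcher standing in for `aws sagemaker list-hub-contents` (the stubbed values are illustrative, not real model names):

```python
def list_all_models(fetch_page):
    """Collect HubContentSummaries across pages, following NextToken until exhausted."""
    items, token = [], None
    while True:
        page = fetch_page(token)          # one list-hub-contents call per page
        items.extend(page["HubContentSummaries"])
        token = page.get("NextToken")     # absent on the final page
        if token is None:
            return items

# Stubbed two-page response mirroring the real API's response shape
pages = {
    None: {"HubContentSummaries": [{"HubContentName": "model-a"}], "NextToken": "t1"},
    "t1": {"HubContentSummaries": [{"HubContentName": "model-b"}]},
}
models = list_all_models(lambda token: pages[token])
print([m["HubContentName"] for m in models])  # → ['model-a', 'model-b']
```

The same loop shape applies to other paginated SageMaker list calls such as `list-clusters` and `list-endpoints`.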

## Deploy your model
<a name="kubectl-deploy-your-model"></a>

**Update your kubernetes configuration and deploy your model**

1. Configure kubectl to connect to the HyperPod cluster orchestrated by Amazon EKS.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your JumpStart model.

   ```
   kubectl apply -f jumpstart_model.yaml
   ```

**Monitor the status of your model deployment**

1. Verify that the model is successfully deployed.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the endpoint is successfully created.

   ```
   aws sagemaker describe-endpoint --endpoint-name=$SAGEMAKER_ENDPOINT_NAME --output table
   ```

1. Invoke your model endpoint. You can programmatically retrieve example payloads from the `JumpStartModel` object.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```

## Manage your deployment
<a name="kubectl-manage-your-deployment"></a>

Delete your JumpStart model deployment once you no longer need it.

```
kubectl delete JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
```

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the status of Kubernetes deployment. This command inspects the underlying Kubernetes deployment object that manages the pods running your model. Use this to troubleshoot pod scheduling, resource allocation, and container startup issues.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of your JumpStart model resource. This command examines the custom `JumpStartModel` resource that manages the high-level model configuration and deployment lifecycle. Use this to troubleshoot model-specific issues like configuration errors or SageMaker AI endpoint creation problems.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of all Kubernetes objects. This command provides a comprehensive overview of all related Kubernetes resources in your namespace. Use this for a quick health check to see the overall state of pods, services, deployments, and custom resources associated with your model deployment.

   ```
   kubectl get pods,svc,deployment,JumpStartModel,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm"></a>

The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl. 

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-prereqs"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-setup"></a>

Replace all placeholder values with your actual resource identifiers.

1. Select your Region in your environment.

   ```
   export REGION=<region>
   ```

1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   # Specify your HyperPod cluster name here
   export HYPERPOD_CLUSTER_NAME="<Hyperpod_cluster_name>"
   
   # NOTE: For this sample deployment, we use ml.g5.8xlarge for the DeepSeek-R1 1.5B model, which has sufficient memory and GPU
   export INSTANCE_TYPE="ml.g5.8xlarge"
   ```

1. Initialize your cluster namespace. Your cluster admin should've already created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="<namespace>"
   ```

1. Create the custom resource using one of the following options:

------
#### [ Using Amazon FSx as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-fsx"
      ```

   1. Configure the Amazon FSx file system ID to be used.

      ```
      export FSX_FILE_SYSTEM_ID="fs-1234abcd"
      ```

   1. The following is an example YAML file for creating an endpoint with Amazon FSx and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_fsx_cluster_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
         name: $SAGEMAKER_ENDPOINT_NAME
         namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: Llama-3.1-8B-Instruct
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          fsxStorage:
            fileSystemId: $FSX_FILE_SYSTEM_ID
          modelLocation: deepseek-1-5b
          modelSourceType: fsx
        worker:
          environmentVariables:
          - name: HF_MODEL_ID
            value: /opt/ml/model
          - name: SAGEMAKER_PROGRAM
            value: inference.py
          - name: SAGEMAKER_SUBMIT_DIRECTORY
            value: /opt/ml/model/code
          - name: MODEL_CACHE_ROOT
            value: /opt/ml/model
          - name: SAGEMAKER_ENV
            value: '1'
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0
          modelInvocationPort:
            containerPort: 8080
            name: http
          modelVolumeMount:
            mountPath: /opt/ml/model
            name: model-weights
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              cpu: 30000m
              memory: 100Gi
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
      EOF
      ```

------
#### [ Using Amazon S3 as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example YAML file for creating an endpoint with Amazon S3 and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        endpointName: $SAGEMAKER_ENDPOINT_NAME
        instanceType: ml.g5.8xlarge
        invocationEndpoint: invocations
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: deepseek15b
          prefetchEnabled: true
        worker:
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
              cpu: 25600m
              memory: 102Gi
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------
#### [ Using Amazon S3 as the model source with KV caching and intelligent routing ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example YAML file for creating an endpoint with Amazon S3 that also enables KV caching and intelligent routing.

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: Llama-3.1-8B-Instruct
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: bugbash-ada-resources
            region: us-west-2
          modelLocation: models/Llama-3.1-8B-Instruct
          prefetchEnabled: false
        kvCacheSpec:
          enableL1Cache: true
      #    enableL2Cache: true
      #    l2CacheSpec:
      #      l2CacheBackend: redis/sagemaker
      #      l2CacheLocalUrl: redis://redis.redis-system.svc.cluster.local:6379
        intelligentRoutingSpec:
          enabled: true
        tlsConfig:
          tlsCertificateOutputS3Uri: s3://sagemaker-lmcache-fceb9062-tls-6f6ee470
        metrics:
          enabled: true
          modelMetrics:
            port: 8000
        loadBalancer:
          healthCheckPath: /health
        worker:
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              cpu: "6"
              memory: 30Gi
              nvidia.com/gpu: "4"
          image: lmcache/vllm-openai:latest
          args:
            - "/opt/ml/model"
            - "--max-model-len"
            - "20000"
            - "--tensor-parallel-size"
            - "4"
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------

## Configure KV caching and intelligent routing for improved performance
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route"></a>

1. Enable KV caching by setting `enableL1Cache` and `enableL2Cache` to `true`. Then, set `l2CacheBackend` to `redis` and update `l2CacheLocalUrl` with the Redis cluster URL.

   ```
     kvCacheSpec:
       enableL1Cache: true
       enableL2Cache: true
       l2CacheSpec:
         l2CacheBackend: <redis | tieredstorage>
         l2CacheLocalUrl: <redis cluster URL if l2CacheBackend is redis >
   ```
**Note**  
If the Redis cluster is not within the same Amazon VPC as the HyperPod cluster, encryption of the data in transit is not guaranteed.
**Note**  
`l2CacheLocalUrl` is not needed if `tieredstorage` is selected.

1. Enable intelligent routing by setting `enabled` to `true` under `intelligentRoutingSpec`. You can specify which routing strategy to use under `routingStrategy`. If no routing strategy is specified, it defaults to `prefixaware`.

   ```
   intelligentRoutingSpec:
       enabled: true
       routingStrategy: <routing strategy to use>
   ```

1. Enable router metrics and caching metrics by setting `enabled` to `true` under `metrics`. The `port` value needs to be the same as the `containerPort` value under `modelInvocationPort`.

   ```
   metrics:
       enabled: true
       modelMetrics:
         port: <port value>
       ...
       modelInvocationPort:
         containerPort: <port value>
   ```
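Conceptually, the settings above work together: prefix-aware routing sends requests that share a prompt prefix to the same replica, and that replica's L1/L2 caches can then reuse the precomputed KV vectors for the shared prefix. The following toy Python sketch illustrates both ideas; the names (`TieredKVCache`, `prefix_aware_route`) and the dict standing in for Redis are our own illustrations, not part of the HyperPod API:

```python
import hashlib

class TieredKVCache:
    """Toy two-tier cache: L1 is per-replica local memory; L2 stands in for a shared Redis store."""
    def __init__(self, l2_store):
        self.l1, self.l2 = {}, l2_store
    def get(self, key):
        if key in self.l1:                 # L1 hit: low-latency local reuse
            return self.l1[key]
        value = self.l2.get(key)           # L2 hit: shared across replicas
        if value is not None:
            self.l1[key] = value           # promote into L1
        return value
    def put(self, key, value):
        self.l1[key] = value
        self.l2[key] = value               # publish for other replicas

def prefix_aware_route(prompt, num_replicas, prefix_len=32):
    """Toy prefix-aware routing: prompts sharing a prefix hash to the same replica index."""
    digest = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
    return int(digest, 16) % num_replicas

shared_l2 = {}                             # stands in for the Redis backend
caches = [TieredKVCache(shared_l2) for _ in range(3)]

system = "System: You are a helpful assistant. "
r1 = prefix_aware_route(system + "Question 1", len(caches))
r2 = prefix_aware_route(system + "Question 2", len(caches))
print(r1 == r2)  # → True: the shared prefix lands on the same replica

caches[r1].put(system, "kv-vectors-for-prefix")
print(caches[r2].get(system))  # → kv-vectors-for-prefix (L1 hit on the same replica)
print(caches[(r1 + 1) % 3].get(system))  # → kv-vectors-for-prefix (served via the shared L2 tier)
```

The last line shows why the L2 tier matters: even a replica that never computed the prefix can retrieve it from the shared store instead of recomputing it.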

## Deploy your model from Amazon S3 or Amazon FSx
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-deploy"></a>

1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your InferenceEndpointConfig model with one of the following options:

------
#### [ Deploy with Amazon FSx as a source ]

   ```
   kubectl apply -f deploy_fsx_cluster_inference.yaml
   ```

------
#### [ Deploy with Amazon S3 as a source ]

   ```
   kubectl apply -f deploy_s3_inference.yaml
   ```

------

## Verify the status of your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-verify"></a>

1. Check if the model successfully deployed.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check that the endpoint is successfully created.

   ```
   kubectl describe SageMakerEndpointRegistration $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```

## Manage your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-manage"></a>

When you're finished testing your deployment, use the following commands to clean up your resources.

**Note**  
Verify that you no longer need the deployed model or stored data before proceeding.

**Clean up your resources**

1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.

   ```
   kubectl delete inferenceendpointconfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the cleanup completed successfully.

   ```
   # Check that Kubernetes resources are removed
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

   ```
   # Verify SageMaker endpoint is deleted (should return error or empty)
   aws sagemaker describe-endpoint --endpoint-name $SAGEMAKER_ENDPOINT_NAME --region $REGION
   ```

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the Kubernetes deployment status.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check status of all Kubernetes objects. Get a comprehensive view of all related Kubernetes resources in your namespace. This gives you a quick overview of what's running and what might be missing.

   ```
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Autoscaling policies for your HyperPod inference model deployment
<a name="sagemaker-hyperpod-model-deployment-autoscaling"></a>

The following information provides practical examples and configurations for implementing autoscaling policies on Amazon SageMaker HyperPod inference model deployments.

You'll learn how to configure automatic scaling using the built-in `autoScalingSpec` in your deployment YAML files, as well as how to create standalone KEDA `ScaledObject` configurations for advanced scaling scenarios. The examples cover scaling triggers based on CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource utilization metrics like CPU and memory. 

## Using autoScalingSpec in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-yaml"></a>

The Amazon SageMaker HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from CloudWatch and Amazon Managed Prometheus (AMP). The following deployment YAML example includes an `autoScalingSpec` section that defines the configuration values for scaling your model deployment.

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-sample624
  namespace: ns-team-a
spec:
  sageMakerEndpoint:
    name: deepsek7bsme624
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.4
  server:
    instanceType: ml.g5.8xlarge
  metrics:
    enabled: true
  environmentVariables:
    - name: SAMPLE_ENV_VAR
      value: "sample_value"
  maxDeployTimeInSeconds: 1800
  tlsConfig:
    tlsCertificateOutputS3Uri: "s3://{USER}-tls-bucket-{REGION}/certificates"
  autoScalingSpec:
    minReplicaCount: 0
    maxReplicaCount: 5
    pollingInterval: 15
    initialCooldownPeriod: 60
    cooldownPeriod: 120
    scaleDownStabilizationTime: 60
    scaleUpStabilizationTime: 0
    cloudWatchTrigger:
        name: "SageMaker-Invocations"
        namespace: "AWS/SageMaker"
        useCachedMetrics: false
        metricName: "Invocations"
        targetValue: 10.5
        activationTargetValue: 5.0
        minValue: 0.0
        metricCollectionStartTime: 300
        metricCollectionPeriod: 30
        metricStat: "Sum"
        metricType: "Average"
        dimensions:
          - name: "EndpointName"
            value: "deepsek7bsme624"
          - name: "VariantName"
            value: "AllTraffic"
    prometheusTrigger: 
        name: "Prometheus-Trigger"
        useCachedMetrics: false
        serverAddress: http://<prometheus-host>:9090
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        targetValue: 10.0
        activationTargetValue: 5.0
        namespace: "namespace"
        customHeaders: "X-Client-Id=cid"
        metricType: "Value"
```

### Explanation of fields used in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-fields"></a>

`minReplicaCount` (Optional, Integer)  
Specifies the minimum number of model deployment replicas to maintain in the cluster. During scale-down events, the deployment scales down to this minimum number of pods. Must be greater than or equal to 0. Default: 1.

`maxReplicaCount` (Optional, Integer)  
Specifies the maximum number of model deployment replicas to maintain in the cluster. Must be greater than or equal to `minReplicaCount`. During scale-up events, the deployment scales up to this maximum number of pods. Default: 5.

`pollingInterval` (Optional, Integer)  
The time interval in seconds for querying metrics. Minimum: 0. Default: 30 seconds.

`cooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during a scale-down event. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`initialCooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during initial deployment. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`scaleDownStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-down trigger activates before scaling down occurs. Minimum: 0. Default: 300 seconds.

`scaleUpStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-up trigger activates before scaling up occurs. Minimum: 0. Default: 0 seconds.

`cloudWatchTrigger`  
The trigger configuration for CloudWatch metrics used in autoscaling decisions. The following fields are available in `cloudWatchTrigger`:  
+ `name` (Optional, String) - Name for the CloudWatch trigger. If not provided, uses the default format: <model-deployment-name>-scaled-object-cloudwatch-trigger.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `namespace` (Required, String) - The CloudWatch namespace for the metric to query.
+ `metricName` (Required, String) - The name of the CloudWatch metric.
+ `dimensions` (Optional, List) - The list of dimensions for the metric. Each dimension includes a name (dimension name - String) and value (dimension value - String).
+ `targetValue` (Required, Float) - The target value for the CloudWatch metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the CloudWatch metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `minValue` (Optional, Float) - The value to use when the CloudWatch query returns no data. Default: 0.
+ `metricCollectionStartTime` (Optional, Integer) - The start time for the metric query, calculated as T-metricCollectionStartTime. Must be greater than or equal to metricCollectionPeriod. Default: 300 seconds.
+ `metricCollectionPeriod` (Optional, Integer) - The duration for the metric query in seconds. Must be a CloudWatch-supported value (1, 5, 10, 30, or a multiple of 60). Default: 300 seconds.
+ `metricStat` (Optional, String) - The statistic type for the CloudWatch query. Default: `Average`.
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil(Metric Value / targetValue)
  + **Value**: Desired replicas = ceil(current replicas × Metric Value / targetValue)
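
As a worked example, the replica arithmetic for the two `metricType` modes can be sketched as follows. This is illustrative only, not the operator's actual implementation; the clamping mirrors the sample configuration above (`targetValue: 10.5`, `maxReplicaCount: 5`).

```
import math

def desired_replicas(metric_value, target_value, current_replicas,
                     metric_type="Average", min_replicas=0, max_replicas=5):
    """Illustrative replica math for the two metricType modes, clamped to
    the configured minReplicaCount/maxReplicaCount bounds."""
    if metric_type == "Average":
        desired = math.ceil(metric_value / target_value)
    elif metric_type == "Value":
        desired = math.ceil(current_replicas * metric_value / target_value)
    else:
        raise ValueError(f"unknown metricType: {metric_type}")
    return max(min_replicas, min(max_replicas, desired))

# With the sample configuration above (targetValue: 10.5, maxReplicaCount: 5)
# and 2 current replicas observing a metric value of 42:
print(desired_replicas(42, 10.5, 2, "Average"))  # 4
print(desired_replicas(42, 10.5, 2, "Value"))    # 5 (ceil(8.0) clamped to maxReplicaCount)
```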

`prometheusTrigger`  
The trigger configuration for Amazon Managed Prometheus (AMP) metrics used in autoscaling decisions. The following fields are available in `prometheusTrigger`:  
+ `name` (Optional, String) - Name for the Prometheus trigger. If not provided, uses the default format: <model-deployment-name>-scaled-object-prometheus-trigger.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `serverAddress` (Required, String) - The address of the AMP server. Must use the format: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>
+ `query` (Required, String) - The PromQL query used for the metric. Must return a scalar value.
+ `targetValue` (Required, Float) - The target value for the Prometheus metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the Prometheus metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `namespace` (Optional, String) - The namespace to use for namespaced queries. Default: empty string (`""`).
+ `customHeaders` (Optional, String) - Custom headers to include when querying the Prometheus endpoint. Default: empty string ("").
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil(Metric Value / targetValue)
  + **Value**: Desired replicas = ceil(current replicas × Metric Value / targetValue)

## Using KEDA ScaledObject YAML definitions through kubectl
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl"></a>

In addition to configuring autoscaling through the autoScalingSpec section in your deployment YAML, you can create and apply standalone KEDA `ScaledObject` YAML definitions using kubectl.

This approach provides greater flexibility for complex scaling scenarios and allows you to manage autoscaling policies independently from your model deployments. KEDA `ScaledObject` configurations support a [wide range of scaling triggers](https://keda.sh/docs/2.17/scalers/) including CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics like CPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification.

**Note**  
Ensure that the KEDA operator role provided during the HyperPod inference operator installation has adequate permissions to query the metrics defined in the scaled object triggers.

### CloudWatch metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cw"></a>

The following KEDA YAML policy uses CloudWatch metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the number of invocations for a SageMaker endpoint and scales the number of deployment pods accordingly. The complete list of parameters supported by KEDA for the `aws-cloudwatch` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-cloudwatch/](https://keda.sh/docs/2.17/scalers/aws-cloudwatch/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-cloudwatch
    metadata:
      namespace: AWS/SageMaker
      metricName: Invocations
      targetMetricValue: "1"
      minMetricValue: "1"
      awsRegion: "us-west-2"
      dimensionName: EndpointName;VariantName
      dimensionValue: $ENDPOINT_NAME;$VARIANT_NAME
      metricStatPeriod: "30" # seconds
      metricStat: "Sum"
      identityOwner: operator
```

### Amazon SQS metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sqs"></a>

The following KEDA YAML policy uses Amazon SQS metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the approximate number of messages in an Amazon SQS queue and scales the number of deployment pods accordingly. The complete list of parameters supported by KEDA for the `aws-sqs-queue` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-sqs/](https://keda.sh/docs/2.17/scalers/aws-sqs/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/QueueName
      queueLength: "5"  # Default: "5"
      awsRegion: "eu-west-1"
      scaleOnInFlight: true
      identityOwner: operator
```

### Prometheus metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-prometheus"></a>

The following KEDA YAML policy uses Prometheus metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the HTTP request rate for the deployment and scales the number of deployment pods accordingly. The complete list of parameters supported by KEDA for the `prometheus` trigger can be found at [https://keda.sh/docs/2.17/scalers/prometheus/](https://keda.sh/docs/2.17/scalers/prometheus/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: avg(rate(http_requests_total{deployment="$DEPLOYMENT_NAME"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      namespace: example-namespace  # for namespaced queries, eg. Thanos
      customHeaders: X-Client-Id=cid,X-Tenant-Id=tid,X-Organization-Id=oid # Optional. Custom headers to include in query. In case of auth header, use the custom authentication or relevant authModes.
      unsafeSsl: "false" #  Default is `false`, Used for skipping certificate check when having self-signed certs for Prometheus endpoint    
      timeout: 1000 # Custom timeout for the HTTP client used in this scaler
      identityOwner: operator
```

### CPU metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cpu"></a>

The following KEDA YAML policy uses the CPU metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on the CPU utilization of the pods. The complete list of parameters supported by KEDA for the `cpu` trigger can be found at [https://keda.sh/docs/2.17/scalers/cpu/](https://keda.sh/docs/2.17/scalers/cpu/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: cpu
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container
```

### Memory metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-memory"></a>

The following KEDA YAML policy uses the memory metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on the memory utilization of the pods. The complete list of parameters supported by KEDA for the `memory` trigger can be found at [https://keda.sh/docs/2.17/scalers/memory/](https://keda.sh/docs/2.17/scalers/memory/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: memory
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container in a pod
```

## Sample Prometheus policy for scaling down to 0 pods
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sample"></a>

The following KEDA YAML policy uses a Prometheus metrics query as a trigger to perform autoscaling on a Kubernetes deployment. This policy uses a `minReplicaCount` of 0, which enables KEDA to scale the deployment down to 0 pods. When `minReplicaCount` is set to 0, you must provide an activation criterion to bring up the first pod after the pods scale down to 0. For the Prometheus trigger, this value is provided by `activationThreshold`. For the SQS queue, it comes from `activationQueueLength`.

**Note**  
While using `minReplicaCount` of 0, make sure the activation does not depend on a metric that is being generated by the pods. When the pods scale down to 0, that metric will never be generated and the pods will not scale up again.

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 0 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  cooldownPeriod:  30
  initialCooldownPeriod:  180 # time before scaling down the pods after initial deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      activationThreshold: '5.5' # Required if minReplicaCount is 0 for initial scaling
      namespace: example-namespace
      timeout: 1000
      identityOwner: operator
```

**Note**  
The CPU and Memory triggers can scale to 0 only when you define at least one additional scaler that is not CPU or Memory (for example, SQS + CPU, or Prometheus + CPU).

# Implementing inference observability on HyperPod clusters
<a name="sagemaker-hyperpod-model-deployment-observability"></a>

Amazon SageMaker HyperPod provides comprehensive inference observability capabilities that enable data scientists and machine learning engineers to monitor and optimize their deployed models. This solution is enabled through SageMaker HyperPod Observability and automatically collects performance metrics for inference workloads, delivering production-ready monitoring through integrated [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/oss/) dashboards.

With metrics enabled by default, the platform captures essential model performance data including invocation latency, concurrent requests, error rates, and token-level metrics, while providing standard Prometheus endpoints for customers who prefer to implement custom observability solutions.

**Note**  
This topic contains a deep dive into implementing inference observability on HyperPod clusters. For a more general reference, see [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md).

This guide provides step-by-step instructions for implementing and using inference observability on your HyperPod clusters. You'll learn how to configure metrics in your deployment YAML files, access monitoring dashboards based on your role (administrator, data scientist, or machine learning engineer), integrate with custom observability solutions using Prometheus endpoints, and troubleshoot common monitoring issues.

## Supported inference metrics
<a name="sagemaker-hyperpod-model-deployment-observability-metrics"></a>

**Invocation metrics**

These metrics capture model inference request and response data, providing universal visibility regardless of your model type or serving framework. When inference metrics are enabled, these metrics are calculated at invocation time and exported to your monitoring infrastructure.
+ `model_invocations_total` - Total number of invocation requests to the model 
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent model requests
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds

**Model container metrics**

These metrics provide insights into the internal operations of your model containers, including token processing, queue management, and framework-specific performance indicators. The metrics available depend on your model serving framework:
+ [TGI container metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics) 
+ [LMI container metrics](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md) 

**Metric dimensions**

All inference metrics include comprehensive labels that enable detailed filtering and analysis across your deployments:
+ **Cluster Identity:**
  + `cluster_id` - The unique ID of the HyperPod cluster
  + `cluster_name` - The name of the HyperPod cluster
+ **Resource Identity:**
  + `resource_name` - Deployment name (For example, "jumpstart-model-deployment")
  + `resource_type` - Type of deployment (jumpstart, inference-endpoint)
  + `namespace` - Kubernetes namespace for multi-tenancy
+ **Model Characteristics:**
  + `model_name` - Specific model identifier (For example, "llama-2-7b-chat")
  + `model_version` - Model version for A/B testing and rollbacks
  + `model_container_type` - Serving framework (TGI, LMI, -)
+ **Infrastructure Context:**
  + `pod_name` - Individual pod identifier for debugging
  + `node_name` - Kubernetes node for resource correlation
  + `instance_type` - EC2 instance type for cost analysis
+ **Operational Context:**
  + `metric_source` - Collection point (reverse-proxy, model-container)
  + `task_type` - Workload classification (inference)
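
As an illustration, the metric names and dimensions above can be combined in PromQL queries such as the following. The label values shown are hypothetical; substitute the names of your own deployments.

```
# Per-model invocation rate over the last 5 minutes
sum(rate(model_invocations_total[5m])) by (model_name)

# Error ratio for a single deployment
sum(rate(model_errors_total{resource_name="jumpstart-model-deployment"}[5m]))
  / sum(rate(model_invocations_total{resource_name="jumpstart-model-deployment"}[5m]))

# Current concurrent requests broken down by instance type
sum(model_concurrent_requests) by (instance_type)
```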

## Configure metrics in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-observability-yaml"></a>

Amazon SageMaker HyperPod enables inference metrics by default for all model deployments, providing immediate observability without additional configuration. You can customize metrics behavior by modifying the deployment YAML configuration to enable or disable metrics collection based on your specific requirements.

**Deploy a model from JumpStart**

Use the following YAML configuration to deploy a JumpStart model with metrics enabled:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)
  replicas: 2
  sageMakerEndpoint:
    name: "mistral-model-sm-endpoint"
  server:
    instanceType: "ml.g5.12xlarge"
    executionRole: "arn:aws:iam::123456789:role/SagemakerRole"
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/mistral-model/certs/
```

**Deploy custom and fine-tuned models from Amazon S3 or Amazon FSx**

Configure custom inference endpoints with detailed metrics settings using the following YAML:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
        port: 8000 # Optional: defaults to the worker modelInvocationPort.containerPort value in the InferenceEndpointConfig spec (8080)
        path: "/custom-metrics" # Optional: if overriding the default "/metrics"
  endpointName: deepseek-sm-endpoint
  instanceType: ml.g5.12xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: model-weights
      region: us-west-2
    modelLocation: deepseek
    prefetchEnabled: true
  invocationEndpoint: invocations
  worker:
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
        cpu: 25600m
        memory: 102Gi
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
    modelInvocationPort:
      containerPort: 8080
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables: ...
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/inferenceendpoint-deepseeks4/certs/
```

**Note**  
To disable metrics for specific deployments, set `metrics.enabled: false` in your YAML configuration.

## Monitor and troubleshoot inference workloads by role
<a name="sagemaker-hyperpod-model-deployment-observability-role"></a>

Amazon SageMaker HyperPod provides comprehensive observability capabilities that support different user workflows, from initial cluster setup to advanced performance troubleshooting. Use the following guidance based on your role and monitoring requirements.

**HyperPod admin**

**Your responsibility:** Enable observability infrastructure and ensure system health across the entire cluster.

**What you need to know:**
+ Cluster-wide observability provides infrastructure metrics for all workloads
+ One-click setup deploys monitoring stack with pre-configured dashboards
+ Infrastructure metrics are separate from model-specific inference metrics

**What you need to do:**

1. Navigate to the HyperPod console.

1. Select your cluster.

1. Go to the details page of the cluster you just selected. You will see a new option to install the HyperPod observability add-on.

1. Choose the **Quick install** option. After 1-2 minutes, all of the steps complete and you will see the Grafana dashboard and Prometheus workspace details.

This single action automatically deploys the EKS Add-on, configures observability operators, and provisions pre-built dashboards in Grafana.

**Data scientist**

**Your responsibility:** Deploy models efficiently and monitor their basic performance.

**What you need to know:**
+ Metrics are automatically enabled when you deploy models
+ Grafana dashboards provide immediate visibility into model performance
+ You can filter dashboards to focus on your specific deployments

**What you need to do:**

1. Deploy your model using your preferred method:

   1. Amazon SageMaker Studio UI

   1. HyperPod CLI commands

   1. Python SDK in notebooks

   1. kubectl with YAML configurations

1. Access your model metrics:

   1. Open Amazon SageMaker Studio

   1. Navigate to HyperPod Cluster and open Grafana Dashboard

   1. Select Inference Dashboard

   1. Apply filters to view your specific model deployment

1. Monitor key performance indicators:

   1. Track model latency and throughput

   1. Monitor error rates and availability

   1. Review resource utilization trends

After this is complete, you'll have immediate visibility into your model's performance without additional configuration, enabling quick identification of deployment issues or performance changes.

**Machine learning engineer (MLE)**

**Your responsibility:** Maintain production model performance and resolve complex performance issues.

**What you need to know:**
+ Advanced metrics include model container details like queue depths and token metrics
+ Correlation analysis across multiple metric types reveals root causes
+ Auto-scaling configurations directly impact performance during traffic spikes

**Hypothetical scenario:** A customer's chat model experiences intermittent slow responses. Users are complaining about 5-10 second delays. The MLE can leverage inference observability for systematic performance investigation.

**What you need to do:**

1. Examine the Grafana dashboard to understand the scope and severity of the performance issue:

   1. High latency alert active since 09:30

   1. P99 latency: 8.2s (normal: 2.1s)

   1. Affected time window: 09:30-10:15 (45 minutes)

1. Correlate multiple metrics to understand the system behavior during the incident:

   1. Concurrent requests: Spiked to 45 (normal: 15-20)

   1. Pod scaling: KEDA scaled 2→5 pods during incident

   1. GPU utilization: Remained normal (85-90%)

   1. Memory usage: Normal (24GB/32GB)

1. Examine the distributed system behavior since the infrastructure metrics appear normal:

   1. Node-level view: All pods concentrated on same node (poor distribution)

   1. Model container metrics: TGI queue depth shows 127 requests (normal: 5-10)

   ```
   Available in Grafana dashboard under "Model Container Metrics" panel
   Metric: tgi_queue_size{resource_name="customer-chat-llama"}
   Current value: 127 requests queued (indicates backlog)
   ```

1. Identify interconnected configuration issues:

   1. KEDA scaling policy: Too slow (30s polling interval)

   1. Scaling timeline: Scaling response lagged behind the traffic spike by 45+ seconds

1. Implement targeted fixes based on the analysis:

   1. Updated KEDA polling interval: 30s → 15s

   1. Increased maxReplicas in scaling configuration

   1. Adjusted scaling thresholds to scale earlier (15 vs 20 concurrent requests)
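Applied together, the fixes above could be expressed in a KEDA `ScaledObject` along the following lines. This is a sketch for the hypothetical scenario: the deployment name, Prometheus address, and metric query are illustrative assumptions, not values from a real cluster.

   ```
   apiVersion: keda.sh/v1alpha1
   kind: ScaledObject
   metadata:
     name: customer-chat-llama-scaler
     namespace: default
   spec:
     scaleTargetRef:
       name: customer-chat-llama            # the model serving Deployment
     pollingInterval: 15                    # reduced from 30s for faster reaction
     minReplicaCount: 2
     maxReplicaCount: 10                    # raised so scaling is no longer capped at 5
     triggers:
       - type: prometheus
         metadata:
           serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
           query: sum(model_concurrent_requests{resource_name="customer-chat-llama"})
           threshold: "15"                  # scale earlier: 15 instead of 20 concurrent requests
   ```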

You can systematically diagnose complex performance issues using comprehensive metrics, implement targeted fixes, and establish preventive measures to maintain consistent production model performance.

## Implement your own observability integration
<a name="sagemaker-hyperpod-model-deployment-observability-diy"></a>

Amazon SageMaker HyperPod exposes inference metrics through industry-standard Prometheus endpoints, enabling integration with your existing observability infrastructure. Use this approach when you prefer to implement custom monitoring solutions or integrate with third-party observability platforms instead of using the built-in Grafana and Prometheus stack.

**Access inference metrics endpoints**

**What you need to know:**
+ Inference metrics are automatically exposed on standardized Prometheus endpoints
+ Metrics are available regardless of your model type or serving framework
+ Standard Prometheus scraping practices apply for data collection

**Inference metrics endpoint configuration:**
+ **Port:** 9113
+ **Path:** /metrics
+ **Full endpoint:** http://pod-ip:9113/metrics

**Available inference metrics:**
+ `model_invocations_total` - Total number of invocation requests to the model
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent requests per model
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds

**Access model container metrics**

**What you need to know:**
+ Model containers expose additional metrics specific to their serving framework
+ These metrics provide internal container insights like token processing and queue depths
+ Endpoint configuration varies by model container type

**For JumpStart model deployments using Text Generation Inference (TGI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /metrics
+ **Documentation:** [https://huggingface.co/docs/text-generation-inference/en/reference/metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)

**For JumpStart model deployments using Large Model Inference (LMI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /server/metrics
+ **Documentation:** [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md)

**For custom inference endpoints (BYOD):**
+ **Port:** Customer-configured (defaults to the `WorkerConfig.ModelInvocationPort.ContainerPort` value in the `InferenceEndpointConfig` spec, typically 8080)
+ **Path:** Customer-configured (default /metrics)

**Implement custom observability integration**

With a custom observability integration, you're responsible for:

1. **Metrics Scraping:** Implement Prometheus-compatible scraping from the endpoints above

1. **Data Export:** Configure export to your chosen observability platform

1. **Alerting:** Set up alerting rules based on your operational requirements

1. **Dashboards:** Create visualization dashboards for your monitoring needs
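As a starting point for the scraping step, a self-managed Prometheus instance could collect the inference metrics from port 9113 with a configuration along these lines. The namespace filter is an assumption; adjust it to the namespace where your models are deployed.

```
scrape_configs:
  - job_name: hyperpod-inference
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods in the namespace where models are deployed (adjust as needed)
      - source_labels: [__meta_kubernetes_namespace]
        regex: default
        action: keep
      # Rewrite the scrape target to the inference metrics port (9113)
      - source_labels: [__address__]
        regex: '(.+):\d+'
        replacement: '${1}:9113'
        target_label: __address__
```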

## Troubleshoot inference observability issues
<a name="sagemaker-hyperpod-model-deployment-observability-troubleshoot"></a>

**The dashboard shows no data**

If the Grafana dashboard is empty and all panels show "No data," perform the following steps to investigate:

1. Verify Administrator has inference observability installed:

   1. Navigate to HyperPod Console > Select cluster > Check if "Observability" status shows "Enabled"

   1. Verify Grafana workspace link is accessible from cluster overview

   1. Confirm Amazon Managed Prometheus workspace is configured and receiving data

1. Verify HyperPod Observability is enabled:

   ```
   hyp observability view      
   ```

1. Verify model metrics are enabled:

   ```
   kubectl get jumpstartmodel -n <namespace> customer-chat-llama -o jsonpath='{.status.metricsStatus}'
   # Expected: enabled: true, state: Enabled
   ```

1. Check the metrics endpoint:

   ```
   kubectl port-forward pod/customer-chat-llama-xxx 9113:9113
   curl localhost:9113/metrics | grep model_invocations_total
   # Expected: model_invocations_total{...} metrics
   ```

1. Check the logs:

   ```
   # Model container
   kubectl logs customer-chat-llama-xxx -c customer-chat-llama
   # Look for: OOM errors, CUDA errors, model loading failures

   # Proxy/sidecar
   kubectl logs customer-chat-llama-xxx -c sidecar-reverse-proxy
   # Look for: DNS resolution issues, upstream connection failures

   # Metrics exporter sidecar
   kubectl logs customer-chat-llama-xxx -c otel-collector
   # Look for: metrics collection issues, export failures
   ```

**Other common issues**


| Issue | Solution | Action | 
| --- | --- | --- | 
|  Inference observability is not installed  |  Install inference observability through the console  |  "Enable Observability" in HyperPod console  | 
|  Metrics disabled in model  |  Update model configuration  |  Add `metrics: {enabled: true}` to model spec  | 
|  AMP workspace not configured  |  Fix data source connection  |  Verify AMP workspace ID in Grafana data sources  | 
|  Network connectivity  |  Check security groups/NACLs  |  Ensure pods can reach AMP endpoints  | 

# Task governance for model deployment on HyperPod
<a name="sagemaker-hyperpod-model-deployment-task-gov"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for real-time inference workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your inference workloads get the GPU resources they need during traffic spikes while maintaining fair allocation across your teams' training, evaluation, and testing activities. For more general information on task governance, see [SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md).

## How inference workload management works
<a name="sagemaker-hyperpod-model-deployment-task-gov-how"></a>

To effectively manage real-time inference traffic spikes in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for inference workloads with high weights (such as 100) to ensure inference pods are admitted and scheduled before other task types. This configuration enables inference workloads to preempt lower-priority jobs during cluster load, which is critical for maintaining low-latency requirements during traffic surges.
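A minimal sketch of such a priority class, using Kueue's `WorkloadPriorityClass` API; the name `inference-priority` is illustrative:

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: inference-priority
value: 100                    # high weight so inference pods are admitted first
description: "Real-time inference workloads"
```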

**Quota sizing and allocation**

Reserve sufficient GPU resources in your team's `ClusterQueue` to handle expected inference spikes. During periods of low inference traffic, unused quota resources can be temporarily allocated to other teams' tasks. When inference demand increases, these borrowed resources can be reclaimed to prioritize pending inference pods. For more information, see [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/).

**Resource sharing strategies**

Choose between two quota sharing approaches based on your requirements:

1. **Strict Resource Control:** Disable quota lending and borrowing to guarantee reserved GPU capacity is always available for your workloads. This approach requires sizing quotas large enough to independently handle peak demand and may result in idle nodes during low-traffic periods.

1. **Flexible Resource Sharing:** Enable quota borrowing to utilize idle resources from other teams when needed. Borrowed pods are marked as preemptible and may be evicted if the lending team reclaims capacity.

**Intra-team preemption**

Enable intra-team preemption when running mixed workloads (evaluation, training, and inference) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority inference pods, ensuring real-time inference can run without depending on external quota borrowing. For more information, see [Preemption](https://kueue.sigs.k8s.io/docs/concepts/preemption/).

## Sample inference workload setup
<a name="sagemaker-hyperpod-model-deployment-task-gov-example"></a>

The following example shows how Kueue manages GPU resources in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**  
Your cluster has the following configuration:
+ **Team A**: 10 P4 GPU quota
+ **Team B**: 20 P4 GPU quota
+ **Static provisioning**: No autoscaling
+ **Total capacity**: 30 P4 GPUs

The shared GPU pool uses this priority policy:

1. **Real-time inference**: Priority 100

1. **Training**: Priority 75

1. **Evaluation**: Priority 50

Kueue enforces team quotas and priority classes, with preemption and quota borrowing enabled.

**Initial state: Normal cluster utilization**  
In normal operations:
+ Team A runs training and evaluation jobs on all 10 P4 GPUs
+ Team B runs real-time inference (10 P4s) and evaluation (10 P4s) within its 20 GPU quota
+ The cluster is fully utilized with all jobs admitted and running

**Inference spike: Team B requires additional GPUs**  
When Team B experiences a traffic spike, additional inference pods require 5 more P4 GPUs. Kueue detects that the new pods are:
+ Within Team B's namespace
+ Priority 100 (real-time inference)
+ Pending admission due to quota constraints

**Kueue's response process chooses between two options:**  
**Option 1: Quota borrowing** - If Team A uses only 6 of its 10 P4s, Kueue can admit Team B's pods using the idle 4 P4s. However, these borrowed resources are preemptible—if Team A submits jobs to reach its full quota, Kueue evicts Team B's borrowed inference pods.

**Option 2: Self-preemption (Recommended)** - Team B runs low-priority evaluation jobs (priority 50). When high-priority inference pods are waiting, Kueue preempts the evaluation jobs within Team B's quota and admits the inference pods. This approach provides safe resource allocation with no external eviction risk.

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team B have unused quota?
   + Yes → Admit the pods
   + No → Proceed to Step 2

1. **Self-preemption within Team B**

   Question: Can lower-priority Team B jobs be preempted?
   + Yes → Preempt evaluation jobs (priority 50), free 5 P4s, and admit inference pods
   + No → Proceed to Step 3

   This approach keeps workloads within Team B's guaranteed quota, avoiding external eviction risks.

1. **Borrowing from other teams**

   Question: Is there idle, borrowable quota from other teams?
   + Yes → Admit using borrowed quota (marked as preemptible)
   + No → Pod remains in `NotAdmitted` state
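A hedged sketch of a Kueue `ClusterQueue` for Team B that enables this three-step behavior; the cohort and resource flavor names are illustrative, not values from a real cluster:

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b
spec:
  cohort: shared-gpu-pool               # teams in the same cohort can lend and borrow quota
  preemption:
    withinClusterQueue: LowerPriority   # Step 2: preempt Team B's own lower-priority jobs
    reclaimWithinCohort: Any            # reclaim quota lent to other teams when needed
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: p4-gpu
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 20          # Team B's guaranteed quota
              borrowingLimit: 10        # Step 3: may borrow up to 10 idle GPUs from the cohort
```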

# HyperPod inference troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts"></a>

This troubleshooting guide addresses common issues that can occur during Amazon SageMaker HyperPod inference deployment and operation. These problems typically involve VPC networking configuration, IAM permissions, Kubernetes resource management, and operator connectivity, any of which can cause model deployments to fail or remain in a pending state.

This troubleshooting guide uses the following terminology: **Troubleshooting steps** are diagnostic procedures to identify and investigate problems, **Resolution** provides the specific actions to fix identified issues, and **Verification** confirms that the solution worked correctly.

**Topics**
+ [Inference operator installation failures through SageMaker AI console](sagemaker-hyperpod-model-deployment-ts-console-cfn-failures.md)
+ [Inference operator installation failures through AWS CLI](sagemaker-hyperpod-model-deployment-ts-cli.md)
+ [Certificate download timeout](sagemaker-hyperpod-model-deployment-ts-certificate.md)
+ [Model deployment issues](sagemaker-hyperpod-model-deployment-ts-deployment-issues.md)
+ [VPC ENI permission issue](sagemaker-hyperpod-model-deployment-ts-permissions.md)
+ [IAM trust relationship issue](sagemaker-hyperpod-model-deployment-ts-trust.md)
+ [Missing NVIDIA GPU plugin error](sagemaker-hyperpod-model-deployment-ts-gpu.md)
+ [Inference operator fails to start](sagemaker-hyperpod-model-deployment-ts-startup.md)

# Inference operator installation failures through SageMaker AI console
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-failures"></a>

**Overview:** When installing the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stacks may fail due to various issues. This section covers common failure scenarios and their resolutions.

## Inference operator add-on installation failure through Quick or Custom install
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-stack-failed"></a>

**Problem:** The HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

**Common causes:**
+ Pod capacity limits exceeded on cluster nodes. The inference operator installation requires a minimum of 13 pods. The minimum recommended instance type is `ml.c5.4xlarge`.
+ IAM permission issues
+ Resource quota constraints
+ Network or VPC configuration problems

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-symptoms"></a>

**Symptoms:**
+ Inference operator add-on shows CREATE_FAILED or DEGRADED status in the console
+ CloudFormation stack associated with the add-on is in CREATE_FAILED state
+ Installation progress stops or shows error messages

**Diagnostic steps:**

1. Check the inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

1. Check for pod limit issues:

   ```
   # Check current pod count per node
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'
   
   # Check pods running on each node
   kubectl get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c
   
   # Check for pod evictions or failures
   kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
   ```

1. Check CloudFormation stack status (if using console installation):

   ```
   # List CloudFormation stacks related to the cluster
   aws cloudformation list-stacks \
       --region $REGION \
       --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
       --output table
   
   # Get detailed stack events
   aws cloudformation describe-stack-events \
       --stack-name <stack-name> \
       --region $REGION \
       --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
       --output table
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-resolution"></a>

To resolve the installation failure, save the current configuration, delete the failed add-on, fix the underlying issue, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

**Step 1: Save the current configuration**
+ Extract and save the add-on configuration before deletion:

  ```
  # Save the current configuration
  aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query 'addon.configurationValues' \
      --output text > addon-config-backup.json
  
  # Verify the configuration was saved
  cat addon-config-backup.json
  
  # Pretty print for readability
  cat addon-config-backup.json | jq '.'
  ```

**Step 2: Delete the failed add-on**
+ Delete the inference operator add-on:

  ```
  aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION
  
  # Wait for deletion to complete
  echo "Waiting for add-on deletion..."
  aws eks wait addon-deleted \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || sleep 60
  ```

**Step 3: Fix the underlying issue**

Choose the appropriate resolution based on the failure cause:

If the issue is pod limit exceeded:

```
# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add instance group with higher pod capacity
# Different instance types support different maximum pod counts
# For example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods)
aws sagemaker update-cluster \
    --cluster-name $HYPERPOD_CLUSTER_NAME \
    --region $REGION \
    --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale existing node group to add more nodes
aws eks update-nodegroup-config \
    --cluster-name $EKS_CLUSTER_NAME \
    --nodegroup-name <nodegroup-name> \
    --scaling-config minSize=2,maxSize=10,desiredSize=5 \
    --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
```

**Step 4: Reinstall the inference operator**

After fixing the underlying issue, reinstall the inference operator using one of the following methods:
+ **SageMaker AI console with Custom Install (recommended):** Reuse existing IAM roles and TLS bucket from your previous installation. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).
+ **AWS CLI with saved configuration:** Use the configuration you backed up in Step 1 to reinstall the add-on. For the full CLI installation steps, see [Method 2: Installing the Inference Operator using the AWS CLI](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-addon).

  ```
  aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config-backup.json \
      --region $REGION
  ```
+ **SageMaker AI console with Quick Install:** Creates new IAM roles, TLS bucket, and dependency add-ons automatically. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).

**Step 5: Verify successful installation**

```
# Check add-on status
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health}" \
    --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
```

## Cert-manager installation failed due to Kueue webhook not ready
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-webhook-race"></a>

**Problem:** The cert-manager add-on installation fails with a webhook error because the Task Governance (Kueue) webhook service has no available endpoints. This is a race condition that occurs when cert-manager tries to create resources before the Task Governance webhook pods are fully running. This can happen when Task Governance add-on is being installed along with the Inference operator during cluster creation.

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-symptoms"></a>

**Error message:**

```
AdmissionRequestDenied
Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: 
Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": 
no endpoints available for service "kueue-webhook-service"
```

**Root cause:**
+ Task Governance add-on installs and registers a mutating webhook that intercepts all Deployment creations
+ Cert-manager add-on tries to create Deployment resources before Task Governance webhook pods are ready
+ Kubernetes admission control calls the Task Governance webhook, but it has no endpoints (pods not running yet)

**Diagnostic step:**

1. Check cert-manager add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-resolution"></a>

**Solution: Delete and reinstall cert-manager**

The Task Governance webhook becomes ready within 60 seconds. Simply delete and reinstall the cert-manager add-on:

1. Delete the failed cert-manager add-on:

   ```
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

1. Wait 30-60 seconds for the Task Governance webhook to become ready, then reinstall the cert-manager add-on:

   ```
   sleep 60
   
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

# Inference operator installation failures through AWS CLI
<a name="sagemaker-hyperpod-model-deployment-ts-cli"></a>

**Overview:** When installing the inference operator through the AWS CLI, add-on installation may fail due to missing dependencies. This section covers common CLI installation failure scenarios and their resolutions.

## Inference add-on installation failed due to missing CSI drivers
<a name="sagemaker-hyperpod-model-deployment-ts-missing-csi-drivers"></a>

**Problem:** The inference operator add-on creation fails because required CSI driver dependencies are not installed on the EKS cluster.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
S3 CSI driver not installed (missing CSIDriver s3.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.

FSx CSI driver not installed (missing CSIDriver fsx.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if CSI drivers are installed:

   ```
   # Check for S3 CSI driver
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   
   # Check for FSx CSI driver  
   kubectl get csidriver fsx.csi.aws.com
   kubectl get pods -n kube-system | grep fsx
   ```

1. Check EKS add-on status:

   ```
   # List all add-ons
   aws eks list-addons --cluster-name $EKS_CLUSTER_NAME --region $REGION
   
   # Check specific CSI driver add-ons
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION 2>/dev/null || echo "S3 CSI driver not installed"
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION 2>/dev/null || echo "FSx CSI driver not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install missing S3 CSI driver**

1. Create IAM role for S3 CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "S3 CSI Role ARN: $S3_CSI_ROLE_ARN"
   ```

1. Install S3 CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --addon-version v1.14.1-eksbuild.1 \
       --service-account-role-arn $S3_CSI_ROLE_ARN \
       --region $REGION
   ```

1. Verify S3 CSI driver installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
   
   # Verify CSI driver is available
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   ```

**Step 2: Install missing FSx CSI driver**

1. Create IAM role for FSx CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export FSX_CSI_ROLE_ARN=$(aws iam get-role --role-name $FSX_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "FSx CSI Role ARN: $FSX_CSI_ROLE_ARN"
   ```

1. Install FSx CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --addon-version v1.6.0-eksbuild.1 \
       --service-account-role-arn $FSX_CSI_ROLE_ARN \
       --region $REGION
   
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
   
   # Verify FSx CSI driver is running
   kubectl get pods -n kube-system | grep fsx
   ```

**Step 3: Verify all dependencies**

After installing the missing dependencies, verify they are running correctly before retrying the inference operator installation:

```
# Check all required add-ons are active
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify all pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

## Inference Custom Resource Definitions are missing during model deployment
<a name="sagemaker-hyperpod-model-deployment-ts-crd-not-exist"></a>

**Problem:** Custom Resource Definitions (CRDs) are missing when you attempt to create model deployments. This issue occurs when you previously installed and deleted the inference add-on without cleaning up model deployments that have finalizers.

**Symptoms and diagnosis:**

**Root cause:**

If you delete the inference add-on without first removing all model deployments, custom resources with finalizers remain in the cluster. These finalizers must complete before you can delete the CRDs. The add-on deletion process doesn't wait for CRD deletion to complete, which causes the CRDs to remain in a terminating state and prevents new installations.

**To diagnose this issue**

1. Check whether CRDs exist.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

1. Check for stuck custom resources.

   ```
   # Check for JumpStartModel resources
   kubectl get jumpstartmodels -A
   
   # Check for InferenceEndpointConfig resources
   kubectl get inferenceendpointconfigs -A
   ```

1. Inspect finalizers on stuck resources.

   ```
   # Example for a specific JumpStartModel
   kubectl get jumpstartmodels <model-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   
   # Example for a specific InferenceEndpointConfig
   kubectl get inferenceendpointconfigs <config-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   ```

**Resolution:**

Manually remove the finalizers from all model deployments that weren't deleted when you removed the inference add-on. Complete the following steps for each stuck custom resource.

**To remove finalizers from JumpStartModel resources**

1. List all JumpStartModel resources across all namespaces.

   ```
   kubectl get jumpstartmodels -A
   ```

1. For each JumpStartModel resource, remove the finalizers by patching the resource to set metadata.finalizers to an empty array.

   ```
   kubectl patch jumpstartmodels <model-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named kv-l1-only.

   ```
   kubectl patch jumpstartmodels kv-l1-only -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the model instance is deleted.

   ```
   kubectl get jumpstartmodels -A
   ```

   When all resources are cleaned up, you should see the following output.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=jumpstartmodels": the server could not find the requested resource (get jumpstartmodels.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the JumpStartModel CRD is removed.

   ```
   kubectl get crd | grep jumpstartmodels.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To remove finalizers from InferenceEndpointConfig resources**

1. List all InferenceEndpointConfig resources across all namespaces.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

1. For each InferenceEndpointConfig resource, remove the finalizers.

   ```
   kubectl patch inferenceendpointconfigs <config-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named my-inference-config.

   ```
   kubectl patch inferenceendpointconfigs my-inference-config -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the config instance is deleted.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

   When all resources are cleaned up, you should see the following output.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=inferenceendpointconfigs": the server could not find the requested resource (get inferenceendpointconfigs.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the InferenceEndpointConfig CRD is removed.

   ```
   kubectl get crd | grep inferenceendpointconfigs.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To reinstall the inference add-on**

After you clean up all stuck resources and verify that the CRDs are removed, reinstall the inference add-on. For more information, see [Installing the Inference Operator with EKS add-on](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon).

**Verification:**

1. Verify that the inference add-on is successfully installed.

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   ```

   The Status should be ACTIVE and the Health should be HEALTHY.

1. Verify that CRDs are properly installed.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

   You should see the inference-related CRDs listed in the output.

1. Test creating a new model deployment to confirm that the issue is resolved.

   ```
   # Create a test deployment using your preferred method
   kubectl apply -f <your-model-deployment.yaml>
   ```

**Prevention:**

To prevent this issue, complete the following steps before you uninstall the inference add-on.

1. Delete all model deployments.

   ```
   # Delete all JumpStartModel resources
   kubectl delete jumpstartmodels --all -A
   
   # Delete all InferenceEndpointConfig resources
   kubectl delete inferenceendpointconfigs --all -A
   
   # Wait for all resources to be fully deleted
   kubectl get jumpstartmodels -A
   kubectl get inferenceendpointconfigs -A
   ```

1. Verify that all custom resources are deleted.

1. After you confirm that all resources are cleaned up, delete the inference add-on.

## Inference add-on installation failed due to missing cert-manager
<a name="sagemaker-hyperpod-model-deployment-ts-missing-cert-manager"></a>

**Problem:** The inference operator add-on creation fails because the cert-manager EKS Add-On is not installed, resulting in missing Custom Resource Definitions (CRDs).

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
Missing required CRD: certificaterequests.cert-manager.io. 
The cert-manager add-on is not installed. Please install cert-manager and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if cert-manager is installed:

   ```
   # Check for cert-manager CRDs
   kubectl get crd | grep cert-manager
   kubectl get pods -n cert-manager
   
   # Check EKS add-on status
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION 2>/dev/null || echo "Cert-manager not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install cert-manager add-on**

1. Install the cert-manager EKS add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --addon-version v1.18.2-eksbuild.2 \
       --region $REGION
   ```

1. Verify cert-manager installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION
   
   # Verify cert-manager pods are running
   kubectl get pods -n cert-manager
   
   # Verify CRDs are installed
   kubectl get crd | grep cert-manager | wc -l
   # Expected: Should show multiple cert-manager CRDs
   ```

**Step 2: Retry inference operator installation**

1. After cert-manager is installed, retry the inference operator installation:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall the inference operator add-on
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation:

   ```
   # Check installation status
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify inference operator pods are running
   kubectl get pods -n hyperpod-inference-system
   ```

## Inference add-on installation failed due to missing ALB Controller
<a name="sagemaker-hyperpod-model-deployment-ts-missing-alb"></a>

**Problem:** The inference operator add-on creation fails because the AWS Load Balancer Controller is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
ALB Controller not installed (missing aws-load-balancer-controller pods). 
Please install the Application Load Balancer Controller and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if ALB Controller is installed:

   ```
   # Check for ALB Controller pods
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   
   # Check ALB Controller service account
   kubectl get serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null || echo "ALB Controller service account not found"
   kubectl get serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null || echo "ALB Controller service account not found in inference namespace"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install ALB Controller (Recommended)**
+ Ensure the ALB role is created and properly configured in your add-on configuration:

  ```
  # Verify ALB role exists
  export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "ALB Role ARN: $ALB_ROLE_ARN"
  
  # Update your addon-config.json to enable ALB
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": true,
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing ALB Controller installation**
+ If you already have ALB Controller installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable ALB installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": false
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Step 3: Retry inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify ALB Controller is working:

   ```
   # Check ALB Controller pods
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   
   # Check service account annotations
   kubectl describe serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null
   ```

## Inference add-on installation failed due to missing KEDA operator
<a name="sagemaker-hyperpod-model-deployment-ts-missing-keda"></a>

**Problem:** The inference operator add-on creation fails because the KEDA (Kubernetes Event Driven Autoscaler) operator is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
KEDA operator not installed (missing keda-operator pods). 
KEDA can be installed separately in any namespace or via the Inference addon.
```

**Diagnostic steps:**

1. Check if KEDA operator is installed:

   ```
   # Check for KEDA operator pods in common namespaces
   kubectl get pods -n keda-system | grep keda-operator 2>/dev/null || echo "KEDA not found in keda-system namespace"
   kubectl get pods -n kube-system | grep keda-operator 2>/dev/null || echo "KEDA not found in kube-system namespace"
   kubectl get pods -n hyperpod-inference-system | grep keda-operator 2>/dev/null || echo "KEDA not found in inference namespace"
   
   # Check for KEDA CRDs
   kubectl get crd | grep keda 2>/dev/null || echo "KEDA CRDs not found"
   
   # Check KEDA service account
   kubectl get serviceaccount keda-operator -A 2>/dev/null || echo "KEDA service account not found"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install KEDA (Recommended)**
+ Ensure the KEDA role is created and properly configured in your add-on configuration:

  ```
  # Verify KEDA role exists
  export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "KEDA Role ARN: $KEDA_ROLE_ARN"
  
  # Update your addon-config.json to enable KEDA
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": true,
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing KEDA installation**
+ If you already have KEDA installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable KEDA installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": false
    }
  }
  EOF
  ```
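
A typo or an unset environment variable in the heredoc above can produce a file that `create-addon` rejects, or embed a literal `Role not found` string where an ARN belongs. Before retrying the installation, you can sanity-check the generated file. This is a minimal sketch, assuming `python3` is available; the sample file and its values are illustrative, and you would run `validate_addon_config addon-config.json` against your real file.

```shell
# validate_addon_config FILE -- fail if FILE is malformed JSON, contains an
# empty string or a "not found" fallback value, or lacks a required key.
validate_addon_config() {
  python3 - "$1" << 'EOF'
import json, sys

cfg = json.load(open(sys.argv[1]))  # fails loudly on malformed JSON

def strings(node):
    # Walk the config and yield every string value, however nested.
    if isinstance(node, dict):
        for v in node.values():
            yield from strings(v)
    elif isinstance(node, list):
        for v in node:
            yield from strings(v)
    elif isinstance(node, str):
        yield node

bad = [s for s in strings(cfg) if not s or "not found" in s]
assert not bad, f"fix these values before create-addon: {bad}"
for key in ("executionRoleArn", "tlsCertificateS3Bucket", "hyperpodClusterArn"):
    assert key in cfg, f"missing required key: {key}"
print(f"{sys.argv[1]} OK")
EOF
}

# Demonstration with a throwaway sample file (illustrative ARNs):
cat > /tmp/addon-config-sample.json << 'EOF'
{
  "executionRoleArn": "arn:aws:iam::111122223333:role/hyperpod-inference-execution-role",
  "tlsCertificateS3Bucket": "my-tls-bucket",
  "hyperpodClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123",
  "alb": { "enabled": true },
  "keda": { "enabled": true }
}
EOF
validate_addon_config /tmp/addon-config-sample.json
```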

**Step 3: Retry inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify KEDA is working:

   ```
   # Check KEDA pods
   kubectl get pods -n hyperpod-inference-system | grep keda
   kubectl get pods -n kube-system | grep keda
   kubectl get pods -n keda-system | grep keda 2>/dev/null
   
   # Check KEDA CRDs
   kubectl get crd | grep scaledobjects
   kubectl get crd | grep scaledjobs
   
   # Check KEDA service account annotations
   kubectl describe serviceaccount keda-operator -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n kube-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n keda-system 2>/dev/null
   ```

# Certificate download timeout
<a name="sagemaker-hyperpod-model-deployment-ts-certificate"></a>

When deploying a SageMaker AI endpoint, the creation process fails due to the inability to download the certificate authority (CA) certificate in a VPC environment. For detailed configuration steps, refer to the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

**Error message:**

The following error appears in the SageMaker AI endpoint CloudWatch logs: 

```
Error downloading CA certificate: Connect timeout on endpoint URL: "https://****.s3.<REGION>.amazonaws.com/****/***.pem"
```

**Root cause:**
+ This issue occurs when the inference operator cannot access the self-signed certificate in Amazon S3 within your VPC
+ Proper configuration of the Amazon S3 VPC endpoint is essential for certificate access

**Resolution:**

1. If you don't have an Amazon S3 VPC endpoint:
   + Create an Amazon S3 VPC endpoint following the configuration in section 5.3 of the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

1. If you already have an Amazon S3 VPC endpoint:
   + Ensure that the subnet route table points to the VPC endpoint (if you use a gateway endpoint), or that private DNS is enabled (if you use an interface endpoint).
   + Confirm that your Amazon S3 VPC endpoint matches the configuration described in the endpoint creation step of section 5.3 of the Admin guide.

# Model deployment issues
<a name="sagemaker-hyperpod-model-deployment-ts-deployment-issues"></a>

**Overview:** This section covers common issues that occur during model deployment, including pending states, failed deployments, and monitoring deployment progress.

## Model deployment stuck in pending state
<a name="sagemaker-hyperpod-model-deployment-ts-pending"></a>

When deploying a model, the deployment remains in a "Pending" state for an extended period. This indicates that the inference operator is unable to initiate the model deployment in your HyperPod cluster.

**Components affected:**

During normal deployment, the inference operator should:
+ Deploy model pod
+ Create load balancer
+ Create SageMaker AI endpoint

**Troubleshooting steps:**

1. Check the inference operator pod status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output example:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the inference operator logs for error messages, using the pod name from the previous step.

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

**What to look for:**
+ Error messages in the operator logs
+ Status of the operator pod
+ Any deployment-related warnings or failures

**Note**  
A healthy deployment should progress beyond the "Pending" state within a reasonable time. If issues persist, review the inference operator logs for specific error messages to determine the root cause.

## Model deployment failed state troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts-failed"></a>

When a model deployment enters a "Failed" state, the failure could occur in one of three components:
+ Model pod deployment
+ Load balancer creation
+ SageMaker AI endpoint creation

**Troubleshooting steps:**

1. Check the inference operator status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the operator logs, using the pod name from the previous step:

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

**What to look for:**

The operator logs will indicate which component failed:
+ Model pod deployment failures
+ Load balancer creation issues
+ SageMaker AI endpoint errors

## Checking model deployment progress
<a name="sagemaker-hyperpod-model-deployment-ts-progress"></a>

To monitor the progress of your model deployment and identify potential issues, you can use kubectl commands to check the status of various components. This helps determine whether the deployment is progressing normally or has encountered problems during the model pod creation, load balancer setup, or SageMaker AI endpoint configuration phases.

**Method 1: Check the JumpStart model status**

```
kubectl describe jumpstartmodel.inference.sagemaker.aws.amazon.com/<model-name> -n <namespace>
```

**Key status indicators to monitor:**

1. Deployment Status
   + Look for `Status.State`: Should show `DeploymentComplete`
   + Check `Status.Deployment Status.Available Replicas`
   + Monitor `Status.Conditions` for deployment progress

1. SageMaker AI Endpoint Status
   + Check `Status.Endpoints.Sagemaker.State`: Should show `CreationCompleted`
   + Verify `Status.Endpoints.Sagemaker.Endpoint Arn`

1. TLS Certificate Status
   + View `Status.Tls Certificate` details
   + Check certificate expiration in `Last Cert Expiry Time`

**Method 2: Check the inference endpoint configuration**

```
kubectl describe inferenceendpointconfig.inference.sagemaker.aws.amazon.com/<deployment_name> -n <namespace>
```

**Common status states:**
+ `DeploymentInProgress`: Initial deployment phase
+ `DeploymentComplete`: Successful deployment
+ `Failed`: Deployment failed

**Note**  
Monitor the Events section for any warnings or errors. Check replica count matches expected configuration. Verify all conditions show `Status: True` for a healthy deployment.

# VPC ENI permission issue
<a name="sagemaker-hyperpod-model-deployment-ts-permissions"></a>

SageMaker AI endpoint creation fails due to insufficient permissions for creating network interfaces in VPC.

**Error message:**

```
Please ensure that the execution role for variant AllTraffic has sufficient permissions for creating an endpoint variant within a VPC
```

**Root cause:**

The inference operator's execution role lacks the required Amazon EC2 permission to create network interfaces (ENI) in VPC.

**Resolution:**

Add the following IAM permission to the inference operator's execution role:

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterfacePermission"
     ],
    "Resource": "*"
}
```

**Verification:**

After adding the permission:

1. Delete the failed endpoint (if exists)

1. Retry the endpoint creation

1. Monitor the deployment status for successful completion

**Note**  
This permission is essential for SageMaker AI endpoints running in VPC mode. Ensure the execution role has all other necessary VPC-related permissions as well.

# IAM trust relationship issue
<a name="sagemaker-hyperpod-model-deployment-ts-trust"></a>

HyperPod inference operator fails to start with an STS AssumeRoleWithWebIdentity error, indicating an IAM trust relationship configuration problem.

**Error message:**

```
failed to enable inference watcher for HyperPod cluster *****: operation error SageMaker: UpdateClusterInference, 
get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, 
operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ****, 
api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

**Resolution:**

Update the trust relationship of the inference operator's IAM execution role with the following configuration.

Replace the following placeholders:
+ `<ACCOUNT_ID>`: Your AWS account ID
+ `<REGION>`: Your AWS Region
+ `<OIDC_ID>`: Your Amazon EKS cluster's OIDC provider ID
+ `<namespace>` and `<service-account-name>`: The namespace and name of the inference operator's Kubernetes service account

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:<namespace>:<service-account-name>",
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
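
To avoid hand-editing the JSON, you can render the trust policy from environment variables. This is a sketch under assumptions: the default values below (account ID, Region, OIDC ID, and the service account namespace/name) are placeholders that you must replace with your own, and the OIDC ID is the final path segment of `aws eks describe-cluster --name <cluster> --query "cluster.identity.oidc.issuer" --output text`.

```shell
# Render trust-policy.json with your values substituted. The defaults are
# illustrative placeholders -- export real values before running.
ACCOUNT_ID="${ACCOUNT_ID:-111122223333}"
REGION="${REGION:-us-west-2}"
OIDC_ID="${OIDC_ID:-EXAMPLED539D4633E53DE1B71EXAMPLE}"
SA_NAMESPACE="${SA_NAMESPACE:-hyperpod-inference-system}"
SA_NAME="${SA_NAME:-hyperpod-inference-operator-sa}"  # assumed name; use your operator's service account

cat > trust-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:${SA_NAMESPACE}:${SA_NAME}",
                    "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": { "Service": ["sagemaker.amazonaws.com"] },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
echo "Wrote trust-policy.json"
```

You can then apply it with `aws iam update-assume-role-policy --role-name <execution-role-name> --policy-document file://trust-policy.json`.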

**Verification:**

After updating the trust relationship:

1. Verify the role configuration in IAM console

1. Restart the inference operator if necessary

1. Monitor operator logs for successful startup

# Missing NVIDIA GPU plugin error
<a name="sagemaker-hyperpod-model-deployment-ts-gpu"></a>

Model deployment fails with GPU insufficiency error despite having available GPU nodes. This occurs when the NVIDIA device plugin is not installed in the HyperPod cluster.

**Error message:**

```
0/15 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 
5 Insufficient nvidia.com/gpu. preemption: 0/15 nodes are available: 
10 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
```

**Root cause:**
+ Kubernetes cannot detect GPU resources without the NVIDIA device plugin
+ Results in scheduling failures for GPU workloads

**Resolution:**

Install the NVIDIA GPU plugin by running:

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/refs/tags/v0.17.1/deployments/static/nvidia-device-plugin.yml
```

**Verification steps:**

1. Check the plugin deployment status:

   ```
   kubectl get pods -n kube-system | grep nvidia-device-plugin
   ```

1. Verify GPU resources are now visible:

   ```
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
   ```

1. Retry model deployment

**Note**  
Ensure NVIDIA drivers are installed on GPU nodes. Plugin installation is a one-time setup per cluster. May require cluster admin privileges to install.

# Inference operator fails to start
<a name="sagemaker-hyperpod-model-deployment-ts-startup"></a>

The inference operator pod fails to start and produces the following error message. This occurs because the permission policy on the operator execution role does not authorize `sts:AssumeRoleWithWebIdentity`. As a result, the operator component that runs on the control plane does not start.

**Error message:**

```
Warning Unhealthy 5m46s (x22 over 49m) kubelet Startup probe failed: Get "http://10.1.100.59:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

**Root cause:**
+ Permission policy of the inference operator execution role is not set to access authorization token for resources.

**Resolution:**

Update the `HyperpodInferenceAccessPolicy-ml-cluster` policy on the inference operator's execution role (`EXECUTION_ROLE_ARN`) to include all resources:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        }
    ]
}
```


**Verification steps:**

1. Update the policy.

1. Terminate the HyperPod inference operator pod.

1. Verify that the pod restarts without throwing any exceptions.

# Amazon SageMaker HyperPod Inference release notes
<a name="sagemaker-hyperpod-inference-release-notes"></a>

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod Inference. SageMaker HyperPod Inference enables you to deploy and scale machine learning models on your HyperPod clusters with enterprise-grade reliability. For general Amazon SageMaker HyperPod platform releases, updates, and improvements, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

For information about SageMaker HyperPod Inference capabilities and deployment options, see [Deploying models on Amazon SageMaker HyperPod](sagemaker-hyperpod-model-deployment.md).

## SageMaker HyperPod Inference release notes: v3.1
<a name="sagemaker-hyperpod-inference-release-notes-20260403"></a>

**Release Date:** April 3, 2026

**Summary**

Inference Operator v3.1 introduces custom Kubernetes pod configuration, custom certificate support, and per-pod request limits.

**Key Features**
+ **Custom Kubernetes Pod Configuration** – Added a new `kubernetes` field to the `InferenceEndpointConfig` CRD that allows users to customize inference pod configurations:
  + **Custom init containers** – Run user-defined init containers before the inference server starts (for example, cache warming, GDS setup). Init containers are injected after the operator's prefetch container.
  + **Custom volumes** – Add additional volumes (`emptyDir`, `hostPath`, `configMap`, etc.) to the pod spec, which can be referenced by init containers via `volumeMounts`.
  + **Custom scheduler name** – Specify a custom Kubernetes scheduler for pod placement.
+ **Custom Certificates** – Use your own ACM certificates for inference endpoints instead of operator-generated self-signed certificates, configured via `customCertificateConfig`. Supports publicly trusted ACM certificates, AWS Private CA certificates, and certificates imported from external CAs. The operator monitors certificate health and supports automatic renewal detection.
+ **Request Limits** – Control request handling per pod via the new `RequestLimits` configuration under `Worker`, with the following configurable fields:
  + `maxConcurrentRequests` – Maximum concurrent in-flight requests per pod.
  + `maxQueueSize` – Requests to queue when the concurrency limit is reached before rejecting.
  + `overflowStatusCode` – HTTP status code returned when limits are exceeded (default: 429).
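
Combined, the new fields might look like the following manifest sketch. This is hypothetical: the field names follow the release notes above, but the exact schema, casing, and nesting may differ, so verify against the published `InferenceEndpointConfig` CRD before applying.

```yaml
# Hypothetical sketch based on the v3.1 release notes; verify field names
# against the InferenceEndpointConfig CRD schema before use.
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: my-inference-config
spec:
  kubernetes:
    schedulerName: my-custom-scheduler      # custom Kubernetes scheduler
    initContainers:                         # injected after the operator's prefetch container
      - name: cache-warmer                  # for example, cache warming or GDS setup
        image: public.ecr.aws/example/cache-warmer:latest  # illustrative image
        volumeMounts:
          - name: warm-cache
            mountPath: /cache
    volumes:                                # emptyDir, hostPath, configMap, etc.
      - name: warm-cache
        emptyDir: {}
  worker:
    requestLimits:
      maxConcurrentRequests: 8              # max in-flight requests per pod
      maxQueueSize: 16                      # queued requests before rejection
      overflowStatusCode: 429               # returned when limits are exceeded (default)
```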

For detailed information including prerequisites and upgrade instructions, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-1-prerequisites"></a>

To use the Custom Certificates feature, add the following permissions to your Inference Operator execution role:

```
{  
    "Sid": "ACMCertificateAccess",  
    "Effect": "Allow",  
    "Action": [  
        "acm:DescribeCertificate",  
        "acm:GetCertificate"  
    ],  
    "Resource": "arn:aws:acm:*:*:certificate/*"  
}
```

### Upgrade to v3.1
<a name="sagemaker-hyperpod-inference-v3-1-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.1
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

## SageMaker HyperPod Inference release notes: v3.0
<a name="sagemaker-hyperpod-inference-release-notes-20260223"></a>

**Release Date:** February 23, 2026

**Summary**

Inference Operator 3.0 introduces EKS Add-on integration for simplified lifecycle management, Node Affinity support for granular scheduling control, and improved resource tagging. Existing Helm-based installations can be migrated to the EKS Add-on using the provided migration script. Update your Inference Operator execution role with new tagging permissions before upgrading.

**Key Features**
+ **EKS Add-on Integration** – Enterprise-grade lifecycle management with simplified installation experience
+ **Node Affinity** – Granular scheduling control for excluding spot instances, preferring availability zones, or targeting nodes with custom labels

For detailed information including prerequisites, upgrade instructions, and migration guidance, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-0-prerequisites"></a>

Before upgrading the Helm chart to version 3.0, add additional tagging permissions to your Inference Operator execution role. As part of improving resource tagging and security, the Inference Operator now tags ALB, Amazon S3, and ACM resources, which requires the following additional permissions:

```
{
    "Sid": "CertificateTaggingPermission",
    "Effect": "Allow",
    "Action": [
        "acm:AddTagsToCertificate"
    ],
    "Resource": "arn:aws:acm:*:*:certificate/*"
},
{
    "Sid": "S3PutObjectTaggingAccess",
    "Effect": "Allow",
    "Action": [
        "s3:PutObjectTagging"
    ],
    "Resource": [
        "arn:aws:s3:::<TLS_BUCKET>/*"
    ]
}
```

Replace `<TLS_BUCKET>` with the name of the Amazon S3 bucket that stores your TLS certificates.

### Upgrade to v3.0
<a name="sagemaker-hyperpod-inference-v3-0-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.0
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

### Helm to EKS Add-on Migration
<a name="sagemaker-hyperpod-inference-v3-0-migration"></a>

If the Inference Operator was installed through Helm before version 3.0, we recommend migrating to the EKS Add-on so that you receive timely updates as new Inference Operator features are released. The following script migrates the SageMaker HyperPod Inference Operator from a Helm-based installation to an EKS Add-on installation.

**Overview:** The script takes a cluster name and region as parameters, retrieves the existing Helm installation configuration, and migrates to EKS Add-on deployment. It creates new IAM roles for the Inference Operator, ALB Controller, and KEDA Operator.

Before migrating the Inference Operator, the script verifies that the required dependencies (S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server) exist. If they don't, it deploys them as Add-ons.

After the Inference Operator Add-on migration completes, the script also migrates the S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server if they were originally installed via the Inference Operator Helm chart. Use `--skip-dependencies-migration` to skip this step for those dependencies. Note that ALB and KEDA are installed as part of the Add-on in the same namespace as the Inference Operator, and are migrated as part of the Inference Operator Add-on itself.

**Important**  
During the migration, do not deploy new models; deployments will not complete until the migration finishes. Once the Inference Operator Add-on is in the ACTIVE state, you can deploy new models again. Migration typically takes 15 to 20 minutes, and usually completes within 30 minutes when only a few models are deployed.

**Migration Prerequisites:**
+ AWS CLI configured with appropriate credentials
+ kubectl configured with access to your EKS cluster
+ Helm installed
+ Existing Helm installation of hyperpod-inference-operator

**Note**  
Endpoints that are already running will not be interrupted during the migration process. Existing endpoints will continue to serve traffic without disruption throughout the migration.

**Getting the Migration Script:**

```
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator/migration
```

**Usage:**

```
./helm_to_addon.sh [OPTIONS] \
  --cluster-name <cluster-name> (Required) \
  --region <region> (Required) \
  --helm-namespace kube-system (Optional) \
  --auto-approve (Optional) \
  --skip-dependencies-migration (Optional) \
  --s3-mountpoint-role-arn <s3-mountpoint-role-arn> (Optional) \
  --fsx-role-arn <fsx-role-arn> (Optional)
```

**Options:**
+ `--cluster-name NAME` – EKS cluster name (required)
+ `--region REGION` – AWS region (required)
+ `--helm-namespace NAMESPACE` – Namespace where Helm chart is installed (default: kube-system) (optional)
+ `--s3-mountpoint-role-arn ARN` – S3 Mountpoint CSI driver IAM role ARN (optional)
+ `--fsx-role-arn ARN` – FSx CSI driver IAM role ARN (optional)
+ `--auto-approve` – Skip confirmation prompts. Mutually exclusive with `--step-by-step`; do not specify both (optional)
+ `--step-by-step` – Pause after each major step for review. Mutually exclusive with `--auto-approve` (optional)
+ `--skip-dependencies-migration` – Skip migration of Helm-installed dependencies to Add-ons. Use this if the dependencies were NOT installed via the Inference Operator Helm chart, or if you want to manage them separately (optional)

**Examples:**

Basic migration (migrates dependencies):

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1
```

Auto-approve without prompts:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --auto-approve
```

Skip dependency migration for the S3 Mountpoint CSI driver, FSx CSI driver, cert-manager, and metrics-server:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --skip-dependencies-migration
```

Provide existing S3 and FSx IAM roles:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --s3-mountpoint-role-arn arn:aws:iam::123456789012:role/s3-csi-role \
  --fsx-role-arn arn:aws:iam::123456789012:role/fsx-csi-role
```

**Backup Location:**

Backups are stored in `/tmp/hyperpod-migration-backup-<timestamp>/`

Backups enable safe migration and recovery:
+ **Rollback on Failure** – If migration fails, the script can automatically restore your cluster to its pre-migration state using the backed up configurations
+ **Audit Trail** – Provides a complete record of what existed before migration for troubleshooting and compliance
+ **Configuration Reference** – Allows you to compare pre-migration and post-migration configurations
+ **Manual Recovery** – If needed, you can manually inspect and restore specific resources from the backup directory
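To inspect what the script saved, you can look up the most recent backup directory. A minimal sketch follows; the helper name `latest_backup` is ours, not part of the migration script.

```
# Print the newest backup directory created by the migration script,
# or nothing if no migration has run on this machine yet.
latest_backup() {
  ls -dt /tmp/hyperpod-migration-backup-* 2>/dev/null | head -n 1
}

latest_backup
```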

**Rollback:**

If migration fails, the script prompts for user confirmation before initiating rollback to restore the previous state.

## SageMaker HyperPod Inference release notes: v2.3
<a name="sagemaker-hyperpod-inference-release-notes-20260203"></a>

**What's new**

This release introduces new optional fields in the Custom Resource Definitions (CRDs) to enhance deployment configuration flexibility.

**Features**
+ **Multi Instance Types**
  + **Enhanced deployment reliability** – Supports multi-instance type configurations with automatic failover to alternative instance types when preferred options lack capacity
  + **Intelligent resource scheduling** – Uses Kubernetes node affinity to prioritize instance types while guaranteeing deployment even when preferred resources are unavailable
  + **Optimized cost and performance** – Maintains your instance type preferences and prevents capacity-related failures during cluster fluctuations
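The failover behavior builds on standard Kubernetes node affinity. As a rough illustration of the underlying scheduling pattern (this is plain Kubernetes pod configuration, not the operator's CRD schema, and the instance types are example values), a pod spec that prefers one instance type but can fall back to another looks like:

```
# Prefer p5 capacity (weight 100), fall back to p4d (weight 50).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["ml.p5.48xlarge"]
    - weight: 50
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["ml.p4d.24xlarge"]
```

Because the affinity is preferred rather than required, the scheduler tries the higher-weight instance type first but can still place the pod on the fallback type when capacity runs out.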

**Bug Fixes**

Changes to the field `invocationEndpoint` in the spec of the `InferenceEndpointConfig` will now take effect:
+ If the `invocationEndpoint` field is patched or updated, dependent resources such as the `Ingress`, the Load Balancer, the `SageMakerEndpointRegistration`, and the SageMaker Endpoint are updated with the normalized value.
+ The `invocationEndpoint` value you provide is stored as-is in the `InferenceEndpointConfig` spec itself. When this value is used to create a Load Balancer and, if enabled, a SageMaker Endpoint, it is normalized to have exactly one leading forward slash.
  + `v1/chat/completions` is normalized to `/v1/chat/completions` for the `Ingress`, AWS Load Balancer, and SageMaker Endpoint. In the `SageMakerEndpointRegistration` spec, it is displayed as `v1/chat/completions`.
  + `///invoke` is normalized to `/invoke` for the `Ingress`, AWS Load Balancer, and SageMaker Endpoint. In the `SageMakerEndpointRegistration` spec, it is displayed as `invoke`.
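The slash normalization described above can be sketched in a few lines of shell. This is illustrative only; `normalize_path` is our name for the behavior, not a function of the operator.

```
# Collapse any number of leading slashes (including none) to exactly one.
normalize_path() {
  printf '/%s\n' "$(printf '%s' "$1" | sed 's|^/*||')"
}

normalize_path "v1/chat/completions"   # -> /v1/chat/completions
normalize_path "///invoke"             # -> /invoke
```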

**Installing Helm:**

Follow: [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm\_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart)

If you only want to install the inference operator, then after step 1 (`Set Up Your Helm Environment`), run `cd HyperPodHelmChart/charts/inference-operator`. Because you are then inside the inference operator chart directory itself, replace `helm_chart/HyperPodHelmChart` with `.` wherever it appears in the commands.

**Upgrade the operator to v2.3 if it is already installed:**

```
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

helm upgrade hyperpod-inference-operator . \
  -n kube-system \
  -f current-values.yaml \
  --set image.tag=v2.3
```