Inference operator installation failures through SageMaker AI console

Overview: When installing the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stacks may fail due to various issues. This section covers common failure scenarios and their resolutions.

Inference operator add-on installation failure through Quick or Custom install

Problem: The HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

Common causes:

  • Pod capacity limits exceeded on cluster nodes. The inference operator installation requires a minimum of 13 pods. The minimum recommended instance type is ml.c5.4xlarge.

  • IAM permission issues

  • Resource quota constraints

  • Network or VPC configuration problems
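
As a quick sanity check for the first cause, you can compare a node's free pod slots against the operator's 13-pod requirement. The sketch below uses fixed example values; in practice `allocatable` and `in_use` come from the kubectl checks in the diagnostic steps that follow.

```shell
# Hypothetical capacity check: do the node's free pod slots cover the
# inference operator's 13-pod requirement? Example values only.
allocatable=29   # e.g. from .status.allocatable.pods
in_use=20        # e.g. pods already scheduled on the node
required=13      # minimum pods needed by the inference operator
free=$(( allocatable - in_use ))
if [ "$free" -lt "$required" ]; then
  echo "insufficient: $free pod slots free, need $required"
else
  echo "ok: $free pod slots free"
fi
```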

Symptoms and diagnosis

Symptoms:

  • Inference operator add-on shows CREATE_FAILED or DEGRADED status in console

  • CloudFormation stack associated with the add-on is in CREATE_FAILED state

  • Installation progress stops or shows error messages

Diagnostic steps:

  1. Check the inference operator add-on status:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health,Issues:issues}" \
      --output json
  2. Check for pod limit issues:

    # Check current pod count per node
    kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'

    # Check pods running on each node
    kubectl get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c

    # Check for pod evictions or failures
    kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
  3. Check CloudFormation stack status (if using console installation):

    # List CloudFormation stacks related to the cluster
    aws cloudformation list-stacks \
      --region $REGION \
      --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
      --output table

    # Get detailed stack events
    aws cloudformation describe-stack-events \
      --stack-name <stack-name> \
      --region $REGION \
      --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
      --output table
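
To make the pod-counting pipeline in step 2 concrete, here is a toy sample of `kubectl get pods -o wide` output run through the same awk/sort/uniq stages (in that output format, column 8 is the node name). The pod and node names are made up for illustration.

```shell
# Toy sample of `kubectl get pods -o wide` output; column 8 is the node.
sample='pod-a 1/1 Running 0 5m 10.0.0.1 x node-1
pod-b 1/1 Running 0 5m 10.0.0.2 x node-1
pod-c 1/1 Running 0 5m 10.0.0.3 x node-2'

# Count how many pods land on each node, exactly as in step 2.
counts=$(printf '%s\n' "$sample" | awk '{print $8}' | sort | uniq -c)
echo "$counts"
```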

Resolution

To resolve the installation failure, save the current configuration, delete the failed add-on, fix the underlying issue, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

Step 1: Save the current configuration

  • Extract and save the add-on configuration before deletion:

    # Save the current configuration
    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query 'addon.configurationValues' \
      --output text > addon-config-backup.json

    # Verify the configuration was saved
    cat addon-config-backup.json

    # Pretty print for readability
    cat addon-config-backup.json | jq '.'

Step 2: Delete the failed add-on

  • Delete the inference operator add-on:

    aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION

    # Wait for deletion to complete
    echo "Waiting for add-on deletion..."
    aws eks wait addon-deleted \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || sleep 60

Step 3: Fix the underlying issue

Choose the appropriate resolution based on the failure cause:

If the issue is pod limit exceeded:

# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add an instance group with higher pod capacity.
# Different instance types support different maximum pod counts,
# for example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods).
aws sagemaker update-cluster \
  --cluster-name $HYPERPOD_CLUSTER_NAME \
  --region $REGION \
  --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale the existing node group to add more nodes
aws eks update-nodegroup-config \
  --cluster-name $EKS_CLUSTER_NAME \
  --nodegroup-name <nodegroup-name> \
  --scaling-config minSize=2,maxSize=10,desiredSize=5 \
  --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
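
The per-instance-type pod limits quoted above come from the Amazon VPC CNI formula, maxPods = ENIs × (IPv4 addresses per ENI − 1) + 2. As a sketch, the snippet below recomputes the m5.large figure, assuming its 3 ENIs with 10 IPv4 addresses each (network limits not stated in this guide).

```shell
# EKS (VPC CNI) max-pods formula: maxPods = ENIs * (IPv4 per ENI - 1) + 2.
# Assumed values for m5.large: 3 ENIs, 10 IPv4 addresses per ENI,
# which yields the 29-pod figure quoted above.
enis=3
ips_per_eni=10
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "max pods: $max_pods"
```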

Step 4: Reinstall the inference operator

After fixing the underlying issue, reinstall the inference operator through the SageMaker AI console (recommended) or by using the AWS CLI.

Step 5: Verify successful installation

# Check add-on status
aws eks describe-addon \
  --cluster-name $EKS_CLUSTER_NAME \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --region $REGION \
  --query "addon.{Status:status,Health:health}" \
  --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50

Cert-manager installation failed due to Kueue webhook not ready

Problem: The cert-manager add-on installation fails with a webhook error because the Task Governance (Kueue) webhook service has no available endpoints. This is a race condition: cert-manager tries to create resources before the Task Governance webhook pods are fully running. It can occur when the Task Governance add-on is installed alongside the inference operator during cluster creation.

Symptoms and diagnosis

Error message:

AdmissionRequestDenied Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": no endpoints available for service "kueue-webhook-service"

Root cause:

  • Task Governance add-on installs and registers a mutating webhook that intercepts all Deployment creations

  • Cert-manager add-on tries to create Deployment resources before Task Governance webhook pods are ready

  • Kubernetes admission control calls the Task Governance webhook, but it has no endpoints (pods not running yet)

Diagnostic step:

  1. Check cert-manager add-on status:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name cert-manager \
      --region $REGION \
      --query "addon.{Status:status,Health:health,Issues:issues}" \
      --output json

Resolution

Solution: Delete and reinstall cert-manager

The Task Governance webhook becomes ready within 60 seconds. Simply delete and reinstall the cert-manager add-on:

  1. Delete the failed cert-manager add-on:

    aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name cert-manager \
      --region $REGION
  2. Wait 30-60 seconds for the Task Governance webhook to become ready, then reinstall the cert-manager add-on:

    sleep 60
    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name cert-manager \
      --region $REGION
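
Because this failure is purely a timing issue, the fixed `sleep 60` in step 2 can also be replaced by polling until the webhook service has endpoints. Below is a minimal sketch of that wait loop; `webhook_ready` is a stub, and the kubectl command in the comment is only an assumed way to perform the real check against a live cluster.

```shell
# Minimal wait-until-ready sketch. `webhook_ready` is a stub that succeeds
# on the third check; against a real cluster it would be something like:
#   kubectl get endpoints kueue-webhook-service -n kueue-system \
#     -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q .
attempt=0
webhook_ready() {
  [ "$attempt" -ge 3 ]   # stub: endpoints appear on the third check
}
until webhook_ready; do
  attempt=$(( attempt + 1 ))
  sleep 0   # use a real interval (e.g. `sleep 5`) against a live cluster
done
echo "webhook ready after $attempt checks"
```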