

# Deploy foundation models and custom fine-tuned models
<a name="sagemaker-hyperpod-model-deployment-deploy"></a>

Whether you're deploying pre-trained open-weights or gated foundation models from Amazon SageMaker JumpStart or your own custom or fine-tuned models stored in Amazon S3 or Amazon FSx, SageMaker HyperPod provides the flexible, scalable infrastructure you need for production inference workloads.





|  | Deploy open-weights and gated foundation models from JumpStart | Deploy custom and fine-tuned models from Amazon S3 and Amazon FSx | 
| --- | --- | --- | 
| Description |  Deploy from a comprehensive catalog of pre-trained foundation models with automatic optimization and scaling policies tailored to each model family.  | Bring your own custom and fine-tuned models and leverage SageMaker HyperPod's enterprise infrastructure for production-scale inference. Choose between cost-effective storage with Amazon S3 or a high-performance file system with Amazon FSx. | 
| Key benefits | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 
| Deployment options |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 

The following sections step you through deploying models from Amazon SageMaker JumpStart and from Amazon S3 and Amazon FSx.

**Topics**
+ [Deploy models from JumpStart using Amazon SageMaker Studio](sagemaker-hyperpod-model-deployment-deploy-js-ui.md)
+ [Deploy models from JumpStart using kubectl](sagemaker-hyperpod-model-deployment-deploy-js-kubectl.md)
+ [Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl](sagemaker-hyperpod-model-deployment-deploy-ftm.md)
+ [Deploy custom fine-tuned models using the Python SDK and HPCLI](deploy-trained-model.md) 
+ [Deploy models from Amazon SageMaker JumpStart using the Python SDK and HPCLI](deploy-jumpstart-model.md) 

# Deploy models from JumpStart using Amazon SageMaker Studio
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui"></a>

The following steps show you how to deploy models from JumpStart using Amazon SageMaker Studio.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-prereqs"></a>

Verify that you've set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md). 

## Create a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-create"></a>

1. In Amazon SageMaker Studio, open the **JumpStart** landing page from the left navigation pane. 

1. Under **All public models**, choose a model you want to deploy.
**Note**  
If you’ve selected a gated model, you’ll have to accept the End User License Agreement (EULA).

1. Choose **SageMaker HyperPod**.

1. Under **Deployment settings**, JumpStart will recommend an instance for deployment. You can modify these settings if necessary.

   1. If you modify **Instance type**, ensure it’s compatible with the chosen **HyperPod cluster**. If there aren’t any compatible instances, you’ll need to select a new **HyperPod cluster** or contact your admin to add compatible instances to the cluster.

   1. To prioritize the model deployment, install the task governance add-on, create compute allocations, and set up task rankings for the cluster policy. After this is done, you can select a priority for the model deployment, which can be used to preempt other deployments and tasks on the cluster. 

   1. Enter the namespace to which your admin has granted you access. You may have to reach out to your admin directly to get the exact namespace. After you enter a valid namespace, the **Deploy** button is enabled so that you can deploy the model.

   1. If your instance type is partitioned (MIG enabled), select a **GPU partition type**.

   1. If you want to speed up LLM inference, enable L2 KV cache or intelligent routing. By default, only the L1 KV cache is enabled. For more details on KV caching and intelligent routing, see [SageMaker HyperPod model deployment](sagemaker-hyperpod-model-deployment.md).

1. Choose **Deploy** and wait for the **Endpoint** to be created.

1. After the **Endpoint** has been created, select **Test inference**.

## Edit a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-edit"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Edit**.

1. Under **Deployment settings**, you can enable or disable **Auto-scaling**, and change the number of **Max replicas**.

1. Select **Save**.

1. The **Status** will change to **Updating**. Once it changes back to **In service**, your changes are complete and you’ll see a message confirming it.

## Delete a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-delete"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Delete**.

1. In the **Delete HyperPod deployment window**, select the checkbox.

1. Choose **Delete**.

1. The **Status** will change to **Deleting**. Once the HyperPod deployment has been deleted, you’ll see a message confirming it.

# Deploy models from JumpStart using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-js-kubectl"></a>

The following steps show you how to deploy a JumpStart model to a HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands. 

## Prerequisites
<a name="kubectl-prerequisites"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed the [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and the [jq](https://jqlang.org/) command-line JSON processor in your terminal.

## Setup and configuration
<a name="kubectl-prerequisites-setup-and-configuration"></a>

1. Select your Region.

   ```
   export REGION=<region>
   ```

1. View all SageMaker public hub models and HyperPod clusters.

1. Select a model from the SageMaker public hub (`SageMakerPublicHub`). The hub has a large number of models available, so you can use `NextToken` to iteratively list them all.

   ```
   aws sagemaker list-hub-contents --hub-name SageMakerPublicHub --hub-content-type Model --query '{Models: HubContentSummaries[].{ModelId:HubContentName,Version:HubContentVersion}, NextToken: NextToken}' --output json
   ```

   ```
   export MODEL_ID="deepseek-llm-r1-distill-qwen-1-5b"
   export MODEL_VERSION="2.0.4"
   ```

1. Configure the cluster name and instance type in the following variables.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   aws sagemaker list-clusters --output table
   
   # Select the cluster name where you want to deploy the model.
   export HYPERPOD_CLUSTER_NAME="<insert cluster name here>"
   
   # Select the instance type that is relevant for your model deployment and exists within the selected cluster.
   # List available instances in your HyperPod cluster
   aws sagemaker describe-cluster --cluster-name=$HYPERPOD_CLUSTER_NAME --query "InstanceGroups[].{InstanceType:InstanceType,Count:CurrentCount}" --output table
   
   # List supported instance types for the selected model
   aws sagemaker describe-hub-content --hub-name SageMakerPublicHub --hub-content-type Model --hub-content-name "$MODEL_ID" --output json | jq -r '.HubContentDocument | fromjson | {Default: .DefaultInferenceInstanceType, Supported: .SupportedInferenceInstanceTypes}'
   
   # Select an instance type from the cluster that is compatible with the model.
   # Make sure that the selected instance type is either the default or a supported instance type for the JumpStart model.
   export INSTANCE_TYPE="<instance_type_in_cluster>"
   ```

1. Confirm with the cluster admin which namespace you are permitted to use. The admin should have created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="default"
   ```

1. Set a name for the endpoint and the custom object to be created.

   ```
   export SAGEMAKER_ENDPOINT_NAME="deepseek-qwen-1-5b-test"
   ```

1. The following is an example of a `deepseek-llm-r1-distill-qwen-1-5b` model deployment from JumpStart. Create a similar deployment yaml file based on the model you selected in the preceding step.
**Note**  
If your cluster uses GPU partitioning with MIG, you can request specific MIG profiles by adding the `acceleratorPartitionType` field to the server specification. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

   ```
   cat << EOF > jumpstart_model.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: $SAGEMAKER_ENDPOINT_NAME
     namespace: $CLUSTER_NAMESPACE 
   spec:
     sageMakerEndpoint:
       name: $SAGEMAKER_ENDPOINT_NAME
     model:
       modelHubName: SageMakerPublicHub
       modelId: $MODEL_ID
       modelVersion: $MODEL_VERSION
     server:
       instanceType: $INSTANCE_TYPE
       # Optional: Specify GPU partition profile for MIG-enabled instances
       # acceleratorPartitionType: "1g.10gb"
     metrics:
       enabled: true
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
     autoScalingSpec:
       cloudWatchTrigger:
         name: "SageMaker-Invocations"
         namespace: "AWS/SageMaker"
         useCachedMetrics: false
         metricName: "Invocations"
         targetValue: 10
         minValue: 0.0
         metricCollectionPeriod: 30
         metricStat: "Sum"
         metricType: "Average"
         dimensions:
           - name: "EndpointName"
             value: "$SAGEMAKER_ENDPOINT_NAME"
           - name: "VariantName"
             value: "AllTraffic"
   EOF
   ```
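The model-listing step above notes that the public hub is paginated with `NextToken`. The following is a minimal Python sketch of that pagination loop around boto3's `list_hub_contents` call; passing the call in as a parameter is an illustrative choice (not part of the official tooling) so the loop can be exercised without AWS credentials.

```python
def list_all_hub_models(list_fn, hub_name="SageMakerPublicHub"):
    """Collect (model_id, version) pairs across every page of hub results.

    list_fn is typically boto3.client("sagemaker").list_hub_contents;
    it is injected here so the pagination logic is testable offline.
    """
    models, token = [], None
    while True:
        kwargs = {"HubName": hub_name, "HubContentType": "Model"}
        if token:
            # Pass the token from the previous page to fetch the next one.
            kwargs["NextToken"] = token
        page = list_fn(**kwargs)
        models.extend(
            (m["HubContentName"], m["HubContentVersion"])
            for m in page["HubContentSummaries"]
        )
        token = page.get("NextToken")
        if not token:  # no more pages
            return models
```

With credentials configured, calling `list_all_hub_models(boto3.client("sagemaker").list_hub_contents)` returns the same model IDs and versions as the CLI query shown earlier.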

## Deploy your model
<a name="kubectl-deploy-your-model"></a>

**Update your kubernetes configuration and deploy your model**

1. Configure kubectl to connect to the HyperPod cluster orchestrated by Amazon EKS.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your JumpStart model.

   ```
   kubectl apply -f jumpstart_model.yaml
   ```

**Monitor the status of your model deployment**

1. Verify that the model is successfully deployed.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the endpoint is successfully created.

   ```
   aws sagemaker describe-endpoint --endpoint-name=$SAGEMAKER_ENDPOINT_NAME --output table
   ```

1. Invoke your model endpoint. You can programmatically retrieve example payloads from the `JumpStartModel` object.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```

## Manage your deployment
<a name="kubectl-manage-your-deployment"></a>

Delete your JumpStart model deployment once you no longer need it.

```
kubectl delete JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
```

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the status of Kubernetes deployment. This command inspects the underlying Kubernetes deployment object that manages the pods running your model. Use this to troubleshoot pod scheduling, resource allocation, and container startup issues.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of your JumpStart model resource. This command examines the custom `JumpStartModel` resource that manages the high-level model configuration and deployment lifecycle. Use this to troubleshoot model-specific issues like configuration errors or SageMaker AI endpoint creation problems.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of all Kubernetes objects. This command provides a comprehensive overview of all related Kubernetes resources in your namespace. Use this for a quick health check to see the overall state of pods, services, deployments, and custom resources associated with your model deployment.

   ```
   kubectl get pods,svc,deployment,JumpStartModel,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm"></a>

The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl. 

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-prereqs"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed the [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and the [jq](https://jqlang.org/) command-line JSON processor in your terminal.

## Setup and configuration
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-setup"></a>

Replace all placeholder values with your actual resource identifiers.

1. Select your Region in your environment.

   ```
   export REGION=<region>
   ```

1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   # Specify your HyperPod cluster name here
   export HYPERPOD_CLUSTER_NAME="<Hyperpod_cluster_name>"
   
   # NOTE: For this sample deployment, we use ml.g5.8xlarge for the DeepSeek-R1 1.5B model, which has sufficient memory and GPU
   instance_type="ml.g5.8xlarge"
   ```

1. Initialize your cluster namespace. Your cluster admin should have already created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="<namespace>"
   ```

1. Create a CRD using one of the following options:

------
#### [ Using Amazon FSx as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-fsx"
      ```

   1. Configure the Amazon FSx file system ID to be used.

      ```
      export FSX_FILE_SYSTEM_ID="fs-1234abcd"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon FSx and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_fsx_cluster_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        instanceType: ml.g5.8xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          fsxStorage:
            fileSystemId: $FSX_FILE_SYSTEM_ID
          modelLocation: deepseek-1-5b
          modelSourceType: fsx
        worker:
          environmentVariables:
          - name: HF_MODEL_ID
            value: /opt/ml/model
          - name: SAGEMAKER_PROGRAM
            value: inference.py
          - name: SAGEMAKER_SUBMIT_DIRECTORY
            value: /opt/ml/model/code
          - name: MODEL_CACHE_ROOT
            value: /opt/ml/model
          - name: SAGEMAKER_ENV
            value: '1'
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0
          modelInvocationPort:
            containerPort: 8080
            name: http
          modelVolumeMount:
            mountPath: /opt/ml/model
            name: model-weights
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              cpu: 30000m
              memory: 100Gi
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
      EOF
      ```

------
#### [ Using Amazon S3 as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon S3 and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        endpointName: $SAGEMAKER_ENDPOINT_NAME
        instanceType: ml.g5.8xlarge
        invocationEndpoint: invocations
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: deepseek15b
          prefetchEnabled: true
        worker:
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
              cpu: 25600m
              memory: 102Gi
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------
#### [ Using Amazon S3 with KV caching and intelligent routing ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon S3, with KV caching and intelligent routing enabled.

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: Llama-3.1-8B-Instruct
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: models/Llama-3.1-8B-Instruct
          prefetchEnabled: false
        kvCacheSpec:
          enableL1Cache: true
      #    enableL2Cache: true
      #    l2CacheSpec:
      #      l2CacheBackend: redis/sagemaker
      #      l2CacheLocalUrl: redis://redis.redis-system.svc.cluster.local:6379
        intelligentRoutingSpec:
          enabled: true
        tlsConfig:
          tlsCertificateOutputS3Uri: s3://sagemaker-lmcache-fceb9062-tls-6f6ee470
        metrics:
          enabled: true
          modelMetrics:
            port: 8000
        loadBalancer:
          healthCheckPath: /health
        worker:
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              cpu: "6"
              memory: 30Gi
              nvidia.com/gpu: "4"
          image: lmcache/vllm-openai:latest
          args:
            - "/opt/ml/model"
            - "--max-model-len"
            - "20000"
            - "--tensor-parallel-size"
            - "4"
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------

## Configure KV caching and intelligent routing for improved performance
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route"></a>

1. Enable KV caching by setting `enableL1Cache` and `enableL2Cache` to `true`. Then, set `l2CacheBackend` to `redis` or `tieredstorage`, and if you use Redis, set `l2CacheLocalUrl` to the Redis cluster URL.

   ```
     kvCacheSpec:
       enableL1Cache: true
       enableL2Cache: true
       l2CacheSpec:
         l2CacheBackend: <redis | tieredstorage>
         l2CacheLocalUrl: <redis cluster URL if l2CacheBackend is redis >
   ```
**Note**  
If the Redis cluster is not within the same Amazon VPC as the HyperPod cluster, encryption of the data in transit is not guaranteed.
**Note**  
You don't need to set `l2CacheLocalUrl` if `tieredstorage` is selected.

1. Enable intelligent routing by setting `enabled` to `true` under `intelligentRoutingSpec`. You can specify which routing strategy to use under `routingStrategy`. If no routing strategy is specified, it defaults to `prefixaware`.

   ```
   intelligentRoutingSpec:
       enabled: true
       routingStrategy: <routing strategy to use>
   ```

1. Enable router metrics and caching metrics by setting `enabled` to `true` under `metrics`. The `port` value needs to be the same as the `containerPort` value under `modelInvocationPort`.

   ```
   metrics:
       enabled: true
       modelMetrics:
         port: <port value>
       ...
       modelInvocationPort:
         containerPort: <port value>
   ```

## Deploy your model from Amazon S3 or Amazon FSx
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-deploy"></a>

1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your InferenceEndpointConfig model with one of the following options:

------
#### [ Deploy with Amazon FSx as a source ]

   ```
   kubectl apply -f deploy_fsx_cluster_inference.yaml
   ```

------
#### [ Deploy with Amazon S3 as a source ]

   ```
   kubectl apply -f deploy_s3_inference.yaml
   ```

------

## Verify the status of your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-verify"></a>

1. Check whether the model deployed successfully.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check that the endpoint is successfully created.

   ```
   kubectl describe SageMakerEndpointRegistration $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```
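Endpoint creation can take several minutes, so it's often convenient to poll until the status leaves a transitional state before invoking. The following is a minimal Python sketch; the `describe_fn` parameter is an illustrative stand-in for boto3's `describe_endpoint` so the loop can be tested without AWS access.

```python
import time

def wait_for_endpoint(describe_fn, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll DescribeEndpoint until the endpoint leaves a transitional state.

    describe_fn is typically boto3.client("sagemaker").describe_endpoint.
    Returns the terminal status, for example "InService" or "Failed".
    """
    for _ in range(max_polls):
        status = describe_fn(EndpointName=endpoint_name)["EndpointStatus"]
        if status not in ("Creating", "Updating"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} did not stabilize in time")
```

A caller would then invoke the endpoint only if `wait_for_endpoint(...)` returns `"InService"`.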

## Manage your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-manage"></a>

When you're finished testing your deployment, use the following commands to clean up your resources.

**Note**  
Verify that you no longer need the deployed model or stored data before proceeding.

**Clean up your resources**

1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.

   ```
   kubectl delete inferenceendpointconfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the cleanup completed successfully.

   ```
   # Check that Kubernetes resources are removed
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

   ```
   # Verify SageMaker endpoint is deleted (should return error or empty)
   aws sagemaker describe-endpoint --endpoint-name $SAGEMAKER_ENDPOINT_NAME --region $REGION
   ```
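As noted above, `describe-endpoint` returns an error once the endpoint is gone. The following Python sketch captures that check; treating an exception mentioning `ValidationException` or a missing endpoint as confirmation of deletion is an assumption based on how botocore typically surfaces this error, not an official API contract.

```python
def endpoint_deleted(describe_fn, endpoint_name):
    """Return True once DescribeEndpoint can no longer find the endpoint.

    describe_fn is typically boto3.client("sagemaker").describe_endpoint,
    which raises an error for endpoints that do not exist.
    """
    try:
        describe_fn(EndpointName=endpoint_name)
        return False  # endpoint still exists
    except Exception as err:
        # In practice this is a botocore ClientError; a stricter check
        # would inspect err.response["Error"]["Code"].
        msg = str(err)
        return "Could not find endpoint" in msg or "ValidationException" in msg
```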

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the Kubernetes deployment status.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check status of all Kubernetes objects. Get a comprehensive view of all related Kubernetes resources in your namespace. This gives you a quick overview of what's running and what might be missing.

   ```
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```