

# Best practices
<a name="best-practices"></a>

The following topics provide guidance on best practices for deploying machine learning models in Amazon SageMaker AI.

**Topics**
+ [Best practices for deploying models on SageMaker AI Hosting Services](deployment-best-practices.md)
+ [Monitor Security Best Practices](monitor-sec-best-practices.md)
+ [Low latency real-time inference with AWS PrivateLink](realtime-endpoints-privatelink.md)
+ [Migrate inference workload from x86 to AWS Graviton](realtime-endpoints-graviton.md)
+ [Troubleshoot Amazon SageMaker AI model deployments](deploy-model-troubleshoot.md)
+ [Inference cost optimization best practices](inference-cost-optimization.md)
+ [Best practices to minimize interruptions during GPU driver upgrades](inference-gpu-drivers.md)
+ [Best practices for endpoint security and health with Amazon SageMaker AI](best-practice-endpoint-security.md)
+ [Updating inference containers to comply with the NVIDIA Container Toolkit](container-nvidia-compliance.md)

# Best practices for deploying models on SageMaker AI Hosting Services
<a name="deployment-best-practices"></a>

When hosting models using SageMaker AI hosting services, consider the following:
+ Typically, a client application sends requests to the SageMaker AI HTTPS endpoint to obtain inferences from a deployed model. You can also send requests to this endpoint from your Jupyter notebook during testing.
+ You can deploy a model trained with SageMaker AI to your own deployment target. To do that, you need to know the algorithm-specific format of the model artifacts that were generated by model training. For more information about output formats, see the section corresponding to the algorithm you are using in [Common Data Formats for Training](cdf-training.md). 
+ You can deploy multiple variants of a model to the same SageMaker AI HTTPS endpoint. This is useful for testing variations of a model in production. For example, suppose that you've deployed a model into production and want to test a variation of it by directing a small amount of traffic, say 5%, to the new model. To do this, create an endpoint configuration that describes both variants of the model, specifying a `ProductionVariant` for each in your request to `CreateEndpointConfig`. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). 
+ You can configure a `ProductionVariant` to use Application Auto Scaling. For information about configuring automatic scaling, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md).
+ You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML Compute instance configurations of existing model variants, or change the distribution of traffic among model variants. To modify an endpoint, you provide a new endpoint configuration. SageMaker AI implements the changes without any downtime. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) and [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html). 
+ Changing or deleting model artifacts or changing inference code after deploying a model produces unpredictable results. If you need to change or delete model artifacts or change inference code, modify the endpoint by providing a new endpoint configuration. Once you provide the new endpoint configuration, you can change or delete the model artifacts corresponding to the old endpoint configuration.
+ If you want to get inferences on entire datasets, consider using batch transform as an alternative to hosting services. For information, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).
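The traffic-splitting scenario above can be sketched as an endpoint configuration with two `ProductionVariant` entries. All names and the instance type below are illustrative placeholders, not values from your account; each variant's share of traffic is its weight divided by the sum of all weights.

```python
# Hypothetical endpoint configuration splitting traffic 95/5 between
# an existing model and a challenger model. Replace the names with your own.
endpoint_config_request = {
    "EndpointConfigName": "my-two-variant-config",
    "ProductionVariants": [
        {
            "VariantName": "Production",          # current model
            "ModelName": "my-production-model",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.95,         # ~95% of traffic
        },
        {
            "VariantName": "Challenger",          # new model under test
            "ModelName": "my-challenger-model",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.05,         # ~5% of traffic
        },
    ],
}
# sagemaker_client.create_endpoint_config(**endpoint_config_request)
```

Later, you can shift traffic between the two variants without redeploying by calling `UpdateEndpointWeightsAndCapacities`.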

## Deploy Multiple Instances Across Availability Zones
<a name="deployment-best-practices-availability-zones"></a>

**Create robust endpoints when hosting your model.** SageMaker AI endpoints can help protect your application from [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) outages and instance failures. If an outage occurs or an instance fails, SageMaker AI automatically attempts to distribute your instances across Availability Zones. For this reason, we strongly recommend that you deploy multiple instances for each production endpoint. 

If you are using an [Amazon Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html), configure the VPC with at least two [subnets](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html#SageMaker-Type-VpcConfig-Subnets), each in a different Availability Zone. If an outage occurs or an instance fails, Amazon SageMaker AI automatically attempts to distribute your instances across Availability Zones. 

In general, to achieve more reliable performance, host your endpoints on a larger number of smaller [instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html) spread across different Availability Zones.

**Deploy inference components for high availability.** In addition to the instance recommendations above, to achieve 99.95% availability, configure your inference components with more than two copies. Also, in your managed auto scaling policy, set the minimum number of instances to two as well.
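As a sketch (with hypothetical names and sizes throughout), the copy count from this recommendation maps to the `RuntimeConfig` of an inference component, and the instance floor maps to the `ManagedInstanceScaling` settings of the endpoint's production variant:

```python
# Hypothetical request bodies; adjust names, resource sizes, and limits
# for your own account before calling the SageMaker AI APIs.
inference_component_request = {
    "InferenceComponentName": "my-inference-component",
    "EndpointName": "my-endpoint",
    "VariantName": "AllTraffic",
    "Specification": {
        "ModelName": "my-model",
        "ComputeResourceRequirements": {
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    "RuntimeConfig": {"CopyCount": 2},  # multiple copies for availability
}

production_variant = {
    "VariantName": "AllTraffic",
    "InstanceType": "ml.m5.xlarge",
    "ManagedInstanceScaling": {
        "Status": "ENABLED",
        "MinInstanceCount": 2,  # keep at least two instances online
        "MaxInstanceCount": 4,
    },
}
# sagemaker_client.create_inference_component(**inference_component_request)
```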

# Monitor Security Best Practices
<a name="monitor-sec-best-practices"></a>

Monitor your usage of SageMaker AI as it relates to security best practices by using [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html). Security Hub CSPM uses security controls to evaluate resource configurations and security standards to help you comply with various compliance frameworks. For more information about using Security Hub CSPM to evaluate SageMaker AI resources, see [Amazon SageMaker AI controls](https://docs.aws.amazon.com/securityhub/latest/userguide/sagemaker-controls.html) in the *AWS Security Hub CSPM User Guide*.

# Low latency real-time inference with AWS PrivateLink
<a name="realtime-endpoints-privatelink"></a>

 Amazon SageMaker AI provides low latency for real-time inferences while maintaining high availability and resiliency using multi-AZ deployment. The application latency is made up of two primary components: infrastructure or overhead latency and model inference latency. Reduction of overhead latency opens up new possibilities such as deploying more complex, deep, and accurate models or splitting monolithic applications into scalable and maintainable microservice modules. You can reduce the latency for real-time inferences with SageMaker AI using an AWS PrivateLink deployment. With AWS PrivateLink, you can privately access all SageMaker API operations from your Virtual Private Cloud (VPC) in a scalable manner by using interface VPC endpoints. An interface VPC endpoint is an elastic network interface in your subnet with private IP addresses that serves as an entry point for all SageMaker API calls.

By default, a SageMaker AI endpoint with 2 or more instances is deployed in at least 2 AWS Availability Zones (AZs), and instances in any AZ can process invocations. This results in one or more AZ “hops” that contribute to the overhead latency. An AWS PrivateLink deployment with the `privateDNSEnabled` option set to `true` alleviates this by achieving two objectives:
+ It keeps all inference traffic within your VPC.
+ It keeps invocation traffic in the same AZ as the client that originated it when using SageMaker Runtime. This avoids the “hops” between AZs, reducing the overhead latency.

The following sections of this guide demonstrate how you can reduce the latency for real-time inferences with AWS PrivateLink deployment.

**Topics**
+ [Deploy AWS PrivateLink](#deploy-privatelink)
+ [Deploy SageMaker AI endpoint in a VPC](#deploy-sagemaker-inference-endpoint)
+ [Invoke the SageMaker AI endpoint](#invoke-sagemaker-inference-endpoint)

## Deploy AWS PrivateLink
<a name="deploy-privatelink"></a>

To deploy AWS PrivateLink, first create an interface endpoint for the VPC from which you connect to the SageMaker AI endpoints. Follow the steps in [Access an AWS service using an interface VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html) to create the interface endpoint. While creating the endpoint, select the following settings in the console interface:
+ Select the **Enable DNS name** checkbox under **Additional Settings**.
+ Select the appropriate security groups and subnets to use with the SageMaker AI endpoints.

Also make sure that the VPC has DNS hostnames turned on. For more information on how to change DNS attributes for your VPC, see [View and update DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-updating).
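The console steps above can also be sketched programmatically. The following boto3 parameters (with placeholder IDs, and assuming the `us-east-1` Region) create an interface endpoint for SageMaker Runtime with private DNS enabled:

```python
# Placeholder IDs; substitute your own VPC, subnets, and security group.
vpc_endpoint_request = {
    "VpcEndpointType": "Interface",
    "VpcId": "vpc-0123456789abcdef0",
    # SageMaker Runtime serves the InvokeEndpoint API.
    "ServiceName": "com.amazonaws.us-east-1.sagemaker.runtime",
    "SubnetIds": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "PrivateDnsEnabled": True,  # equivalent of the "Enable DNS name" checkbox
}
# import boto3
# ec2_client = boto3.client("ec2")
# ec2_client.create_vpc_endpoint(**vpc_endpoint_request)
```

Specifying two subnets in different Availability Zones matches the multi-AZ guidance earlier in this guide.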

## Deploy SageMaker AI endpoint in a VPC
<a name="deploy-sagemaker-inference-endpoint"></a>

To achieve low overhead latency, create a SageMaker AI endpoint using the same subnets that you specified when deploying AWS PrivateLink. These subnets should match the AZs of your client application, as shown in the following code snippet.

```
model_name = '<the-name-of-your-model>'

vpc = 'vpc-0123456789abcdef0'
subnet_a = 'subnet-0123456789abcdef0'
subnet_b = 'subnet-0123456789abcdef1'
security_group = 'sg-0123456789abcdef0'

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url
    },
    VpcConfig = {
        'SecurityGroupIds': [security_group],
        'Subnets': [subnet_a, subnet_b],
    },
)
```

The preceding code snippet assumes that you have followed the steps in [Before you begin](realtime-endpoints-deploy-models.md#deploy-prereqs).

## Invoke the SageMaker AI endpoint
<a name="invoke-sagemaker-inference-endpoint"></a>

Finally, specify the SageMaker Runtime client and invoke the SageMaker AI endpoint as shown in the following code snippet.

```
endpoint_name = '<endpoint-name>'
  
runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                          ContentType='text/csv', 
                                          Body=payload)
```

For more information on endpoint configuration, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).

# Migrate inference workload from x86 to AWS Graviton
<a name="realtime-endpoints-graviton"></a>

 [AWS Graviton](https://aws.amazon.com/ec2/graviton/) is a series of ARM-based processors designed by AWS. They are more energy efficient than x86-based processors and offer a compelling price-performance ratio. Amazon SageMaker AI offers Graviton-based instances so that you can take advantage of these advanced processors for your inference needs. 

You can migrate your existing inference workloads from x86-based instances to Graviton-based instances by using either ARM-compatible container images or multi-architecture container images. This guide assumes that you are either using [AWS Deep Learning container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) or your own ARM-compatible container images. For more information on building your own images, check [Building your image](https://github.com/aws/deep-learning-containers#building-your-image). 

At a high level, migrating an inference workload from x86-based instances to Graviton-based instances is a four-step process: 

1. Push container images to Amazon Elastic Container Registry (Amazon ECR), an AWS managed container registry.

1. Create a SageMaker AI Model.

1. Create an endpoint configuration.

1. Create an endpoint.

 The following sections of this guide provide more details regarding the above steps. Replace the *user placeholder text* in the code examples with your own information. 

**Topics**
+ [Push container images to Amazon ECR](#realtime-endpoints-graviton-ecr)
+ [Create a SageMaker AI Model](#realtime-endpoints-graviton-model)
+ [Create an endpoint configuration](#realtime-endpoints-graviton-epc)
+ [Create an endpoint](#realtime-endpoints-graviton-ep)

## Push container images to Amazon ECR
<a name="realtime-endpoints-graviton-ecr"></a>

You can push your container images to Amazon ECR with the AWS CLI. When using an ARM-compatible image, verify that it supports the ARM architecture: 

```
docker inspect deep-learning-container-uri
```

 The response `"Architecture": "arm64"` indicates that the image supports ARM architecture. You can push it to Amazon ECR with the `docker push` command. For more information, check [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html). 

Multi-architecture container images are fundamentally a set of container images supporting different architectures or operating systems that you can refer to by a common manifest name. If you are using multi-architecture container images, then in addition to pushing the images to Amazon ECR, you will also have to push a manifest list to Amazon ECR. A manifest list allows for the nested inclusion of other image manifests, where each included image is specified by architecture, operating system, and other platform attributes. The following example creates a manifest list and pushes it to Amazon ECR. 

1. Create a manifest list.

   ```
   docker manifest create aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:amd64 \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:arm64
   ```

1.  Annotate the manifest list, so that it correctly identifies which image is for which architecture. 

   ```
   docker manifest annotate --arch arm64 aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:arm64
   ```

1. Push the manifest.

   ```
   docker manifest push aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository
   ```

 For more information on creating and pushing manifest lists to Amazon ECR, check [Introducing multi-architecture container images for Amazon ECR](https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/), and [Pushing a multi-architecture image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-multi-architecture-image.html). 

## Create a SageMaker AI Model
<a name="realtime-endpoints-graviton-model"></a>

 Create a SageMaker AI Model by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. 

```
import boto3
from sagemaker import get_execution_role


aws_region = "aws-region"
sagemaker_client = boto3.client("sagemaker", region_name=aws_region)

role = get_execution_role()

sagemaker_client.create_model(
    ModelName = "model-name",
    PrimaryContainer = {
        "Image": "deep-learning-container-uri",
        "ModelDataUrl": "model-s3-location",
        "Environment": {
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "inference-script-s3-location",
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_REGION": aws_region,
        }
    },
    ExecutionRoleArn = role
)
```

## Create an endpoint configuration
<a name="realtime-endpoints-graviton-epc"></a>

 Create an endpoint configuration by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. For a list of Graviton-based instances, check [Compute optimized instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html). 

```
sagemaker_client.create_endpoint_config(
    EndpointConfigName = "endpoint-config-name",
    ProductionVariants = [
        {
            "VariantName": "variant-name",
            "ModelName": "model-name",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.c7g.xlarge", # Graviton-based instance
       }
    ]
)
```

## Create an endpoint
<a name="realtime-endpoints-graviton-ep"></a>

 Create an endpoint by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API. 

```
sagemaker_client.create_endpoint(
    EndpointName = "endpoint-name",
    EndpointConfigName = "endpoint-config-name"
)
```

# Troubleshoot Amazon SageMaker AI model deployments
<a name="deploy-model-troubleshoot"></a>

If you encounter an issue when deploying machine learning models in Amazon SageMaker AI, see the following guidance.

**Topics**
+ [Detection Errors in the Active CPU Count](#deploy-model-troubleshoot-jvms)
+ [Issues with deploying a model.tar.gz file](#deploy-model-troubleshoot-tarballs)
+ [Primary container did not pass ping health checks](#deploy-model-troubleshoot-ping)

## Detection Errors in the Active CPU Count
<a name="deploy-model-troubleshoot-jvms"></a>

If you deploy a SageMaker AI model with a Linux Java Virtual Machine (JVM), you might encounter detection errors that prevent using available CPU resources. This issue affects some JVMs that support Java 8 and Java 9, and most that support Java 10 and Java 11. These JVMs implement a mechanism that detects and handles the CPU count and the maximum memory available when running a model in a Docker container, and, more generally, within Linux `taskset` commands or control groups (cgroups). SageMaker AI deployments take advantage of some of the settings that the JVM uses for managing these resources. Currently, this causes the container to incorrectly detect the number of available CPUs. 

SageMaker AI doesn't limit access to CPUs on an instance. However, the JVM might detect the CPU count as `1` when more CPUs are available for the container. As a result, the JVM adjusts all of its internal settings to run as if only `1` CPU core is available. These settings affect garbage collection, locks, compiler threads, and other JVM internals, which negatively affects the concurrency, throughput, and latency of the container.

For an example of the misdetection, in a container configured for SageMaker AI that is deployed with a JVM based on Java8u191 and that has four available CPUs on the instance, run the following command to start your JVM:

```
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version
```

This generates the following output:

```
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
```

Many of the JVMs affected by this issue have an option to disable this behavior and reestablish full access to all of the CPUs on the instance. Disable the unwanted behavior and establish full access to all instance CPUs by including the `-XX:-UseContainerSupport` parameter when starting Java applications. For example, run the `java` command to start your JVM as follows:

```
java -XX:-UseContainerSupport -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version
```

This generates the following output:

```
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
```

Check whether the JVM used in your container supports the `-XX:-UseContainerSupport` parameter. If it does, always pass the parameter when you start your JVM. This provides access to all of the CPUs in your instances. 

You might also encounter this issue when indirectly using a JVM in SageMaker AI containers, for example, when using a JVM to support SparkML Scala. The `-XX:-UseContainerSupport` parameter also affects the output returned by the Java `Runtime.getRuntime().availableProcessors()` API.

## Issues with deploying a model.tar.gz file
<a name="deploy-model-troubleshoot-tarballs"></a>

When you deploy a model using a `model.tar.gz` file, the model tarball should not include any symlinks. Symlinks cause the model creation to fail. Also, we recommend that you do not include any unnecessary files in the tarball.
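Because symlinks in the tarball cause model creation to fail, it can be worth inspecting `model.tar.gz` before you upload it. The following sketch uses Python's standard `tarfile` module; the function name is our own, not a SageMaker AI API:

```python
import tarfile

def find_links(tarball_path):
    """Return the names of symlink or hard-link members in a model tarball."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.issym() or m.islnk()]

# Example: fail fast before deploying if the tarball contains links.
# links = find_links("model.tar.gz")
# if links:
#     raise ValueError(f"Remove symlinks before deploying: {links}")
```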

## Primary container did not pass ping health checks
<a name="deploy-model-troubleshoot-ping"></a>

 If your primary container fails ping health checks with the following error message, it indicates that there is an issue with your container or script: 

```
The primary container for production variant beta did not pass the ping health check. Please check CloudWatch Logs logs for this endpoint.
```

To troubleshoot this issue, check the CloudWatch logs for the endpoint in question for errors that prevent the container from responding to `/ping` or `/invocations`. The logs may provide an error message that points to the issue. After you identify the error and its cause, resolve it. 

 It is also good practice to test the model deployment locally before creating an endpoint. 
+  Use local mode in the SageMaker SDK to imitate the hosted environment by deploying the model to a local endpoint. For more information, see [Local Mode](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode). 
+  Use vanilla Docker commands to test that the container responds to `/ping` and `/invocations`. For more information, see [local_test](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container/local_test). 

# Inference cost optimization best practices
<a name="inference-cost-optimization"></a>

The following content provides techniques and considerations for optimizing the cost of endpoints. You can use these recommendations to optimize the cost for both new and existing endpoints.

## Best practices
<a name="inference-cost-optimization-list"></a>

To optimize your SageMaker AI Inference costs, follow these best practices.

### Pick the best inference option for the job.
<a name="collapsible-1"></a>

SageMaker AI offers four different inference options. You may be able to save on costs by picking the option that best matches your workload.
+ Use [real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) for low latency workloads with predictable traffic patterns that need to have consistent latency characteristics and are always available. You pay for using the instance.
+ Use [serverless inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) for synchronous workloads that have a spiky traffic pattern and can accept variations in the p99 latency. Serverless inference automatically scales to meet your workload traffic so you don’t pay for any idle resources. You only pay for the duration of the inference request. The same model and containers can be used with both real-time and serverless inference so you can switch between these two modes if your needs change.
+ Use [asynchronous inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) for asynchronous workloads that process up to 1 GB of data (such as text corpus, image, video, and audio) that are latency insensitive and cost sensitive. With asynchronous inference, you can control costs by specifying a fixed number of instances for the optimal processing rate instead of provisioning for the peak. You can also scale down to zero to save additional costs.
+ Use [batch inference](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) for workloads for which you need inference for a large set of data for processes that happen offline (that is, you don’t need a persistent endpoint). You pay for the instance for the duration of the batch inference job.
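As a sketch of the serverless option (names below are placeholders), the endpoint configuration replaces the instance settings with a `ServerlessConfig`, so you pay per request rather than per instance:

```python
# Placeholder names; memory can range from 1024 to 6144 MB in 1 GB increments.
serverless_config_request = {
    "EndpointConfigName": "my-serverless-config",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 5,  # concurrent invocations allowed
            },
        }
    ],
}
# sagemaker_client.create_endpoint_config(**serverless_config_request)
```

The same model can later be moved back to a real-time configuration by swapping the `ServerlessConfig` for instance settings.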

### Opt in to a SageMaker AI Savings Plan.
<a name="collapsible-2"></a>
+ If you have a consistent usage level across all SageMaker AI services, you can opt in to a SageMaker AI Savings Plan to help reduce your costs by up to 64%.
+ [Amazon SageMaker AI Savings Plans](https://aws.amazon.com/savingsplans/ml-pricing/) provide a flexible pricing model for Amazon SageMaker AI, in exchange for a commitment to a consistent amount of usage (measured in \$/hour) for a one-year or three-year term. These plans automatically apply to eligible SageMaker AI ML instance usage, including SageMaker Studio Classic Notebook, SageMaker On-Demand Notebook, SageMaker Processing, SageMaker Data Wrangler, SageMaker Training, SageMaker Real-Time Inference, and SageMaker Batch Transform, regardless of instance family, size, or Region. For example, you can change usage from a CPU ml.c5.xlarge instance running in US East (Ohio) to an ml.Inf1 instance in US West (Oregon) for inference workloads at any time and automatically continue to pay the Savings Plans price.

### Optimize your model to run better.
<a name="collapsible-3"></a>
+ Unoptimized models can lead to longer run times and use more resources. You may choose to use more or bigger instances to improve performance; however, this leads to higher costs.
+ By optimizing your models to be more performant, you may be able to lower costs by using fewer or smaller instances while keeping the same or better performance characteristics. You can use [SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) with SageMaker AI Inference to automatically optimize models. For more details and samples, see [Model performance optimization with SageMaker Neo](neo.md).

### Use the most optimal instance type and size for real-time inference.
<a name="collapsible-4"></a>
+ SageMaker Inference has over 70 instance types and sizes that can be used to deploy ML models including AWS Inferentia and Graviton chipsets that are optimized for ML. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest cost for your models.
+ By using [Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html), you can quickly compare different instances to understand the performance of the model and the costs. With these results, you can choose the instance to deploy with the best return on investment.

### Improve efficiency and costs by combining multiple endpoints into a single endpoint for real-time inference.
<a name="collapsible-5"></a>
+ Costs can quickly add up when you deploy multiple endpoints, especially if the endpoints don’t fully utilize the underlying instances. To understand if the instance is under-utilized, check the utilization metrics (CPU, GPU, etc) in Amazon CloudWatch for your instances. If you have more than one of these endpoints, you can combine the models or containers on these multiple endpoints into a single endpoint.
+ Using [Multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) (MME) or [Multi-container endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html) (MCE), you can deploy multiple ML models or containers in a single endpoint to share the instance across multiple models or containers and improve your return on investment. To learn more, see this [Save on inference costs by using Amazon SageMaker AI multi-model endpoints](https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/) or [Deploy multiple serving containers on a single instance using Amazon SageMaker AI multi-container endpoints](https://aws.amazon.com/blogs/machine-learning/deploy-multiple-serving-containers-on-a-single-instance-using-amazon-sagemaker-multi-container-endpoints/) on the AWS Machine Learning blog.
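As a sketch, a multi-model endpoint starts from a model whose container sets `Mode` to `MultiModel` and points `ModelDataUrl` at an S3 prefix holding the model artifacts; all names, the role ARN, and the bucket below are placeholders:

```python
# Placeholder names and ARNs; the S3 prefix must end with "/" and contain
# the individual model artifacts that the endpoint loads on demand.
multi_model_request = {
    "ModelName": "my-multi-model",
    "ExecutionRoleArn": "arn:aws:iam::111122223333:role/my-sagemaker-role",
    "PrimaryContainer": {
        "Image": "inference-container-uri",
        "Mode": "MultiModel",  # one endpoint serves every model under the prefix
        "ModelDataUrl": "s3://amzn-s3-demo-bucket/models/",
    },
}
# sagemaker_client.create_model(**multi_model_request)
```

At invocation time, the caller selects which model to run by passing a target model name, so many models share the same instances.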

### Set up autoscaling to match your workload requirements for real-time and asynchronous inference.
<a name="collapsible-6"></a>
+ Without autoscaling, you need to provision for peak traffic or risk model unavailability. Unless the traffic to your model is steady throughout the day, there will be excess unused capacity. This leads to low utilization and wasted resources.
+ [Autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) is an out-of-the-box feature that monitors your workloads and dynamically adjusts the capacity to maintain steady and predictable performance at the possible lowest cost. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances, helping you reduce your compute cost. To learn more, see [Configuring autoscaling inference endpoints in Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/) on the AWS Machine Learning blog.
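A minimal target-tracking setup for an endpoint variant can be sketched with the Application Auto Scaling API. The endpoint and variant names, capacity limits, and the target value below are placeholders to tune for your workload:

```python
# Placeholder endpoint and variant names; both requests must reference
# the same resource ID.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

scalable_target_request = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,
}

scaling_policy_request = {
    "PolicyName": "my-invocations-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Target invocations per instance per minute.
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # seconds to wait before removing capacity
        "ScaleOutCooldown": 60,  # seconds to wait before adding capacity
    },
}
# import boto3
# autoscaling_client = boto3.client("application-autoscaling")
# autoscaling_client.register_scalable_target(**scalable_target_request)
# autoscaling_client.put_scaling_policy(**scaling_policy_request)
```

A longer scale-in cooldown than scale-out cooldown is a common choice so that capacity is added quickly but removed conservatively.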

# Best practices to minimize interruptions during GPU driver upgrades
<a name="inference-gpu-drivers"></a>

Over time, SageMaker AI Model Deployment upgrades the GPU drivers on the ML instances for the Real-time, Batch, and Asynchronous Inference options to give customers access to improvements from the driver providers. Below you can see the GPU driver version supported for each inference option. Different driver versions can change how your model interacts with the GPUs. Below are some strategies to help you understand how your application works with different driver versions. 

## Current versions and supported instance families
<a name="inference-gpu-drivers-versions"></a>

Amazon SageMaker AI Inference supports the following drivers and instance families:

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-gpu-drivers.html)

## Troubleshoot your model container with GPU capabilities
<a name="inference-gpu-drivers-troubleshoot"></a>

If you encounter an issue when running your GPU workload, see the following guidance:

### GPU card detection failure or NVIDIA initialization error
<a name="collapsible-section-0"></a>

Run the `nvidia-smi` (NVIDIA System Management Interface) command from within the Docker container. If there is a GPU card detection failure or NVIDIA initialization error, the command returns the following error message:

```
Failed to initialize NVML: Driver/library version mismatch
```

Based on your use case, follow these best practices to resolve the failure or error:
+ Follow the best practice recommendation described in the [If you bring your own (BYO) model containers](#collapsible-byoc) dropdown.
+ Follow the best practice recommendation described in the [If you use a CUDA compatibility layer](#collapsible-cuda-compat) dropdown.

Refer to the [NVIDIA System Management Interface page](https://developer.nvidia.com/nvidia-system-management-interface) on the NVIDIA website for more information.

### `CannotStartContainerError`
<a name="collapsible-section-cannot-start-container"></a>

If your GPU instance uses an NVIDIA driver version that is not compatible with the CUDA version in the Docker container, deploying an endpoint fails with the following error message:

```
 Failure reason CannotStartContainerError. Please ensure the model container for variant <variant_name> starts correctly when invoked with 'docker run <image> serve'
```

Based on your use case, follow these best practices to resolve the failure or error:
+ Follow the best practice recommendation described in the [The driver my container depends on is greater than the version on the ML GPU instances](#collapsible-driver-dependency-higher) dropdown.
+ Follow the best practice recommendation described in the [If you use a CUDA compatibility layer](#collapsible-cuda-compat) dropdown.

## Best practices for working with mismatched driver versions
<a name="inference-gpu-drivers-cuda-toolkit-updates"></a>

The following sections describe how to handle mismatches between the driver version that your container depends on and the driver version on the ML GPU instance:

### The driver my container depends on is lower than the version on the ML GPU instance
<a name="collapsible-driver-dependency-lower"></a>

No action is required. NVIDIA provides backwards compatibility.

### The driver my container depends on is greater than the version on the ML GPU instances
<a name="collapsible-driver-dependency-higher"></a>

If it is a minor version difference, no action is required. NVIDIA provides minor version forward compatibility.

If it is a major version difference, you must install the CUDA Compatibility Package. For more information, see [CUDA Compatibility Package](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) in the NVIDIA documentation.

**Important**  
The CUDA Compatibility Package is not backward compatible, so you must disable it if the driver version on the instance is greater than the CUDA Compatibility Package version.
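The decision above reduces to comparing major version numbers. A sketch of that decision (the driver versions here are hypothetical):

```
#!/bin/bash

# Hypothetical driver versions: the one your container was built against,
# and the one installed on the ML GPU instance
CONTAINER_DRIVER="560.35.03"
INSTANCE_DRIVER="550.144.01"

CONTAINER_MAJOR=$(echo "$CONTAINER_DRIVER" | cut -d. -f1)
INSTANCE_MAJOR=$(echo "$INSTANCE_DRIVER" | cut -d. -f1)

if [ "$CONTAINER_MAJOR" -gt "$INSTANCE_MAJOR" ]; then
    # Major version difference: the container needs newer driver features
    ACTION="install the CUDA Compatibility Package"
elif [ "$CONTAINER_MAJOR" -eq "$INSTANCE_MAJOR" ]; then
    ACTION="no action required (minor version forward compatibility)"
else
    ACTION="no action required (backward compatibility)"
fi
echo "$ACTION"
```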

### If you bring your own (BYO) model containers
<a name="collapsible-byoc"></a>

Ensure that no NVIDIA driver packages are bundled in the image, because they could conflict with the NVIDIA driver version on the host.
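As a sanity check during your image build, you can scan the installed package list for driver packages. The following sketch assumes a Debian-based image; the package-list string here is a stand-in for real `dpkg -l` output:

```
#!/bin/bash

# Returns "conflict" if a package list contains an NVIDIA driver package, else "clean".
# In a real build, feed it `dpkg -l` (or `rpm -qa`) output captured from the image.
check_for_bundled_driver() {
    echo "$1" | grep -q 'nvidia-driver' && echo "conflict" || echo "clean"
}

RESULT=$(check_for_bundled_driver "ii  python3  3.10.12  amd64  interactive high-level language")
echo "$RESULT"
```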

### If you use a CUDA compatibility layer
<a name="collapsible-cuda-compat"></a>

To verify that the platform NVIDIA driver version supports the CUDA Compatibility Package version installed in the model container, see the [CUDA documentation](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package). If the platform NVIDIA driver version does not support the CUDA Compatibility Package version, you can disable or remove the CUDA Compatibility Package from the model container image. If the latest NVIDIA driver version supports the CUDA compatibility library version, we suggest that you enable the CUDA Compatibility Package based on the detected NVIDIA driver version for future compatibility. To enable it, add the following code snippet to the container startup shell script (the `ENTRYPOINT` script).

The script demonstrates how to dynamically switch the CUDA Compatibility Package on or off based on the NVIDIA driver version detected on the host where your model container is deployed. When SageMaker AI releases a newer NVIDIA driver version, the installed CUDA Compatibility Package is turned off automatically if the CUDA application is supported natively on the new driver.

```
#!/bin/bash

# Succeeds (returns 0) when version $1 is strictly lower than version $2
verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

if [ -f /usr/local/cuda/compat/libcuda.so.1 ]; then
    # The compat symlink target encodes the highest driver version the package supports
    CUDA_COMPAT_MAX_DRIVER_VERSION=$(readlink /usr/local/cuda/compat/libcuda.so.1 | cut -d'.' -f 3-)
    echo "CUDA compat package should be installed for NVIDIA driver smaller than ${CUDA_COMPAT_MAX_DRIVER_VERSION}"
    # Read the driver version reported by the host kernel module
    NVIDIA_DRIVER_VERSION=$(sed -n 's/^NVRM.*Kernel Module *\([0-9.]*\).*$/\1/p' /proc/driver/nvidia/version 2>/dev/null || true)
    echo "Current installed NVIDIA driver version is ${NVIDIA_DRIVER_VERSION}"
    if verlt "$NVIDIA_DRIVER_VERSION" "$CUDA_COMPAT_MAX_DRIVER_VERSION"; then
        echo "Adding CUDA compat to LD_LIBRARY_PATH"
        export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
        echo $LD_LIBRARY_PATH
    else
        echo "Skipping CUDA compat setup as newer NVIDIA driver is installed"
    fi
else
    echo "Skipping CUDA compat setup as package not found"
fi
```
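The `verlt` helper in the script succeeds only when its first argument is a strictly lower version than the second (ordering via `sort -V`). A quick standalone sanity check of that behavior, using the same function with sample version numbers:

```
#!/bin/bash

# Same helper as in the startup script: succeeds when version $1 < version $2
verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

verlt "535.54.03" "550.144.01" && echo "535.54.03 is lower: compat libraries would be added"
verlt "550.144.01" "535.54.03" || echo "550.144.01 is not lower: compat setup would be skipped"
verlt "550.144.01" "550.144.01" || echo "equal versions are not treated as lower"
```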

# Best practices for endpoint security and health with Amazon SageMaker AI
<a name="best-practice-endpoint-security"></a>

To address the latest security issues, Amazon SageMaker AI automatically patches endpoints to the latest and most secure software. However, if you incorrectly modify your endpoint dependencies, Amazon SageMaker AI can't automatically patch your endpoints or replace your unhealthy instances. To ensure your endpoints remain eligible for automatic updates, apply the following best practices.

## Don't delete resources while your endpoints use them
<a name="dont-delete-resources-in-use"></a>

Avoid deleting any of the following resources if you have existing endpoints that use them:
+ The model definition that you create with the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) action in the Amazon SageMaker API.
+ Any model artifacts that you specify for the [`ModelDataUrl`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-ModelDataUrl) parameter.
+ The IAM role and permissions that you specify for the [`ExecutionRoleArn`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-ExecutionRoleArn) parameter.
**Reminder**  
In the model definition that your endpoint uses, ensure that the IAM role that you specified has the correct permissions. For more information about the required permissions for Amazon SageMaker AI endpoints, see [CreateModel API: Execution Role Permissions](sagemaker-roles.md#sagemaker-roles-createmodel-perms).
+ The inference images that you specify for the [`Image`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-Image) parameter, if you use your own inference code.
**Reminder**  
If you use the private registry feature, ensure that Amazon SageMaker AI can access the private registry as long as you're using the endpoint.
+ The Amazon VPC subnets and security groups that you specify for the [`VpcConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-VpcConfig) parameter.
+ The endpoint configuration that you create with the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) action in the Amazon SageMaker API.
+ Any KMS keys or Amazon S3 buckets that you specify in the endpoint configuration.
**Reminder**  
Ensure you don’t disable these KMS keys.

## Follow these procedures to update your endpoints
<a name="procedures-to-update-endpoint"></a>

When you update your Amazon SageMaker AI endpoints, use any of the following procedures that apply to your needs.

**To update your model definition settings**

1. Create a new model definition with your updated settings by using the CreateModel action in the Amazon SageMaker API.

1. Create a new endpoint configuration that uses the new model definition. To do this, use the CreateEndpointConfig action in the Amazon SageMaker API.

1. Update your endpoint with the new endpoint configuration so that your updated model definition settings take effect.

1. (Optional) Delete the old endpoint configuration if you're not using it with any other endpoints. You can also delete the resources that you specified in the model definition if you're not using them with any other endpoints. These resources include model artifacts in Amazon S3 and inference images.

**To update your endpoint configuration**

1. Create a new endpoint configuration with your updated settings.

1. Update your endpoint with the new configuration so that your updates take effect.

1. (Optional) Delete the old endpoint configuration if you're not using it with any other endpoints. You can also delete the resources that you specified in the model definition if you're not using them with any other endpoints. These resources include model artifacts in Amazon S3 and inference images.

Whenever you create a new model definition or endpoint configuration, we recommend that you use a unique name. If you want to update these resources and retain their original names, use the following procedures.
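One lightweight convention for unique names is a timestamp suffix, as in the following sketch (the base names `my-model` and `my-endpoint-config` are placeholders):

```
#!/bin/bash

# Append a timestamp so each model definition and endpoint configuration gets a fresh name
SUFFIX=$(date +%Y-%m-%d-%H-%M-%S)
MODEL_NAME="my-model-${SUFFIX}"
ENDPOINT_CONFIG_NAME="my-endpoint-config-${SUFFIX}"

echo "$MODEL_NAME"
echo "$ENDPOINT_CONFIG_NAME"
```

Unique names let you roll out a new configuration and then delete the old one only after the endpoint update succeeds, rather than deleting a live resource first.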

**To update your model settings and retain the original model name**

1. Delete the existing model definition. At this point, any endpoint that uses the model is broken, but you fix this in the following steps.

1. Create the model definition again with your updated settings, and use the same model name.

1. Create a new endpoint configuration that uses the updated model definition.

1. Update your endpoint with the new endpoint configuration so that your updates take effect.

**To update your endpoint configuration and retain the original configuration name**

1. Delete the existing endpoint configuration.

1. Create a new endpoint configuration with your updated settings, and use the original name.

1. Update your endpoint with the new configuration so that your updates take effect.

# Updating inference containers to comply with the NVIDIA Container Toolkit
<a name="container-nvidia-compliance"></a>

As of versions 1.17.4 and higher, the NVIDIA Container Toolkit no longer mounts CUDA compatibility libraries automatically. This change in behavior could affect your SageMaker AI inference workloads. Your SageMaker AI endpoints and batch transform jobs might use containers that are incompatible with the latest versions of the NVIDIA Container Toolkit. To ensure that your workloads comply with the latest requirements, you might need to update your endpoints or configure your batch transform jobs.

## Updating SageMaker AI endpoints for compliance
<a name="endpoint-compliance"></a>

We recommend that you update your existing SageMaker AI endpoints or create new ones that support the latest default behavior.

To ensure that your endpoint is compatible with the latest versions of the NVIDIA Container Toolkit, follow these steps:

1. Update how you set up the CUDA compatibility libraries if you bring your own container.

1. Specify an inference Amazon Machine Image (AMI) that supports the latest NVIDIA Container Toolkit behavior. You specify an AMI when you update an existing endpoint or create a new one.

### Updating the CUDA compatibility setup if you bring your own container
<a name="cuda-compatibility"></a>

The CUDA compatibility libraries enable forward compatibility. This compatibility applies to any CUDA toolkit versions that are newer than the NVIDIA driver provided by the SageMaker AI instance.

You must enable the CUDA compatibility libraries only when the NVIDIA driver that the SageMaker AI instance uses has an older version than the CUDA toolkit in the model container. If your model container does not require CUDA compatibility, you can skip this step. For example, you can skip this step if you don't plan to use a newer CUDA toolkit than those provided by SageMaker AI instances.

Because of the changes introduced in the NVIDIA Container Toolkit version 1.17.4, you can explicitly enable CUDA compatibility libraries, if needed, by adding them to `LD_LIBRARY_PATH` in the container.

We suggest that you enable CUDA compatibility based on the detected NVIDIA driver version. To enable it, add the following code snippet to the container startup shell script (the `ENTRYPOINT` script).

The following script demonstrates how to dynamically switch the CUDA compatibility libraries on or off based on the NVIDIA driver version detected on the host where your model container is deployed.

```
#!/bin/bash

# Succeeds (returns 0) when version $1 is strictly lower than version $2
verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

if [ -f /usr/local/cuda/compat/libcuda.so.1 ]; then
    # The compat symlink target encodes the highest driver version the package supports
    CUDA_COMPAT_MAX_DRIVER_VERSION=$(readlink /usr/local/cuda/compat/libcuda.so.1 | cut -d'.' -f 3-)
    echo "CUDA compat package should be installed for NVIDIA driver smaller than ${CUDA_COMPAT_MAX_DRIVER_VERSION}"
    # Read the driver version reported by the host kernel module
    NVIDIA_DRIVER_VERSION=$(sed -n 's/^NVRM.*Kernel Module *\([0-9.]*\).*$/\1/p' /proc/driver/nvidia/version 2>/dev/null || true)
    echo "Current installed NVIDIA driver version is ${NVIDIA_DRIVER_VERSION}"
    if verlt "$NVIDIA_DRIVER_VERSION" "$CUDA_COMPAT_MAX_DRIVER_VERSION"; then
        echo "Adding CUDA compat to LD_LIBRARY_PATH"
        export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
        echo $LD_LIBRARY_PATH
    else
        echo "Skipping CUDA compat setup as newer NVIDIA driver is installed"
    fi
else
    echo "Skipping CUDA compat setup as package not found"
fi
```

### Specifying an Inference AMI that complies with the NVIDIA Container Toolkit
<a name="specify-inference-ami"></a>

In the `InferenceAmiVersion` parameter of the `ProductionVariant` data type, you can select the AMI for a SageMaker AI endpoint. Each supported AMI is a preconfigured image that AWS configures with a specific set of software and driver versions.

By default, the SageMaker AI AMIs follow the legacy behavior. They automatically mount CUDA compatibility libraries in the container. To make an endpoint use the new behavior, you must specify an inference AMI version that is configured for the new behavior.

The following inference AMI versions currently follow the new behavior. They don't mount CUDA compatibility libraries automatically.

al2-ami-sagemaker-inference-gpu-2-1  
+ NVIDIA driver version: 535.54.03
+ CUDA version: 12.2

al2-ami-sagemaker-inference-gpu-3-1  
+ NVIDIA driver version: 550.144.01
+ CUDA version: 12.4
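To check which AMI version an existing endpoint configuration already specifies, you can pull `InferenceAmiVersion` out of the `describe-endpoint-config` response. The sketch below runs against a canned sample response instead of calling the API; the configuration values are hypothetical:

```
#!/bin/bash

# Sample describe-endpoint-config response (hypothetical values). In practice:
# CONFIG_JSON=$(aws sagemaker describe-endpoint-config --endpoint-config-name <name> --output json)
CONFIG_JSON='{"EndpointConfigName":"my-config","ProductionVariants":[{"VariantName":"AllTraffic","InferenceAmiVersion":"al2-ami-sagemaker-inference-gpu-3-1"}]}'

# Extract the AMI version of the first variant
AMI_VERSION=$(echo "$CONFIG_JSON" | sed -n 's/.*"InferenceAmiVersion":"\([^"]*\)".*/\1/p')
echo "$AMI_VERSION"
```

If the output is empty or names a legacy AMI, update the endpoint to one of the versions listed above.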

### Updating an existing endpoint
<a name="update-existing-endpoint"></a>

Use the following example to update an existing endpoint. The example uses an inference AMI version that disables automatic mounting of CUDA compatibility libraries.

```
ENDPOINT_NAME="<endpoint name>"
INFERENCE_AMI_VERSION="al2-ami-sagemaker-inference-gpu-3-1"

# Obtaining current endpoint configuration
CURRENT_ENDPOINT_CFG_NAME=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointConfigName" --output text)
NEW_ENDPOINT_CFG_NAME="${CURRENT_ENDPOINT_CFG_NAME}-new"

# Copying Endpoint Configuration with AMI version specified
aws sagemaker describe-endpoint-config \
    --endpoint-config-name "${CURRENT_ENDPOINT_CFG_NAME}" \
    --output json | \
jq "del(.EndpointConfigArn, .CreationTime) | . + {
    EndpointConfigName: \"${NEW_ENDPOINT_CFG_NAME}\",
    ProductionVariants: (.ProductionVariants | map(.InferenceAmiVersion = \"${INFERENCE_AMI_VERSION}\"))
}" > /tmp/new_endpoint_config.json

# Make sure all fields in the new endpoint config look as expected
cat /tmp/new_endpoint_config.json

# Creating new endpoint config
aws sagemaker create-endpoint-config \
   --cli-input-json file:///tmp/new_endpoint_config.json
    
# Updating the endpoint
aws sagemaker update-endpoint \
    --endpoint-name "$ENDPOINT_NAME" \
    --endpoint-config-name "$NEW_ENDPOINT_CFG_NAME" \
    --retain-all-variant-properties
```

### Creating a new endpoint
<a name="create-new-endpoint"></a>

Use the following example to create a new endpoint. The example uses an inference AMI version that disables automatic mounting of CUDA compatibility libraries.

```
INFERENCE_AMI_VERSION="al2-ami-sagemaker-inference-gpu-3-1"

aws sagemaker create-endpoint-config \
    --endpoint-config-name "<endpoint_config>" \
    --production-variants '[{
        ....
        "InferenceAmiVersion": "'"${INFERENCE_AMI_VERSION}"'",
        ...
    }]'

aws sagemaker create-endpoint \
    --endpoint-name "<endpoint_name>" \
    --endpoint-config-name "<endpoint_config>"
```

## Running compliant batch transform jobs
<a name="batch-compliance"></a>

*Batch transform* is the inference option that's best suited for requests to process large amounts of data offline. To create batch transform jobs, you use the `CreateTransformJob` API action. For more information, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

The changed behavior of the NVIDIA Container Toolkit affects batch transform jobs. To run a batch transform that complies with the NVIDIA Container Toolkit requirements, do the following:

1. If you want to run batch transform with a model for which you've brought your own container, first, update the container for CUDA compatibility. To update it, follow the process in [Updating the CUDA compatibility setup if you bring your own container](#cuda-compatibility).

1. Use the `CreateTransformJob` API action to create the batch transform job. In your request, set the `SAGEMAKER_CUDA_COMPAT_DISABLED` environment variable to `true`. This environment variable instructs the container not to automatically mount CUDA compatibility libraries.

   For example, when you create a batch transform job by using the AWS CLI, you set the environment variable with the `--environment` parameter:

   ```
   aws sagemaker create-transform-job \
       --environment '{"SAGEMAKER_CUDA_COMPAT_DISABLED": "true"}'\
       . . .
   ```