

# Deploy models for inference
<a name="deploy-model"></a>

With Amazon SageMaker AI, you can start getting predictions, or *inferences*, from your trained machine learning models. SageMaker AI provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. With SageMaker AI Inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden. SageMaker AI provides various inference options, such as real-time endpoints for low-latency inference, serverless endpoints for fully managed infrastructure and auto scaling, and asynchronous endpoints for batches of requests. By using the appropriate inference option for your use case, you can ensure efficient model deployment and inference.

## Choosing a feature
<a name="deploy-model-choose"></a>

There are several use cases for deploying ML models with SageMaker AI. This section describes those use cases, as well as the SageMaker AI feature we recommend for each use case. 

### Use cases
<a name="deploy-model-use-cases"></a>

The following are the main use cases for deploying ML models with SageMaker AI.
+ **Use case 1: Deploy a machine learning model in a low-code or no-code environment.** For beginners or those new to SageMaker AI, you can deploy pre-trained models using Amazon SageMaker JumpStart through the Amazon SageMaker Studio interface, without the need for complex configurations.
+ **Use case 2: Use code to deploy machine learning models with more flexibility and control.** Experienced ML practitioners can deploy their own models with customized settings for their application needs using the `ModelBuilder` class in the SageMaker AI Python SDK, which provides fine-grained control over various settings, such as instance types, network isolation, and resource allocation.
+ **Use case 3: Deploy machine learning models at scale.** For advanced users and organizations who want to manage models at scale in production, use the AWS SDK for Python (Boto3) and CloudFormation along with your desired Infrastructure as Code (IaC) and CI/CD tools to provision resources and automate resource management.

### Recommended features
<a name="deploy-model-recommended"></a>

The following table describes key considerations and tradeoffs for SageMaker AI features corresponding with each use case.


|  | Use case 1 | Use case 2 | Use case 3 | 
| --- | --- | --- | --- | 
| SageMaker AI feature | Use [JumpStart in Studio](jumpstart-foundation-models-use-studio-updated.md) to accelerate your foundation model deployment. | Deploy models using [ModelBuilder from the SageMaker Python SDK](how-it-works-modelbuilder-creation.md). | [Deploy and manage models at scale with CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_SageMaker.html). | 
| Description | Use the Studio UI to deploy pre-trained models from a catalog to pre-configured inference endpoints. This option is ideal for citizen data scientists, or for anyone who wants to deploy a model without configuring complex settings. | Use the ModelBuilder class from the Amazon SageMaker AI Python SDK to deploy your own model and configure deployment settings. This option is ideal for experienced data scientists, or for anyone who has their own model to deploy and requires fine-grained control. | Use CloudFormation and Infrastructure as Code (IaC) for programmatic control and automation for deploying and managing SageMaker AI models. This option is ideal for advanced users who require consistent and repeatable deployments. | 
| Optimized for | Fast and streamlined deployments of popular open source models | Deploying your own models | Ongoing management of models in production | 
| Considerations | Lack of customization for container settings and specific application needs | No UI, requires that you're comfortable developing and maintaining Python code | Requires infrastructure management and organizational resources, and also requires familiarity with the AWS SDK for Python (Boto3) or with CloudFormation templates. | 
| Recommended environment | A SageMaker AI domain | A Python development environment configured with your AWS credentials and the SageMaker Python SDK installed, or a SageMaker AI IDE such as [SageMaker JupyterLab](studio-updated-jl.md) | The AWS CLI, a local development environment, and Infrastructure as Code (IaC) and CI/CD tools | 

### Additional options
<a name="deploy-model-additional"></a>

SageMaker AI provides different options for your inference use cases, giving you choice over the technical breadth and depth of your deployments:
+ **Deploying a model to an endpoint.** When deploying your model, consider the following options:
  + [Real-time inference](realtime-endpoints.md). Real-time inference is ideal for inference workloads where you have interactive, low latency requirements.
  + [Deploy models with Amazon SageMaker Serverless Inference](serverless-endpoints.md). Use Serverless Inference to deploy models without configuring or managing any of the underlying infrastructure. This option is ideal for workloads that have idle periods between traffic spikes and can tolerate cold starts.
  + [Asynchronous inference](async-inference.md). Queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to one hour), and near real-time latency requirements.
+ **Cost optimization.** To optimize your inference costs, consider the following options:
  + [Model performance optimization with SageMaker Neo](neo.md). Use SageMaker Neo to optimize and run your machine learning models with better performance and efficiency, helping you minimize compute costs by automatically optimizing models to run on hardware such as AWS Inferentia chips.
  + [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md). Use autoscaling to dynamically adjust the compute resources for your endpoints based on incoming traffic patterns, which helps you optimize costs by only paying for the resources you're using at a given time.

# Model deployment options in Amazon SageMaker AI
<a name="how-it-works-deployment"></a>

After you train your machine learning model, you can deploy it using Amazon SageMaker AI to get predictions. Amazon SageMaker AI supports the following ways to deploy a model, depending on your use case:
+ For persistent, real-time endpoints that make one prediction at a time, use SageMaker AI real-time hosting services. See [Real-time inference](realtime-endpoints.md).
+ For workloads that have idle periods between traffic spikes and can tolerate cold starts, use Serverless Inference. See [Deploy models with Amazon SageMaker Serverless Inference](serverless-endpoints.md).
+ For requests with large payload sizes up to 1 GB, long processing times, and near real-time latency requirements, use Amazon SageMaker Asynchronous Inference. See [Asynchronous inference](async-inference.md).
+ To get predictions for an entire dataset, use SageMaker AI batch transform. See [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

SageMaker AI also provides features to manage resources and optimize inference performance when deploying machine learning models:
+ To manage models on edge devices so that you can optimize, secure, monitor, and maintain machine learning models on fleets of edge devices, see [Model deployment at the edge with SageMaker Edge Manager](edge.md). This applies to edge devices like smart cameras, robots, personal computers, and mobile devices.
+ To optimize Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, and ONNX models for inference on Android, Linux, and Windows machines based on processors from Ambarella, ARM, Intel, Nvidia, NXP, Qualcomm, Texas Instruments, and Xilinx, see [Model performance optimization with SageMaker Neo](neo.md).

For more information about all deployment options, see [Deploy models for inference](deploy-model.md).

# Understand the options for deploying models and getting inferences in Amazon SageMaker AI
<a name="deploy-model-get-started"></a>

To help you get started with SageMaker AI Inference, see the following sections which explain your options for deploying your model in SageMaker AI and getting inferences. The [Inference options in Amazon SageMaker AI](deploy-model-options.md) section can help you determine which feature best fits your use case for inference.

You can refer to the [Resources](inference-resources.md) section for more troubleshooting and reference information, blogs and examples to help you get started, and common FAQs.

**Topics**
+ [Before you begin](#deploy-model-prereqs)
+ [Steps for model deployment](#deploy-model-steps)
+ [Inference options in Amazon SageMaker AI](deploy-model-options.md)
+ [Advanced endpoint options for inference with Amazon SageMaker AI](deploy-model-advanced.md)
+ [Next steps for inference with Amazon SageMaker AI](deploy-model-next-steps.md)

## Before you begin
<a name="deploy-model-prereqs"></a>

These topics assume that you have built and trained one or more machine learning models and are ready to deploy them. You don't need to train your model in SageMaker AI in order to deploy your model in SageMaker AI and get inferences. If you don't have your own model, you can also use SageMaker AI’s [built-in algorithms or pre-trained models](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).

If you are new to SageMaker AI and haven't picked out a model to deploy, work through the steps in the [Get Started with Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html) tutorial. Use the tutorial to get familiar with how SageMaker AI manages the data science process and how it handles model deployment. For more information about training a model, see [Train Models](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html).

For additional information, reference, and examples, see the [Resources](inference-resources.md) section.

## Steps for model deployment
<a name="deploy-model-steps"></a>

For inference endpoints, the general workflow consists of the following:
+ Create a model in SageMaker AI Inference by pointing to model artifacts stored in Amazon S3 and a container image.
+ Select an inference option. For more information, see [Inference options in Amazon SageMaker AI](deploy-model-options.md).
+ Create a SageMaker AI Inference endpoint configuration by choosing the instance type and number of instances you need behind the endpoint. You can use [Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html) to get recommendations for instance types. For Serverless Inference, you only need to provide the memory configuration you need based on your model size. 
+ Create a SageMaker AI Inference endpoint.
+ Invoke your endpoint to receive an inference as a response.

The following diagram shows the preceding workflow.

![\[The workflow described in the preceding paragraph showing how to get inferences from SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference-workflow-flowchart.png)


You can perform these actions using the AWS Management Console, the AWS SDKs, the SageMaker Python SDK, CloudFormation, or the AWS CLI.
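
As an illustration, the real-time endpoint portion of this workflow can be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch, not a complete deployment script; the resource names, image URI, and instance type are placeholder assumptions, and the client is passed in so the flow can be exercised without an AWS account:

```python
def deploy_realtime_endpoint(sm_client, name, image_uri, model_data_url,
                             role_arn, instance_type="ml.m5.xlarge"):
    """Sketch: create a model, an endpoint configuration, and an endpoint.

    In real use, sm_client = boto3.client("sagemaker"); all names here are
    illustrative placeholders.
    """
    # Step 1: create the model from artifacts in Amazon S3 and a container image.
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # Step 2: create the endpoint configuration (instance type and count).
    sm_client.create_endpoint_config(
        EndpointConfigName=f"{name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    # Step 3: create the endpoint itself.
    sm_client.create_endpoint(
        EndpointName=f"{name}-endpoint",
        EndpointConfigName=f"{name}-config",
    )
    return f"{name}-endpoint"
```

After the endpoint reaches the `InService` status, you would call `invoke_endpoint` on the `sagemaker-runtime` client to receive inferences.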

For batch inference with batch transform, point to your model artifacts and input data and create a batch inference job. Instead of hosting an endpoint for inference, SageMaker AI outputs your inferences to an Amazon S3 location of your choice.
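
The batch transform path can be sketched the same way. The following is a hedged example with placeholder job names and S3 URIs; the client is passed in rather than created so the sketch stays self-contained:

```python
def run_batch_transform(sm_client, job_name, model_name, s3_input, s3_output,
                        instance_type="ml.m5.xlarge"):
    """Sketch: batch inference with a transform job instead of an endpoint.

    sm_client would be boto3.client("sagemaker") in real use; the names and
    S3 URIs here are illustrative placeholders.
    """
    sm_client.create_transform_job(
        TransformJobName=job_name,
        ModelName=model_name,
        # The input dataset to run inference on.
        TransformInput={
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_input}
            }
        },
        # SageMaker AI writes the inferences here instead of serving them.
        TransformOutput={"S3OutputPath": s3_output},
        TransformResources={"InstanceType": instance_type, "InstanceCount": 1},
    )
```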

# Inference options in Amazon SageMaker AI
<a name="deploy-model-options"></a>

SageMaker AI provides multiple inference options so that you can pick the option that best suits your workload:
+ [Real-Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html): *Real-time inference* is ideal for online inferences that have low latency or high throughput requirements. Use real-time inference for a persistent and fully managed endpoint (REST API) that can handle sustained traffic, backed by the instance type of your choice. Real-time inference can support payload sizes up to 25 MB and processing times of 60 seconds for regular responses and 8 minutes for streaming responses.
+ [Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html): *Serverless inference* is ideal when you have intermittent or unpredictable traffic patterns. SageMaker AI manages all of the underlying infrastructure, so there’s no need to manage instances or scaling policies. You pay only for what you use and not for idle time. It can support payload sizes up to 4 MB and processing times up to 60 seconds.
+ [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html): *Batch transform* is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint. You can also use batch transform for pre-processing datasets. It can support large datasets that are GBs in size and processing times of days.
+ [Asynchronous Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html): *Asynchronous inference* is ideal when you want to queue requests and have large payloads with long processing times. Asynchronous Inference can support payloads up to 1 GB and long processing times up to one hour. You can also scale down your endpoint to 0 when there are no requests to process.
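
As a rough rule of thumb, the limits above can be encoded in a small helper. This is a toy illustration of the decision logic, not an official sizing tool; always check the current service quotas for your Region:

```python
def suggest_inference_option(payload_mb, max_seconds, intermittent_traffic=False,
                             offline_dataset=False):
    """Toy helper mapping the limits above to an inference option."""
    if offline_dataset:
        return "batch-transform"   # entire dataset upfront, no persistent endpoint
    if payload_mb <= 4 and max_seconds <= 60 and intermittent_traffic:
        return "serverless"        # up to 4 MB / 60 s, pay only for what you use
    if payload_mb <= 25 and max_seconds <= 60:
        return "real-time"         # up to 25 MB / 60 s, low latency
    if payload_mb <= 1024 and max_seconds <= 3600:
        return "asynchronous"      # up to 1 GB / 1 hour, queued requests
    return "batch-transform"
```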

# Advanced endpoint options for inference with Amazon SageMaker AI
<a name="deploy-model-advanced"></a>

With real-time inference, you can further optimize for performance and cost with the following advanced inference options:
+ [Multi-model endpoints](multi-model-endpoints.md) – Use this option if you have multiple models that use the same framework and can share a container. This option helps you optimize costs by improving endpoint utilization and reducing deployment overhead.
+ [Multi-container endpoints](multi-container-endpoints.md) – Use this option if you have multiple models that use different frameworks and require their own containers. You get many of the benefits of multi-model endpoints and can deploy a variety of frameworks and models.
+ [Serial Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) – Use this option if you want to host models with pre-processing and post-processing logic behind an endpoint. Inference pipelines are fully managed by SageMaker AI and provide lower latency because all of the containers are hosted on the same Amazon EC2 instances.

# Next steps for inference with Amazon SageMaker AI
<a name="deploy-model-next-steps"></a>

After you have an endpoint and understand the general inference workflow, you can use the following features in SageMaker AI to improve your inference workflow.

## Monitoring
<a name="deploy-model-next-steps-monitoring"></a>

To track your model over time through metrics such as model accuracy and drift, you can use Model Monitor. With Model Monitor, you can set alerts that notify you when there are deviations in your model’s quality. To learn more, see the [Model Monitor documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html). 

To learn more about tools that can be used to monitor model deployments and events that change your endpoint, see [Monitor Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-overview.html). For example, you can monitor your endpoint’s health through metrics such as invocation errors and model latency using Amazon CloudWatch metrics. The [SageMaker AI endpoint invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation) can provide you with valuable information about your endpoint’s performance.
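
For example, fetching a single endpoint metric from CloudWatch can be sketched as follows. The time window, period, and variant name are illustrative assumptions; the client is passed in so the sketch runs without AWS credentials:

```python
from datetime import datetime, timedelta, timezone

def get_model_latency(cw_client, endpoint_name, variant_name="AllTraffic"):
    """Sketch: fetch average ModelLatency for an endpoint over the last hour.

    cw_client would be boto3.client("cloudwatch") in real use.
    """
    now = datetime.now(timezone.utc)
    return cw_client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",  # reported in microseconds
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
```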

## CI/CD for model deployment
<a name="deploy-model-next-steps-cicd"></a>

To put together machine learning solutions in SageMaker AI, you can use [SageMaker AI MLOps](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects.html). You can use this feature to automate the steps in your machine learning workflow and practice CI/CD. You can use [MLOps Project Templates](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates.html) to help with the setup and implementation of SageMaker AI MLOps projects. SageMaker AI also supports using your own [third-party Git repo](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough-3rdgit.html) for creating a CI/CD system.

For your ML pipelines, use [Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) to manage your model versions and the deployment and automation of your models.

## Deployment guardrails
<a name="deploy-model-next-steps-guardrails"></a>

If you want to update your model while it’s in production without impacting production, you can use deployment guardrails. Deployment guardrails are a set of model deployment options in SageMaker AI Inference to update your machine learning models in production. Using the fully managed deployment options, you can control the switch from the current model in production to a new one. Traffic shifting modes give you granular control over the traffic shifting process, and built-in safeguards like auto-rollbacks help you catch issues early on. 

To learn more about deployment guardrails, see the [deployment guardrails documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails.html).

## Inferentia
<a name="deploy-model-next-steps-inferentia"></a>

If you need to run large-scale machine learning and deep learning applications, you can use an `Inf1` instance with a real-time endpoint. This instance type is suitable for use cases such as image or speech recognition, natural language processing (NLP), personalization, forecasting, or fraud detection.

`Inf1` instances are built to support machine learning inference applications and feature the AWS Inferentia chips. `Inf1` instances provide higher throughput and lower cost per inference than GPU-based instances.

To deploy a model on `Inf1` instances, compile your model with SageMaker Neo and choose an `Inf1` instance for your deployment option. To learn more, see [Optimize model performance using SageMaker Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html).

## Optimize model performance
<a name="deploy-model-next-steps-optimize"></a>

SageMaker AI provides features to manage resources and optimize inference performance when deploying machine learning models. You can use SageMaker AI’s [built-in algorithms and pre-built models](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), as well as [prebuilt Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html), which are developed for machine learning.

To train models and optimize them for deployment, see [Optimize model performance using SageMaker Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html). With SageMaker Neo, you can train TensorFlow, Apache MXNet, PyTorch, ONNX, and XGBoost models. Then, you can optimize them and deploy them on ARM, Intel, and Nvidia processors.

## Autoscaling
<a name="deploy-model-next-steps-autoscaling"></a>

If you have varying amounts of traffic to your endpoints, you might want to try autoscaling. For example, during peak hours, you might require more instances to process requests. However, during periods of low traffic, you might want to reduce your use of computing resources. To dynamically adjust the number of instances provisioned in response to changes in your workload, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md).

If you have unpredictable traffic patterns or don’t want to set up scaling policies, you can also use Serverless Inference for an endpoint. Then, SageMaker AI manages autoscaling for you. During periods of low traffic, SageMaker AI scales down your endpoint, and if traffic increases, then SageMaker AI scales your endpoint up. For more information, see the [Deploy models with Amazon SageMaker Serverless Inference](serverless-endpoints.md) documentation.
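
Registering a target-tracking scaling policy for an endpoint variant can be sketched with the Application Auto Scaling API. The capacities and target invocation rate below are illustrative assumptions, and the client is passed in so the flow can be tested against a stub:

```python
def enable_autoscaling(aas_client, endpoint_name, variant_name="AllTraffic",
                       min_capacity=1, max_capacity=4, target_invocations=70.0):
    """Sketch: target-tracking autoscaling for a real-time endpoint variant.

    aas_client would be boto3.client("application-autoscaling") in real use.
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    # Register the variant's instance count as a scalable target.
    aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    # Scale to keep invocations-per-instance near the target value.
    aas_client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return resource_id
```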

# Create a model in Amazon SageMaker AI with ModelBuilder
<a name="how-it-works-modelbuilder-creation"></a>

Preparing your model for deployment on a SageMaker AI endpoint requires multiple steps, including choosing a model image, setting up the endpoint configuration, coding your serialization and deserialization functions to transfer data between the server and client, identifying model dependencies, and uploading them to Amazon S3. `ModelBuilder` can reduce the complexity of initial setup and deployment to help you create a deployable model in a single step.

`ModelBuilder` performs the following tasks for you: 
+ Converts machine learning models trained using various frameworks like XGBoost or PyTorch into deployable models in one step.
+ Performs automatic container selection based on the model framework so you don’t have to manually specify your container. You can still bring your own container by passing your own URI to `ModelBuilder`.
+ Handles the serialization of data on the client side before sending it to the server for inference and deserialization of the results returned by the server. Data is correctly formatted without manual processing.
+ Enables automatic capture of dependencies and packages the model according to model server expectations. `ModelBuilder`'s automatic capture of dependencies is a best-effort approach to dynamically load dependencies. (We recommend that you test the automated capture locally and update the dependencies to meet your needs.)
+ For large language model (LLM) use cases, optionally performs local parameter tuning of serving properties that can be deployed for better performance when hosting on a SageMaker AI endpoint.
+ Supports most popular model servers and containers, such as TorchServe, Triton, DJLServing, and the TGI container.

## Build your model with ModelBuilder
<a name="how-it-works-modelbuilder-creation-mb"></a>

`ModelBuilder` is a Python class that takes a framework model, such as XGBoost or PyTorch, or a user-specified inference specification and converts it into a deployable model. `ModelBuilder` provides a build function that generates the artifacts for deployment. The model artifact generated is specific to the model server, which you can also specify as one of the inputs. For more details about the `ModelBuilder` class, see [ModelBuilder](https://sagemaker.readthedocs.io/en/stable/api/inference/model_builder.html#sagemaker.serve.builder.model_builder.ModelBuilder).

The following diagram illustrates the overall model creation workflow when you use `ModelBuilder`. `ModelBuilder` accepts a model or inference specification along with your schema to create a deployable model that you can test locally before deployment.

![\[Model creation and deployment flow using ModelBuilder.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-builder-flow.png)


`ModelBuilder` can handle any customization you want to apply. However, to deploy a framework model, the model builder expects at minimum a model, sample input and output, and the role. In the following code example, `ModelBuilder` is called with a framework model and an instance of `SchemaBuilder` with minimum arguments (to infer the corresponding functions for serializing and deserializing the endpoint input and output). No container is specified and no packaged dependencies are passed—SageMaker AI automatically infers these resources when you build your model. 

```
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

model_builder = ModelBuilder(
    model=model,
    schema_builder=SchemaBuilder(input, output),
    role_arn="execution-role",
)
```

The following code sample invokes `ModelBuilder` with an inference specification (as an `InferenceSpec` instance) instead of a model, with additional customization. In this case, the call to model builder includes a path to store model artifacts and also turns on autocapture of all available dependencies. For additional details about `InferenceSpec`, see [Customize model loading and handling of requests](#how-it-works-modelbuilder-creation-is).

```
from sagemaker.serve.mode.function_pointers import Mode

model_builder = ModelBuilder(
    mode=Mode.LOCAL_CONTAINER,
    model_path="model-artifact-directory",   # replace with your model artifact path
    inference_spec=your_inference_spec,      # your InferenceSpec instance
    schema_builder=SchemaBuilder(input, output),
    role_arn="execution-role",
    dependencies={"auto": True},
)
```

## Define serialization and deserialization methods
<a name="how-it-works-modelbuilder-creation-sb"></a>

When invoking a SageMaker AI endpoint, the data is sent through HTTP payloads with different MIME types. For example, an image sent to the endpoint for inference needs to be converted to bytes at the client side and sent through an HTTP payload to the endpoint. When the endpoint receives the payload, it needs to deserialize the byte string back to the data type that is expected by the model (also known as server-side deserialization). After the model finishes prediction, the results also need to be serialized to bytes that can be sent back through the HTTP payload to the user or the client. Once the client receives the response byte data, it needs to perform client-side deserialization to convert the bytes data back to the expected data format, such as JSON. At minimum, you need to convert data for the following tasks:

1. Inference request serialization (handled by the client)

1. Inference request deserialization (handled by the server or algorithm)

1. Invoking the model against the payload and sending the response payload back

1. Inference response serialization (handled by the server or algorithm)

1. Inference response deserialization (handled by the client)

The following diagram shows the serialization and deserialization processes that occur when you invoke the endpoint.

![\[Diagram of client to server data serialization and deserialization.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-builder-serialization.png)


When you supply sample input and output to `SchemaBuilder`, the schema builder generates the corresponding marshalling functions for serializing and deserializing the input and output. You can further customize your serialization functions with `CustomPayloadTranslator`. But for most cases, a simple serializer such as the following would work:

```
input = "How is the demo going?"
output = "Comment la démo va-t-elle?"
schema = SchemaBuilder(input, output)
```

For further details about `SchemaBuilder`, see [SchemaBuilder](https://sagemaker.readthedocs.io/en/stable/api/inference/model_builder.html#sagemaker.serve.builder.schema_builder.SchemaBuilder).

The following code snippet outlines an example where you want to customize both serialization and deserialization functions at the client and server sides. You can define your own request and response translators with `CustomPayloadTranslator` and pass these translators to `SchemaBuilder`.

By including the inputs and outputs with the translators, the model builder can extract the data format the model expects. For example, suppose the sample input is a raw image, and your custom translators crop the image and send the cropped image to the server as a tensor. `ModelBuilder` needs both the raw input and any custom preprocessing or postprocessing code to derive a method to convert data on both the client and server sides.

```
from sagemaker.serve import CustomPayloadTranslator

# request translator
class MyRequestTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the client side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # Convert the input payload to bytes and return them
        ...

    # This function converts the bytes to payload - happens on the server side
    def deserialize_payload_from_stream(self, stream) -> object:
        # Convert the bytes to an in-memory object and return it
        ...

# response translator
class MyResponseTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the server side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # Convert the response payload to bytes and return them
        ...

    # This function converts the bytes to payload - happens on the client side
    def deserialize_payload_from_stream(self, stream) -> object:
        # Convert the bytes to an in-memory object and return it
        ...
```

You pass in the sample input and output along with the previously defined custom translators when you create the `SchemaBuilder` object, as shown in the following example:

```
my_schema = SchemaBuilder(
    sample_input=image,
    sample_output=output,
    input_translator=MyRequestTranslator(),
    output_translator=MyResponseTranslator()
)
```


The following sections explain in detail how to build your model with `ModelBuilder` and use its supporting classes to customize the experience for your use case.

**Topics**
+ [Build your model with ModelBuilder](#how-it-works-modelbuilder-creation-mb)
+ [Define serialization and deserialization methods](#how-it-works-modelbuilder-creation-sb)
+ [Customize model loading and handling of requests](#how-it-works-modelbuilder-creation-is)
+ [Build your model and deploy](#how-it-works-modelbuilder-creation-deploy)
+ [Bring your own container (BYOC)](#how-it-works-modelbuilder-creation-mb-byoc)
+ [Using ModelBuilder in local mode](#how-it-works-modelbuilder-creation-local)
+ [ModelBuilder examples](#how-it-works-modelbuilder-creation-example)

## Customize model loading and handling of requests
<a name="how-it-works-modelbuilder-creation-is"></a>

Providing your own inference code through `InferenceSpec` offers an additional layer of customization. With `InferenceSpec`, you can customize how the model is loaded and how it handles incoming inference requests, bypassing the default loading and inference handling mechanisms. This flexibility is particularly beneficial when working with non-standard models or custom inference pipelines. You can customize the `invoke` method to control how the model preprocesses and postprocesses incoming requests. The following example uses `InferenceSpec` to generate a model with a HuggingFace pipeline. For further details about `InferenceSpec`, see [InferenceSpec](https://sagemaker.readthedocs.io/en/stable/api/inference/model_builder.html#sagemaker.serve.spec.inference_spec.InferenceSpec).

```
from sagemaker.serve.spec.inference_spec import InferenceSpec
from transformers import pipeline

class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        return pipeline("translation_en_to_fr", model="t5-small")

    def invoke(self, input, model):
        return model(input)

inf_spec = MyInferenceSpec()

model_builder = ModelBuilder(
    inference_spec=inf_spec,
    schema_builder=SchemaBuilder(X_test, y_pred)
)
```

The following example illustrates a more customized variation of the previous example. A model is defined with an inference specification that has dependencies. In this case, the code in the inference specification depends on the *lang-segment* package. The `dependencies` argument contains a statement that directs the builder to install *lang-segment* using Git. Because the user directs the model builder to install a custom dependency, the `auto` key is set to `False` to turn off autocapture of dependencies.

```
model_builder = ModelBuilder(
    mode=Mode.LOCAL_CONTAINER,
    model_path=model-artifact-directory,
    inference_spec=your-inference-spec,
    schema_builder=SchemaBuilder(input, output),
    role_arn=execution-role,
    dependencies={"auto": False, "custom": ["-e git+https://github.com/luca-medeiros/lang-segment-anything.git#egg=lang-sam"],}
)
```

## Build your model and deploy
<a name="how-it-works-modelbuilder-creation-deploy"></a>

Call the `build` function to create your deployable model. This step generates the inference code (as `inference.py`) in your working directory, including the code necessary to create your schema, run serialization and deserialization of inputs and outputs, and run other user-specified custom logic.

As an integrity check, SageMaker AI packages and pickles the necessary files for deployment as part of the `ModelBuilder` build function. During this process, SageMaker AI also creates an HMAC signature for the pickle file and adds the secret key in the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API as an environment variable during `deploy` (or `create`). The endpoint launch uses the environment variable to validate the integrity of the pickle file.

```
# Build the model according to the model server specification and save it as files in the working directory
model = model_builder.build()
```

Deploy your model with the model's `deploy` method. In this step, SageMaker AI sets up an endpoint to host your model and starts making predictions on incoming requests. Although `ModelBuilder` infers the endpoint resources needed to deploy your model, you can override those estimates with your own parameter values. The following example directs SageMaker AI to deploy the model on a single `ml.c6i.xlarge` instance. As an added feature, a model constructed from `ModelBuilder` enables live logging during deployment.

```
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c6i.xlarge"
)
```

If you want more fine-grained control over the endpoint resources assigned to your model, you can use a `ResourceRequirements` object. With the `ResourceRequirements` object, you can request a minimum number of CPUs, accelerators, and copies of models you want to deploy. You can also request a minimum and maximum bound of memory (in MB). To use this feature, you need to specify your endpoint type as `EndpointType.INFERENCE_COMPONENT_BASED`. The following example requests four accelerators, a minimum memory size of 1024 MB, and one copy of your model to be deployed to an endpoint of type `EndpointType.INFERENCE_COMPONENT_BASED`.

```
resource_requirements = ResourceRequirements(
    requests={
        "num_accelerators": 4,
        "memory": 1024,
        "copies": 1,
    },
    limits={},
)
predictor = model.deploy(
    mode=Mode.SAGEMAKER_ENDPOINT,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resource_requirements,
    role="role"
)
```

## Bring your own container (BYOC)
<a name="how-it-works-modelbuilder-creation-mb-byoc"></a>

If you want to bring your own container (extended from a SageMaker AI container), you can also specify the image URI as shown in the following example. You also need to identify the model server that corresponds to the image for `ModelBuilder` to generate artifacts specific to the model server.

```
model_builder = ModelBuilder(
    model=model,
    model_server=ModelServer.TORCHSERVE,
    schema_builder=SchemaBuilder(X_test, y_pred),
    image_uri="123123123123.dkr.ecr.ap-southeast-2.amazonaws.com/byoc-image:xgb-1.7-1"
)
```

## Using ModelBuilder in local mode
<a name="how-it-works-modelbuilder-creation-local"></a>

You can deploy your model locally by using the `mode` argument to switch between local testing and deployment to an endpoint. You need to store the model artifacts in the working directory, as shown in the following snippet:

```
model = XGBClassifier()
model.fit(X_train, y_train)
model.save_model(model_dir + "/my_model.xgb")
```

Pass the model object and a `SchemaBuilder` instance, and set `mode` to `Mode.LOCAL_CONTAINER`. When you call the `build` function, `ModelBuilder` automatically identifies the supported framework container and scans for dependencies. The following example demonstrates model creation with an XGBoost model in local mode.

```
model_builder_local = ModelBuilder(
    model=model,
    schema_builder=SchemaBuilder(X_test, y_pred),
    role_arn=execution-role,
    mode=Mode.LOCAL_CONTAINER
)
xgb_local_builder = model_builder_local.build()
```

Call the `deploy` function to deploy locally, as shown in the following snippet. If you specify parameters for instance type or count, these arguments are ignored.

```
predictor_local = xgb_local_builder.deploy()
```

### Troubleshooting local mode
<a name="how-it-works-modelbuilder-creation-troubleshoot"></a>

Depending on your individual local setup, you may encounter difficulties running `ModelBuilder` smoothly in your environment. See the following list for some issues you may face and how to resolve them.
+ **Address already in use**: You may encounter an `Address already in use` error. In this case, a Docker container might be running on that port, or another process might be utilizing it. You can follow the approach outlined in the [Linux documentation](https://www.cyberciti.biz/faq/what-process-has-open-linux-port/) to identify the process, and then either gracefully redirect your local process from port 8080 to another port or clean up the Docker instance.
+ **IAM Permission Issue**: You might encounter a permission issue when trying to pull an Amazon ECR image or access Amazon S3. In this case, navigate to the execution role of the notebook or Studio Classic instance to verify the policy for `SageMakerFullAccess` or the respective API permissions.
+ **EBS volume capacity issue**: If you deploy a large language model (LLM), you might run out of space while running Docker in local mode or experience space limitations for the Docker cache. In this case, you can try to move your Docker volume to a filesystem that has enough space. To move your Docker volume, complete the following steps:

  1. Open a terminal and run `df` to display disk usage, as shown in the following output:

     ```
     (python3) sh-4.2$ df
     Filesystem     1K-blocks      Used Available Use% Mounted on
     devtmpfs       195928700         0 195928700   0% /dev
     tmpfs          195939296         0 195939296   0% /dev/shm
     tmpfs          195939296      1048 195938248   1% /run
     tmpfs          195939296         0 195939296   0% /sys/fs/cgroup
     /dev/nvme0n1p1 141545452 135242112   6303340  96% /
     tmpfs           39187860         0  39187860   0% /run/user/0
     /dev/nvme2n1   264055236  76594068 176644712  31% /home/ec2-user/SageMaker
     tmpfs           39187860         0  39187860   0% /run/user/1002
     tmpfs           39187860         0  39187860   0% /run/user/1001
     tmpfs           39187860         0  39187860   0% /run/user/1000
     ```

  1. Move the default Docker directory from `/dev/nvme0n1p1` to `/dev/nvme2n1` so you can fully utilize the 256 GB SageMaker AI volume. For more details, see documentation about how to [move your Docker directory](https://www.guguweb.com/2019/02/07/how-to-move-docker-data-directory-to-another-location-on-ubuntu/).

  1. Stop Docker with the following command:

     ```
     sudo service docker stop
     ```

  1. Add a `daemon.json` to `/etc/docker` or append the following JSON blob to the existing one.

     ```
     {
         "data-root": "/home/ec2-user/SageMaker/{created_docker_folder}"
     }
     ```

  1. Move the Docker directory in `/var/lib/docker` to `/home/ec2-user/SageMaker` with the following command:

     ```
     sudo rsync -aP /var/lib/docker/ /home/ec2-user/SageMaker/{created_docker_folder}
     ```

  1. Start Docker with the following command:

     ```
     sudo service docker start
     ```

  1. Clean up the trash folder with the following commands:

     ```
     cd /home/ec2-user/SageMaker/.Trash-1000/files
     sudo rm -r *
     ```

  1. If you are using a SageMaker notebook instance, you can follow the steps in the [Docker prep file](https://github.com/melanie531/amazon-sagemaker-pytorch-lightning-distributed-training/blob/main/prepare-docker.sh) to prepare Docker for local mode.

## ModelBuilder examples
<a name="how-it-works-modelbuilder-creation-example"></a>

For more examples of using `ModelBuilder` to build your models, see [ModelBuilder sample notebooks](https://github.com/aws-samples/sagemaker-hosting/blob/main/SageMaker-Model-Builder).

# Inference optimization for Amazon SageMaker AI models
<a name="model-optimize"></a>

With Amazon SageMaker AI, you can improve the performance of your generative AI models by applying inference optimization techniques. By optimizing your models, you can attain better cost-performance for your use case. When you optimize a model, you choose which of the supported optimization techniques to apply, including quantization, speculative decoding, and compilation. After your model is optimized, you can run an evaluation to see performance metrics for latency, throughput, and price.

For many models, SageMaker AI also provides several pre-optimized versions, each catering to different application needs for latency and throughput. For such models, you can deploy one of the optimized versions without first optimizing the model yourself.

## Optimization techniques
<a name="optimization-techniques"></a>

Amazon SageMaker AI supports the following optimization techniques.

### Compilation
<a name="compilation"></a>

*Compilation* optimizes the model for the best available performance on the chosen hardware type without a loss in accuracy. You can apply model compilation to optimize LLMs for accelerated hardware, such as GPU instances, AWS Trainium instances, or AWS Inferentia instances.

When you optimize a model with compilation, you benefit from ahead-of-time compilation. You reduce the model's deployment time and auto-scaling latency because the model weights don't require just-in-time compilation when the model deploys to a new instance.

If you choose to compile your model for a GPU instance, SageMaker AI uses the TensorRT-LLM library to run the compilation. If you choose to compile your model for an AWS Trainium or AWS Inferentia instance, SageMaker AI uses the AWS Neuron SDK to run the compilation.

### Quantization
<a name="quantization"></a>

*Quantization* is a technique to reduce the hardware requirements of a model by using a less precise data type for the weights and activations. After you optimize a model with quantization, you can host it on less expensive and more available GPUs. However, the quantized model might be less accurate than the source model that you optimized. 

The data formats that SageMaker AI supports for quantization vary from model to model. The supported formats include the following:
+ INT4-AWQ – A 4-bit data format. Activation-aware Weight Quantization (AWQ) is a quantization technique for LLMs that is efficient, accurate, low-bit, and weight-only.
+ FP8 – 8-bit Floating Point (FP8) is a low-precision format for floating point numbers. It balances memory efficiency and model accuracy by representing values with fewer bits than standard FP16 floating point format.
+ INT8-SmoothQuant – An 8-bit data format. SmoothQuant is a mixed-precision quantization method that scales activations and weights jointly by balancing their dynamic ranges.
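
To illustrate the general idea behind these formats, the following toy sketch (purely illustrative, not SageMaker AI code) quantizes floating-point weights to 8-bit integers with a simple symmetric scale, then dequantizes them to show the small precision loss:

```python
# Toy symmetric INT8 quantization: map floats into the range [-127, 127].
weights = [0.82, -1.35, 0.04, 2.51]
scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor

quantized = [round(w / scale) for w in weights]   # stored as 8-bit integers
dequantized = [q * scale for q in quantized]      # reconstructed at inference

# The round trip is close to, but not exactly, the original weights.
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
```

The reconstruction error is bounded by half the scale per weight, which is the accuracy trade-off that quantization makes in exchange for smaller, cheaper hardware requirements.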

### Speculative decoding
<a name="speculative-decoding"></a>

*Speculative decoding* is a technique to speed up the decoding process of LLMs. It optimizes models for latency without compromising the quality of the generated text.

This technique uses a smaller but faster model called the *draft* model. The draft model generates candidate tokens, which are then validated by the larger but slower *target* model. At each iteration, the draft model generates multiple candidate tokens. The target model verifies the tokens, and if it finds that a particular token is not acceptable, it rejects the token and regenerates it. As a result, the target model both verifies tokens and generates a small number of them itself.

The draft model is significantly faster than the target model. It generates all the tokens quickly and then sends batches of them to the target model for verification. The target model evaluates them all in parallel, which speeds up the final response.
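
As a conceptual illustration only (not the SageMaker AI implementation), the following toy sketch mimics one round of speculative decoding. The two lookup-table "models" are hypothetical stand-ins for real LLMs: the draft model proposes several tokens, and the target model accepts the longest verified prefix and supplies the first correction itself:

```python
# Toy models: each predicts the next token given the sequence so far.
def draft_model(seq):
    guesses = {0: "the", 1: "cat", 2: "sat", 3: "in"}
    return guesses[len(seq)]

def target_model(seq):
    truth = {0: "the", 1: "cat", 2: "sat", 3: "on", 4: "the"}
    return truth[len(seq)]

def speculate(seq, k=4):
    # The draft model quickly proposes k candidate tokens.
    candidates = []
    for _ in range(k):
        candidates.append(draft_model(seq + candidates))

    # The target model verifies the candidates (in practice, in parallel).
    accepted = []
    for token in candidates:
        if target_model(seq + accepted) == token:
            accepted.append(token)  # candidate verified
        else:
            # Reject the candidate and regenerate it with the target model.
            accepted.append(target_model(seq + accepted))
            break
    return accepted

tokens = speculate([])
```

Here the draft model's fourth guess ("in") is rejected, so the target model regenerates that one token ("on") while accepting the first three, yielding several tokens per target-model pass instead of one.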

SageMaker AI offers a pre-built draft model that you can use, so you don't have to build your own. If you prefer to use your own custom draft model, SageMaker AI also supports this option.

### Fast model loading
<a name="fast-model-loading"></a>

The *fast model loading* technique prepares an LLM so that SageMaker AI can load it onto an ML instance more quickly.

To prepare the model, SageMaker AI shards it in advance by dividing it into portions that can each reside on a separate GPU for distributed inference. Also, SageMaker AI stores the model weights in equal-sized chunks that SageMaker AI can load onto the instance concurrently.

When SageMaker AI loads the optimized model onto the instance, it streams the model weights directly from Amazon S3 onto the GPUs of the instance. By streaming the weights, SageMaker AI omits several time-consuming steps that are normally necessary. These steps include downloading the model artifacts from Amazon S3 to disk, loading the model artifacts onto the host memory, and sharding the model on the host before finally loading the shards onto the GPUs.

After you optimize your model for faster loading, you can deploy it more quickly to a SageMaker AI endpoint. Also, if you configure the endpoint to use auto scaling, it scales out more quickly to accommodate increases in traffic.

# Deploy a pre-optimized model
<a name="model-optimize-preoptimized"></a>

Some models in JumpStart are pre-optimized by SageMaker AI, which means that you can deploy optimized versions of these models without first creating an inference optimization job. 

For the list of models with pre-optimized options, see [Pre-optimized JumpStart models](#pre-optimized-js).

## Amazon SageMaker Studio
<a name="preoptimized-studio"></a>

Use the following procedure to deploy a pre-optimized JumpStart model using Amazon SageMaker Studio.

**To deploy a pre-optimized model**

1. In Studio, in the navigation menu on the left, choose **JumpStart**.

1. On the **All public models** page, choose one of the models that are pre-optimized.

1. On the model details page, choose **Deploy**.

1. On the deployment page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the **License agreement** section. If the terms are acceptable for your use case, select the checkbox for **I accept the EULA, and read the terms and conditions.**

   For more information, see [End-user license agreements](jumpstart-foundation-models-choose.md#jumpstart-foundation-models-choose-eula).

1. For **Endpoint name** and **Initial instance count**, accept the default values or set custom ones.

1. For **Instance type**, keep the default value. Otherwise, you can't deploy a pre-optimized configuration.

1. Under **Models**, expand the model configuration. Studio shows a table that provides the pre-optimized configurations that you can choose from. Each option has metrics for latency and throughput. Choose the option that best suits your application needs.

1. Choose **Deploy**.

## SageMaker AI Python SDK
<a name="preoptimized-sdk"></a>

You can deploy a pre-optimized model by using the SageMaker AI Python SDK in your project. First, you define a `Model` instance by using the `ModelBuilder` class. Then, you use the `set_deployment_config()` method to set the pre-optimized configuration that you want to deploy, and the `build()` method to build the model. Finally, you use the `deploy()` method to deploy it to an inference endpoint.

For more information about the classes and methods used in the following examples, see [APIs](https://sagemaker.readthedocs.io/en/stable/api/index.html) in the SageMaker AI Python SDK documentation.

**To set up your project**

1. In your application code, import the necessary libraries. The following example imports the SDK for Python (Boto3). It also imports the modules from the SageMaker AI Python SDK that you use to define and work with models:

   ```
   import boto3
   from sagemaker.serve.builder.model_builder import ModelBuilder
   from sagemaker.serve.builder.schema_builder import SchemaBuilder
   from sagemaker.session import Session
   ```

1. Initialize a SageMaker AI session. The following example uses the `Session()` class:

   ```
   sagemaker_session = Session()
   ```

**To define your model**

1. Create a `SchemaBuilder` instance, and provide input and output samples. You supply this instance to the `ModelBuilder` class when you define a model. With it, SageMaker AI automatically generates the marshalling functions for serializing and deserializing the input and output.

   For more information about using the `SchemaBuilder` and `ModelBuilder` classes, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md).

   The following example provides sample input and output strings to the `SchemaBuilder` class:

   ```
   response = "Jupiter is the largest planet in the solar system. It is the fifth planet from the sun."
   sample_input = {
       "inputs": "What is the largest planet in the solar system?",
       "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
   }
   sample_output = [{"generated_text": response}]
   schema_builder = SchemaBuilder(sample_input, sample_output)
   ```

1. Define your model to SageMaker AI. The following example sets the parameters to initialize a `ModelBuilder` instance:

   ```
   model_builder = ModelBuilder(
       model="jumpstart-model-id",
       schema_builder=schema_builder,
       sagemaker_session=sagemaker_session,
       role_arn=sagemaker_session.get_caller_identity_arn(),
   )
   ```

   This example uses a JumpStart model. Replace `jumpstart-model-id` with the ID of a JumpStart model, such as `meta-textgeneration-llama-3-70b`.

**To retrieve benchmark metrics**

1. To determine which pre-optimized configuration you want to deploy, look up the options that SageMaker AI provides. The following example displays them:

   ```
   model_builder.display_benchmark_metrics()
   ```

   The `display_benchmark_metrics()` method prints a table like the following:

   ```
   | Instance Type   | Config Name   |   Concurrent Users |   Latency, TTFT (P50 in sec) |   Throughput (P50 in tokens/sec/user) |
   |:----------------|:--------------|-------------------:|-----------------------------:|--------------------------------------:|
   | ml.g5.48xlarge  | lmi-optimized |                  1 |                         2.25 |                                 49.70 |
   | ml.g5.48xlarge  | lmi-optimized |                  2 |                         2.28 |                                 21.10 |
   | ml.g5.48xlarge  | lmi-optimized |                  4 |                         2.37 |                                 14.10 |
   . . .
   | ml.p4d.24xlarge | lmi-optimized |                  1 |                         0.10 |                                137.40 |
   | ml.p4d.24xlarge | lmi-optimized |                  2 |                         0.11 |                                109.20 |
   | ml.p4d.24xlarge | lmi-optimized |                  4 |                         0.13 |                                 85.00 |
   . . .
   ```

   In the first column, the table lists potential instance types that you can use to host your chosen JumpStart model. For each instance type, under `Config Name`, it lists the names of the pre-optimized configurations. The configurations that SageMaker AI provides are named `lmi-optimized`. For each instance type and configuration, the table provides benchmark metrics. These metrics indicate the throughput and latency that your model will support for different numbers of concurrent users.

1. Based on the benchmark metrics, pick the instance type and configuration name that best supports your performance needs. You will use these values when you create a deployment configuration.

**To deploy a pre-optimized model**

1. Create a deployment configuration. The following example uses a `ModelBuilder` instance. It passes an instance type and configuration name to the `set_deployment_config()` method:

   ```
   model_builder.set_deployment_config(
       config_name="config-name", 
       instance_type="instance-type",
   )
   ```

   Replace `config-name` with a configuration name from the table, such as `lmi-optimized`. Replace `instance-type` with an instance type from the table, such as `ml.p4d.24xlarge`.

1. Build your model. The following example uses the `.build()` method of the `ModelBuilder` instance:

   ```
   optimized_model = model_builder.build()
   ```

   The `.build()` method returns a deployable `Model` instance.

1. Deploy your model to an inference endpoint. The following example uses the `.deploy()` method of the `Model` instance:

   ```
   predictor = optimized_model.deploy(accept_eula=True)
   ```

   The `deploy()` method returns a `Predictor` instance, which you can use to send inference requests to the model.

**To test your model with an inference request**
+ After you deploy your model to an inference endpoint, test the model's predictions. The following example sends an inference request by using the `Predictor` instance:

  ```
  predictor.predict(sample_input)
  ```

  The model returns the generated text in a response like the following:

  ```
  {'generated_text': ' Jupiter is the largest planet in the solar system. It is the fifth planet from the sun. It is a gas giant with . . .'}
  ```

## Pre-optimized JumpStart models
<a name="pre-optimized-js"></a>

The following are the JumpStart models that have pre-optimized configurations.

**Meta**
+ Llama 3.1 70B Instruct
+ Llama 3.1 70B
+ Llama 3.1 405B Instruct FP8
+ Llama 3.1 405B FP8
+ Llama 3 8B Instruct
+ Llama 3 8B
+ Llama 3 70B Instruct
+ Llama 3 70B
+ Llama 2 70B Chat
+ Llama 2 7B Chat
+ Llama 2 13B Chat

**HuggingFace**
+ Mixtral 8x7B Instruct
+ Mixtral 8x7B
+ Mistral 7B Instruct
+ Mistral 7B

### Pre-compiled JumpStart models
<a name="pre-compiled"></a>

For some models and configurations, SageMaker AI provides models that are pre-compiled for specific AWS Inferentia and AWS Trainium instances. For these, if you create a compilation optimization job, and you choose ml.inf2.48xlarge or ml.trn1.32xlarge as the deployment instance type, SageMaker AI fetches the compiled artifacts. Because the job uses a model that’s already compiled, it completes quickly without running the compilation from scratch.

The following are the JumpStart models for which SageMaker AI has pre-compiled models:

**Meta**
+ Llama3 8B
+ Llama3 70B
+ Llama2 7B
+ Llama2 70B
+ Llama2 13B
+ Code Llama 7B
+ Code Llama 70B

**HuggingFace**
+ Mistral 7B

# Create an inference optimization job
<a name="model-optimize-create-job"></a>

You can create an inference optimization job by using Studio or the SageMaker AI Python SDK. The job optimizes your model by applying the techniques that you choose. For more information, see [Optimization techniques](model-optimize.md#optimization-techniques).

**Instance pricing for inference optimization jobs**  
When you create an inference optimization job that applies quantization or compilation, SageMaker AI chooses which instance type to use to run the job. You are charged based on the instance used.  
For the possible instance types and their pricing details, see the inference optimization pricing information on the [Amazon SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/) page.  
You incur no additional costs for jobs that apply speculative decoding.

For the supported models that you can optimize, see [Supported models reference](optimization-supported-models.md).

## Amazon SageMaker Studio
<a name="optimize-create-studio"></a>

Complete the following steps to create an inference optimization job in Studio.

**To begin creating an optimization job**

1. In SageMaker AI Studio, create an optimization job through any of the following paths:
   + To create a job for a JumpStart model, do the following:

     1. In the navigation menu, choose **JumpStart**.

     1. On the **All public models** page, choose a model provider, and then choose one of the models that supports optimization.

     1. On the model details page, choose **Optimize**. This button is enabled only for models that support optimization.

     1. On the **Create inference optimization job** page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the **License agreement** section. If the terms are acceptable for your use case, select the checkbox for **I accept the EULA, and read the terms and conditions.**
   + To create a job for a fine-tuned JumpStart model, do the following:

     1. In the navigation menu, under **Jobs**, choose **Training**.

     1. On the **Training Jobs** page, choose the name of a job that you used to fine-tune a JumpStart model. These jobs have the type **JumpStart training** in the **Job type** column.

     1. On the details page for the training job, choose **Optimize**.
   + To create a job for a custom model, do the following:

     1. In the navigation menu, under **Jobs**, choose **Inference optimization**.

     1. Choose **Create new job**.

     1. On the **Create inference optimization job** page, choose **Add model**.

     1. In the **Add model** window, choose **Custom Model**.

     1. Choose one of the following options:

        **Use your existing model** - Select this option to optimize a model that you've already created in SageMaker AI. For **Existing model name**, enter the name of your SageMaker AI model.

        **From S3** - Select this option to provide model artifacts from Amazon S3. For **S3 URI**, enter the URI for the location in Amazon S3 where you've stored your model artifacts.

     1. (Optional) For **Output model name**, you can enter a custom name for the optimized model that the job creates. If you don't provide a name, Studio automatically generates one based on your selection.

1. On the **Create inference optimization job** page, for **Job name**, you can accept the default name that SageMaker AI assigns. Or, to enter a custom job name, choose the **Job name** field, and choose **Enter job name**.

**To set the optimization configurations**

1. For **Deployment instance type**, choose the instance type that you want to optimize the model for.

   The instance type affects what optimization techniques you can choose. For most types that use GPU hardware, the supported techniques are **Quantization** and **Speculative decoding**. If you choose an instance that uses custom silicon, like the AWS Inferentia instance ml.inf2.8xlarge, the supported technique is **Compilation**, which you can use to compile the model for that specific hardware type.

1. Select one or more of the optimization techniques that Studio provides:
   + If you select **Quantization**, choose a data type for **Precision data type**. 
   + If you select **Speculative decoding**, choose one of the following options:
     + **Use SageMaker AI draft model** – Choose to use the draft model that SageMaker AI provides.
       **Note**  
       If you choose to use the SageMaker AI draft model, you must also enable network isolation. Studio provides this option under **Security**.
     + **Choose JumpStart draft model** – Choose to select a model from the JumpStart catalog to use as your draft model.
     + **Choose your own draft model** – Choose to use your own draft model, and provide the S3 URI that locates it.
   + If you choose **Fast model loading**, Studio shows the `OPTION_TENSOR_PARALLEL_DEGREE` environment variable. Use the **Value** field to set the degree of tensor parallelism. The value must evenly divide the number of GPUs in the instance you chose for **Deployment instance type**. For example, to shard your model while using an instance with 8 GPUs, use the values 2, 4, or 8.
   + If you set **Deployment instance type** to an AWS Inferentia or AWS Trainium instance, Studio might show that **Compilation** is the one supported option. In that case, Studio selects this option for you.

1. For **Output**, enter the URI of a location in Amazon S3. There, SageMaker AI stores the artifacts of the optimized model that your job creates.

1. (Optional) Expand **Advanced options** for more fine-grained control over settings such as the IAM role, VPC, and environment variables. For more information, see *Advanced options* below.

1. When you're finished configuring the job, choose **Create job**.

   Studio shows the job details page, which shows the job status and all of its settings.

### Advanced options
<a name="set-advanced-optimization-options"></a>

You can set the following advanced options when you create an inference optimization job.

Under **Configurations**, you can set the following options:

**Tensor parallel degree**  
A value for the degree of *tensor parallelism*. Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. The value must evenly divide the number of GPUs in your cluster.
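
To make the divisibility constraint concrete, the following hypothetical helper (not part of the SageMaker AI SDK) lists the degrees that are valid for a given GPU count:

```python
def valid_tensor_parallel_degrees(num_gpus):
    # A degree is valid only if it evenly divides the number of GPUs.
    return [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]

# An 8-GPU instance supports degrees 1, 2, 4, and 8.
degrees = valid_tensor_parallel_degrees(8)
```

For example, on an instance with 8 GPUs you could set the degree to 2, 4, or 8, but not to 3, because the model shards must map evenly onto the devices.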

**Maximum token length**  
The limit for the number of tokens to be generated by the model. Note that the model might not always generate the maximum number of tokens.

**Concurrency**  
The ability to run multiple instances of a model on the same underlying hardware. Use concurrency to serve predictions to multiple users and to maximize hardware utilization.

**Batch size**  
If your model does *batch inferencing*, use this option to control the size of the batches that your model processes.  
Batch inferencing generates model predictions on a batch of observations. It's a good option for large datasets or if you don't need an immediate response to an inference request. 

Under **Security**, you can set the following options:

**IAM Role**  
An IAM role that enables SageMaker AI to perform tasks on your behalf. During model optimization, SageMaker AI needs your permission to:  
+ Read input data from an S3 bucket
+ Write model artifacts to an S3 bucket
+ Write logs to Amazon CloudWatch Logs
+ Publish metrics to Amazon CloudWatch
You grant permissions for all of these tasks to an IAM role.  
For more information, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

**Encryption KMS key**  
A key in AWS Key Management Service (AWS KMS). SageMaker AI uses the key to encrypt the artifacts of the optimized model when SageMaker AI uploads the model to Amazon S3.

**VPC**  
SageMaker AI uses this information to create network interfaces and attach them to your model containers. The network interfaces provide your model containers with a network connection within your VPC that is not connected to the internet. They also enable your model to connect to resources in your private VPC.  
For more information, see [Give SageMaker AI Hosted Endpoints Access to Resources in Your Amazon VPC](host-vpc.md).

**Enable network isolation**  
Activate this option if you want to restrict your container's internet access. Containers that run with network isolation can’t make any outbound network calls.  
You must activate this option when you optimize with speculative decoding and you use the SageMaker AI draft model.  
For more information about network isolation, see [Network Isolation](mkt-algo-model-internet-free.md#mkt-algo-model-internet-free-isolation).

Under **Advanced container definition**, you can set the following options:

**Stopping condition**  
Specifies a limit to how long a job can run. When the job reaches the time limit, SageMaker AI ends the job. Use this option to cap costs.

**Tags**  
Key-value pairs associated with the optimization job.  
For more information about tags, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html) in the *AWS General Reference*.

**Environment variables**  
Key-value pairs that define the environment variables to set in the model container.

## SageMaker AI Python SDK
<a name="optimize-create-pysdk"></a>

You can create an inference optimization job by using the SageMaker AI Python SDK in your project. First, you define a `Model` instance by using the `ModelBuilder` class. Then, you use the `optimize()` method to run a job that optimizes your model with quantization, speculative decoding, or compilation. When the job completes, you deploy the model to an inference endpoint by using the `deploy()` method.

For more information about the classes and methods used in the following examples, see [APIs](https://sagemaker.readthedocs.io/en/stable/api/index.html) in the SageMaker AI Python SDK documentation.

**To set up your project**

1. In your application code, import the necessary libraries. The following example imports the SDK for Python (Boto3). It also imports the classes from the SageMaker AI Python SDK that you use to define and work with models:

   ```
   import boto3
   from sagemaker.serve.builder.model_builder import ModelBuilder
   from sagemaker.serve.builder.schema_builder import SchemaBuilder
   from sagemaker.session import Session
   from pathlib import Path
   ```

1. Initialize a SageMaker AI session. The following example uses the `Session()` class:

   ```
   sagemaker_session = Session()
   ```

**To define your model**

1. Create a `SchemaBuilder` instance, and provide input and output samples. You supply this instance to the `ModelBuilder` class when you define a model. With it, SageMaker AI automatically generates the marshalling functions for serializing and deserializing the input and output.

   For more information about using the `SchemaBuilder` and `ModelBuilder` classes, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md).

   The following example provides sample input and output strings to the `SchemaBuilder` class:

   ```
   response = "Jupiter is the largest planet in the solar system. It is the fifth planet from the sun."
   sample_input = {
       "inputs": "What is the largest planet in the solar system?",
       "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
   }
   sample_output = [{"generated_text": response}]
   schema_builder = SchemaBuilder(sample_input, sample_output)
   ```

1. Define your model to SageMaker AI. The following example sets the parameters to initialize a `ModelBuilder` instance:

   ```
   model_builder = ModelBuilder(
       model="jumpstart-model-id",
       schema_builder=schema_builder,
       sagemaker_session=sagemaker_session,
       role_arn=sagemaker_session.get_caller_identity_arn(),
   )
   ```

   This example uses a JumpStart model. Replace `jumpstart-model-id` with the ID of a JumpStart model, such as `meta-textgeneration-llama-3-70b`.
**Note**  
If you want to optimize with speculative decoding, and you want to use the SageMaker AI draft model, you must enable network isolation. To enable it, include the following argument when you initialize a `ModelBuilder` instance:  

   ```
   enable_network_isolation=True,
   ```
For more information about network isolation, see [Network Isolation](mkt-algo-model-internet-free.md#mkt-algo-model-internet-free-isolation).

**To optimize with quantization**

1. To run a quantization job, use the `optimize()` method, and set the `quantization_config` argument. The following example sets `OPTION_QUANTIZE` as an environment variable in the optimization container:

   ```
   optimized_model = model_builder.optimize(
       instance_type="instance-type",
       accept_eula=True,
       quantization_config={
           "OverrideEnvironment": {
               "OPTION_QUANTIZE": "awq",
           },
       },
       output_path="s3://output-path",
   )
   ```

   In this example, replace *`instance-type`* with an ML instance, such as `ml.p4d.24xlarge`. Replace *`s3://output-path`* with the path to the S3 location where you store the optimized model that the job creates.

   The `optimize()` method returns a `Model` object, which you can use to deploy your model to an endpoint.

1. When the job completes, deploy the model. The following example uses the `deploy()` method:

   ```
   predictor = optimized_model.deploy(
       instance_type="instance-type", 
       accept_eula=True,
   )
   ```

   In this example, replace *`instance-type`* with an ML instance, such as `ml.p4d.24xlarge`. 

   The `deploy()` method returns a predictor object, which you can use to send inference requests to the endpoint that hosts the model.

**To optimize with speculative decoding using the SageMaker AI draft model**

When you optimize your model with speculative decoding, you can choose to use a draft model that SageMaker AI provides, or you can use your own. The following examples use the SageMaker AI draft model.
**Prerequisite**  
To optimize with speculative decoding and the SageMaker AI draft model, you must enable network isolation when you define your model.

1. To run a speculative decoding job, use the `optimize()` method, and set the `speculative_decoding_config` argument. The following example sets the `ModelProvider` key to `SAGEMAKER` to use the draft model that SageMaker AI provides.

   ```
   optimized_model = model_builder.optimize(
       instance_type="instance-type",
       accept_eula=True,
       speculative_decoding_config={
           "ModelProvider": "SAGEMAKER",
       },
   )
   ```

   In this example, replace *`instance-type`* with an ML instance, such as `ml.p4d.24xlarge`.

   The `optimize()` method returns a `Model` object, which you can use to deploy your model to an endpoint.

1. When the job completes, deploy the model. The following example uses the `deploy()` method:

   ```
   predictor = optimized_model.deploy(accept_eula=True)
   ```

   The `deploy()` method returns a predictor object, which you can use to send inference requests to the endpoint that hosts the model.

**To optimize with speculative decoding using a custom draft model**

Before you can provide your custom draft model to SageMaker AI, you must first upload the model artifacts to Amazon S3.

The following examples demonstrate one possible way to provide a custom draft model. The examples download the draft model from the Hugging Face Hub, upload it to Amazon S3, and provide the S3 URI to the `speculative_decoding_config` argument.

1. If you want to download a model from the Hugging Face Hub, add the `huggingface_hub` library to your project, and download a model with the `snapshot_download()` method. The following example downloads a model to a local directory:

   ```
   import huggingface_hub
   
   huggingface_hub.snapshot_download(
       repo_id="model-id",
       revision="main",
       local_dir="download-dir",
       token="hf-access-token",
   )
   ```

   In this example, replace *`model-id`* with the ID of a model on the Hugging Face Hub, such as `meta-llama/Meta-Llama-3-8B`. Replace *`download-dir`* with a local directory. Replace *`hf-access-token`* with your user access token. To learn how to get your access token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) in the Hugging Face documentation.

   For more information about the `huggingface_hub` library, see [Hub client library](https://huggingface.co/docs/huggingface_hub/en/index) in the Hugging Face documentation.

1. To make your downloaded model available to SageMaker AI, upload it to Amazon S3. The following example uploads the model with the `sagemaker_session` object:

   ```
   custom_draft_model_uri = sagemaker_session.upload_data(
       path="download-dir",
       bucket=sagemaker_session.default_bucket(),
       key_prefix="prefix",
   )
   ```

   In this example, replace *`download-dir`* with the local directory where you downloaded the draft model in the previous step. Replace *`prefix`* with a qualifier that helps you distinguish the draft model in S3, such as `spec-dec-custom-draft-model`.

   The `upload_data()` method returns the S3 URI for the model artifacts.

1. To run a speculative decoding job, use the `optimize()` method, and set the `speculative_decoding_config` argument. The following example sets the `ModelSource` key to the S3 URI of the custom draft model:

   ```
   optimized_model = model_builder.optimize(
       instance_type="instance-type",
       accept_eula=True,
       speculative_decoding_config={
           "ModelSource": custom_draft_model_uri + "/",
       },
   )
   ```

   In this example, replace *`instance-type`* with an ML instance, such as `ml.p4d.24xlarge`.

   The `optimize()` method returns a `Model` object, which you can use to deploy your model to an endpoint.

1. When the job completes, deploy the model. The following example uses the `deploy()` method:

   ```
   predictor = optimized_model.deploy(accept_eula=True)
   ```

   The `deploy()` method returns a predictor object, which you can use to send inference requests to the endpoint that hosts the model.

**To optimize with compilation**

1. To run a compilation job, use the `optimize()` method, and set the `compilation_config` argument. The following example uses the `OverrideEnvironment` key to set the necessary environment variables in the optimization container:

   ```
   optimized_model = model_builder.optimize(
       instance_type="instance-type",
       accept_eula=True,
       compilation_config={
           "OverrideEnvironment": {
               "OPTION_TENSOR_PARALLEL_DEGREE": "24",
               "OPTION_N_POSITIONS": "8192",
               "OPTION_DTYPE": "fp16",
               "OPTION_ROLLING_BATCH": "auto",
               "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
               "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
           }
       },
       output_path="s3://output-path",
   )
   ```

   In this example, set *`instance-type`* to an ML instance type with accelerated hardware. For example, for accelerated inference with AWS Inferentia, you could set the type to an Inf2 instance, such as `ml.inf2.48xlarge`. Replace *`s3://output-path`* with the path to the S3 location where you store the optimized model that the job creates.

1. When the job completes, deploy the model. The following example uses the `deploy()` method:

   ```
   predictor = optimized_model.deploy(accept_eula=True)
   ```

   The `deploy()` method returns a predictor object, which you can use to send inference requests to the endpoint that hosts the model.

**To test your model with an inference request**
+ To send a test inference request to your deployed model, use the `predict()` method of a predictor object. The following example passes the `sample_input` variable that you also passed to the `SchemaBuilder` class when you defined your model:

  ```
  predictor.predict(sample_input)
  ```

  The sample input has the prompt, `"What is the largest planet in the solar system?"`. The `predict()` method returns the response that the model generated, as shown by the following example:

  ```
  {'generated_text': ' Jupiter is the largest planet in the solar system. It is the fifth planet from the sun. It is a gas giant with . . .'}
  ```

## AWS SDK for Python (Boto3)
<a name="optimize-create-pysdk-boto"></a>

You can use the AWS SDK for Python (Boto3) to programmatically create and manage inference optimization jobs. This section provides examples for different optimization techniques.

**Prerequisites**

Before you create an optimization job with Boto3, ensure that you have:
+ AWS credentials configured with the appropriate permissions
+ A SageMaker AI model (if you're optimizing an existing model)
+ Training data in Amazon S3 (for speculative decoding optimization; context lengths up to 4096 are supported)
+ An IAM execution role with permissions to access S3 and create SageMaker AI resources

**Example: Create an Optimization Job with EAGLE Speculative Decoding (Llama 3.3 70B)**

This example demonstrates creating an optimization job for a large language model using the EAGLE speculative decoding technique:

```
import boto3

# Initialize SageMaker client
sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')

# Step 1: Create a SageMaker model (if not already created)
model_response = sagemaker_client.create_model(
    ModelName='meta-llama-3-3-70b-instruct',
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    PrimaryContainer={
        'Image': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:<tag>',
        'ModelDataSource': {
            'S3DataSource': {
                'S3Uri': 's3://my-bucket/models/Llama-3.3-70B-Instruct/',
                'S3DataType': 'S3Prefix',
                'CompressionType': 'None'
            }
        },
        'Environment': {
            'SAGEMAKER_ENV': '1',
            'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'
        }
    }
)

# Step 2: Create optimization job with speculative decoding
optimization_response = sagemaker_client.create_optimization_job(
    OptimizationJobName='llama-optim-job-eagle-speculative-decoding',
    RoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    ModelSource={
        'SageMakerModel': {
            'ModelName': 'meta-llama-3-3-70b-instruct'
        }
    },
    DeploymentInstanceType='ml.p4d.24xlarge',
    # MaxInstanceCount specifies the maximum number of instances for distributed training
    MaxInstanceCount=4,
    OptimizationConfigs=[
        {
            'ModelSpeculativeDecodingConfig': {
                'Technique': 'EAGLE',
                'TrainingDataSource': {
                    'S3Uri': 's3://my-bucket/training_data/ultrachat_8k/',
                    'S3DataType': 'S3Prefix'
                }
            }
        }
    ],
    OutputConfig={
        'S3OutputLocation': 's3://my-bucket/optimized-models/llama-optim-output/',
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 432000  # 5 days
    }
)

print(f"Optimization job ARN: {optimization_response['OptimizationJobArn']}")
```

**Example: Create an Optimization Job from S3 Model Artifacts (Qwen3 32B)**

This example shows how to create an optimization job using model artifacts directly from S3:

```
import boto3

sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')

# Create model from S3 artifacts
model_response = sagemaker_client.create_model(
    ModelName='qwen3-32b',
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    PrimaryContainer={
        'Image': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:<tag>',
        'Mode': 'SingleModel',
        'ModelDataSource': {
            'S3DataSource': {
                'S3Uri': 's3://my-bucket/models/qwen3-32b/',
                'S3DataType': 'S3Prefix',
                'CompressionType': 'None'
            }
        },
        'Environment': {
            'AWS_REGION': 'us-west-2'
        }
    }
)

# Create optimization job with smaller training dataset
optimization_response = sagemaker_client.create_optimization_job(
    OptimizationJobName='qwen3-optim-job-eagle',
    RoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    ModelSource={
        'SageMakerModel': {
            'ModelName': 'qwen3-32b'
        }
    },
    DeploymentInstanceType='ml.g6.48xlarge',
    MaxInstanceCount=4,
    OptimizationConfigs=[
        {
            'ModelSpeculativeDecodingConfig': {
                'Technique': 'EAGLE',
                'TrainingDataSource': {
                    'S3Uri': 's3://my-bucket/training_data/ultrachat_1k/',
                    'S3DataType': 'S3Prefix'
                }
            }
        }
    ],
    OutputConfig={
        'S3OutputLocation': 's3://my-bucket/optimized-models/qwen3-optim-output/',
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 432000  # 5 days
    }
)

print(f"Optimization job ARN: {optimization_response['OptimizationJobArn']}")
```

**Example: Monitor and Manage Optimization Jobs**

After creating an optimization job, you can monitor its progress and manage it using these commands:

```
import boto3

sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')

# Describe optimization job to check status
describe_response = sagemaker_client.describe_optimization_job(
    OptimizationJobName='llama-optim-job-eagle-speculative-decoding'
)

print(f"Job Status: {describe_response['OptimizationJobStatus']}")

# List all optimization jobs (with pagination)
list_response = sagemaker_client.list_optimization_jobs(
    MaxResults=10,
    SortBy='CreationTime',
    SortOrder='Descending'
)

print("\nRecent optimization jobs:")
for job in list_response['OptimizationJobSummaries']:
    print(f"- {job['OptimizationJobName']}: {job['OptimizationJobStatus']}")

# Stop a running optimization job if needed
# sagemaker_client.stop_optimization_job(
#     OptimizationJobName='llama-optim-job-eagle-speculative-decoding'
# )

# Delete a completed or failed optimization job
# sagemaker_client.delete_optimization_job(
#     OptimizationJobName='llama-optim-job-eagle-speculative-decoding'
# )
```
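Rather than calling `describe_optimization_job` by hand, you can poll until the job reaches a terminal status. The following is a minimal sketch: the helper takes the Boto3 SageMaker client as a parameter, and the set of terminal status names is an assumption; check the `DescribeOptimizationJob` API reference for the authoritative list.

```
import time

# Terminal statuses for an optimization job (assumed; consult the
# DescribeOptimizationJob API reference for the authoritative list).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}

def wait_for_optimization_job(sagemaker_client, job_name, poll_seconds=60):
    """Poll describe_optimization_job until the job reaches a terminal status."""
    while True:
        response = sagemaker_client.describe_optimization_job(
            OptimizationJobName=job_name
        )
        status = response["OptimizationJobStatus"]
        print(f"Job {job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
```

For example, `wait_for_optimization_job(sagemaker_client, 'llama-optim-job-eagle-speculative-decoding')` blocks until the job finishes, fails, or is stopped.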

Speculative decoding with EAGLE heads runs four sequential training jobs. Each job produces output that becomes the input to the next. Only the output from the final job is delivered to your S3 bucket. The intermediate outputs are encrypted and stored in an internal SageMaker AI service bucket for up to 20 days. SageMaker AI does not have permissions to decrypt them. If you want the intermediate data removed before that time period, ensure that your job has completed or stopped, and then open a [support case](https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case) to have this data deleted. Include your AWS account ID and the optimization job ARN in the request.

## Limitations of the SageMaker AI draft model
<a name="sm-draft-model-limitations"></a>

For any model that you optimize with the SageMaker AI draft model, be aware of the requirements, restrictions, and supported environment variables.

**Requirements**

You must do the following:
+ Use a model that's provided by SageMaker JumpStart.
+ Enable network isolation for the model deployment.
+ If you deploy the model to a Large Model Inference (LMI) container, use a DJLServing container at version 0.28.0 or above.

  For the available containers, see [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) in the Deep Learning Containers GitHub repository.
+ If you fine-tune the JumpStart model, use the safetensors format for the model weights.

  For more information about this format, see [Safetensors](https://huggingface.co/docs/safetensors/en/index) in the Hugging Face documentation.

**Restrictions**

You can't do the following:
+ Use the model in local test environments that you create with local mode. 

  For more information about local mode, see [Local Mode](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode) in the SageMaker AI Python SDK documentation.
+ Access the model container through the AWS Systems Manager Agent (SSM Agent). The SSM Agent provides shell-level access to your model container so that you can debug processes and log commands with Amazon CloudWatch. 

  For more information about this feature, see [Access containers through SSM](ssm-access.md).
+ Configure the model container for a core dump that occurs if the process crashes. 

  For more information about core dumps from model containers, see [ProductionVariantCoreDumpConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariantCoreDumpConfig.html) in the *Amazon SageMaker API Reference*.
+ Deploy the model to multi-model endpoints, multi-container endpoints, or endpoints that host inference components. 

  For more information about these endpoint types, see [Multi-model endpoints](multi-model-endpoints.md), [Multi-container endpoints](multi-container-endpoints.md), and [Inference components](realtime-endpoints-deploy-models.md#inference-components).
+ Create a model package for the model. You use model packages to create deployable models that you publish on AWS Marketplace. 

  For more information about this feature, see [Create a Model Package Resource](sagemaker-mkt-create-model-package.md).
+ Use your own inference code in the model container.
+ Use a `requirements.txt` file in the model container. This type of file lists package dependencies.
+ Enable the Hugging Face parameter `trust_remote_code`.

**Supported environment variables**

You can configure the container only with the following environment variables:
+ Common environment variables for large model inference (LMI) containers. 

  For more information about these variables, see [Environment Variable Configurations](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#environment-variable-configurations) in the LMI container documentation.
+ Common environment variables for packages that the Hugging Face Hub provides in its Git repositories. 

  For the repositories, see [Hugging Face](https://github.com/huggingface) on GitHub.
+ Common PyTorch & CUDA environment variables. 

  For more information about these variables, see [Torch Environment Variables](https://pytorch.org/docs/stable/torch_environment_variables.html) in the PyTorch documentation.

# View the optimization job results
<a name="model-optimize-view-results"></a>

After you've created one or more optimization jobs, you can use Studio to view a summary table of all of your jobs, and you can view the details for any individual job.

## Amazon SageMaker Studio
<a name="optimization-results-studio"></a>

**To view the optimization job summary table**
+ In the Studio navigation menu, under **Jobs**, choose **Inference optimization**.

  The **Inference optimization** page shows a table that displays the jobs that you've created. For each job, it shows the optimization configurations that you applied and the job status.

**To view the details for a job**
+ On the **Inference optimization** page, in the summary table, choose the name of the job.

  Studio shows the job details page, which shows the job status and all of the settings that you applied when you created the job. If the job completed successfully, SageMaker AI stored the optimized model artifacts in the Amazon S3 location under **Optimized model S3 URI**.

# Evaluate the performance of optimized models
<a name="model-optimize-evaluate"></a>

After you use an optimization job to create an optimized model, you can run an evaluation of model performance. This evaluation yields metrics for latency, throughput, and price. Use these metrics to determine whether the optimized model meets the needs of your use case or whether it requires further optimization.

You can run performance evaluations only by using Studio. This feature is not provided through the Amazon SageMaker AI API or Python SDK.

## Before you begin
<a name="eval-prereqs"></a>

Before you can create a performance evaluation, you must first optimize a model by creating an inference optimization job. In Studio, you can evaluate only the models that you create with these jobs.

## Create the performance evaluation
<a name="create-perf-eval"></a>

Complete the following steps in Studio to create a performance evaluation for an optimized model.

1. In the Studio navigation menu, under **Jobs**, choose **Inference optimization**.

1. Choose the name of the job that created the optimized model that you want to evaluate.

1. On the job details page, choose **Evaluate performance**.

1. On the **Evaluate performance** page, some JumpStart models require you to sign an end-user license agreement (EULA) before you can proceed. If requested, review the license terms in the **License agreement** section. If the terms are acceptable for your use case, select the checkbox for **I accept the EULA, and read the terms and conditions.**

1. For **Select a model for tokenizer**, accept the default, or choose a specific model to act as the tokenizer for your evaluation.

1. For **Input datasets**, choose whether to: 
   + Use the default sample datasets from SageMaker AI.
   + Provide an S3 URI that points to your own sample datasets.

1. For **S3 URI for performance results**, provide a URI that points to the location in Amazon S3 where you want to store the evaluation results.

1. Choose **Evaluate**.

   Studio shows the **Performance evaluations** page, where your evaluation job is shown in the table. The **Status** column shows the status of your evaluation.

1. When the status is **Completed**, choose the name of the job to see the evaluation results.

The evaluation details page shows tables that provide the performance metrics for latency, throughput, and price. For more information about each metric, see the [Metrics reference for inference performance evaluations](#performance-eval-metrics-reference).

## Metrics reference for inference performance evaluations
<a name="performance-eval-metrics-reference"></a>

After you successfully evaluate the performance of an optimized model, the evaluation details page in Studio shows the following metrics.

### Latency metrics
<a name="latency-metrics"></a>

The **Latency** section shows the following metrics.

**Concurrency**  
The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

**Time to first token (ms)**  
The time that elapses between when a request is sent and when the first token of a streaming response is received.

**Inter-token latency (ms)**  
The time to generate an output token for each request.

**Client latency (ms)**  
The request latency from the time the request is sent to the time the entire response is received.

**Input tokens/sec (count)**  
The total number of input tokens, across all requests, divided by the total duration in seconds at the given concurrency.

**Output tokens/sec (count)**  
The total number of generated output tokens, across all requests, divided by the total duration in seconds at the given concurrency.

**Client invocations (count)**  
The total number of inference requests sent to the endpoint across all users at a concurrency.

**Client invocation errors (count)**  
The total number of inference requests sent to the endpoint across all users at a given concurrency that resulted in an invocation error.

**Tokenizer failed (count)**  
The total number of inference requests where the tokenizer failed to parse the request or the response.

**Empty inference response (count)**  
The total number of inference requests that resulted in zero output tokens or the tokenizer failing to parse the response.
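As a concrete illustration of the tokens-per-second metrics above, this is plain arithmetic, not SageMaker AI code: total tokens across all requests at a given concurrency, divided by the total duration.

```
# Illustrative arithmetic for the tokens/sec metrics: total tokens across all
# requests at a given concurrency, divided by the total duration in seconds.
def tokens_per_sec(total_tokens: int, duration_seconds: float) -> float:
    return total_tokens / duration_seconds

# For example, 120,000 output tokens generated over 60 seconds yields
# an output tokens/sec value of 2,000.
```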

### Throughput metrics
<a name="throughput-metrics"></a>

The **Throughput** section shows the following metrics.

**Concurrency**  
The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

**Input tokens/sec/req (count)**  
The total number of generated input tokens per second per request.

**Output tokens/sec/req (count)**  
The total number of generated output tokens per second per request.

**Input tokens (count)**  
The total number of generated input tokens per request.

**Output tokens (count)**  
The total number of generated output tokens per request.

### Price metrics
<a name="price-metrics"></a>

The **Price** section shows the following metrics.

**Concurrency**  
The number of concurrent users that the evaluation simulated to invoke the endpoint simultaneously.

**Price per million input tokens**  
Cost of processing 1M input tokens.

**Price per million output tokens**  
Cost of generating 1M output tokens.
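The per-million-token prices normalize a measured cost to a common unit. The following is illustrative arithmetic only; the evaluation computes these values for you.

```
# Illustrative arithmetic only: normalize a measured cost to a price per
# one million tokens. The performance evaluation reports these values for you.
def price_per_million_tokens(total_cost_usd: float, token_count: int) -> float:
    return total_cost_usd * 1_000_000 / token_count

# Processing 500,000 input tokens at a total cost of $2.00 corresponds to
# $4.00 per million input tokens.
```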

# Supported models reference
<a name="optimization-supported-models"></a>

The following tables show the models for which SageMaker AI supports inference optimization, and the optimization techniques that each model supports.


**Supported Llama models**  

| Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation | 
| --- | --- | --- | --- | --- | 
| Meta Llama 2 13B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 2 13B Chat |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 2 70B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 2 70B Chat |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 2 7B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 2 7B Chat |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 3 70B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 3 70B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 3 8B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Llama 3 8B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  AWS Neuron, TensorRT-LLM  | 
| Meta Code Llama 13B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 13B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 13B Python |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 34B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 34B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 34B Python |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 70B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 70B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 70B Python |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 7B |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 7B Instruct |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Code Llama 7B Python |  INT4-AWQ, INT8-SmoothQuant, FP8  | Yes | Yes |  TensorRT-LLM  | 
| Meta Llama 2 13B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 2 13B Chat Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 2 70B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 2 70B Chat Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 2 7B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 2 7B Chat Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3 70B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3 70B Instruct Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3 8B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3 8B Instruct Neuron | None | No | No |  AWS Neuron  | 
| Meta Code Llama 70B Neuron | None | No | No |  AWS Neuron  | 
| Meta Code Llama 7B Neuron | None | No | No |  AWS Neuron  | 
| Meta Code Llama 7B Python Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3.1 405B FP8 | None | Yes | Yes |  None  | 
| Meta Llama 3.1 405B Instruct FP8 | None | Yes | Yes |  None  | 
| Meta Llama 3.1 70B |  INT4-AWQ FP8  | Yes | Yes |  None  | 
| Meta Llama 3.1 70B Instruct |  INT4-AWQ FP8  | Yes | Yes |  None  | 
| Meta Llama 3.1 8B |  INT4-AWQ FP8  | Yes | Yes |  None  | 
| Meta Llama 3.1 8B Instruct |  INT4-AWQ FP8  | Yes | Yes |  None  | 
| Meta Llama 3.1 70B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3.1 70B Instruct Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3.1 8B Neuron | None | No | No |  AWS Neuron  | 
| Meta Llama 3.1 8B Instruct Neuron | None | No | No |  AWS Neuron  | 


**Supported Mistral models**  

| Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation | 
| --- | --- | --- | --- | --- | 
| Mistral 7B |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  AWS Neuron TensorRT-LLM  | 
| Mistral 7B Instruct |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  AWS Neuron TensorRT-LLM  | 
| Mistral 7B Neuron | None | No | No |  AWS Neuron  | 
| Mistral 7B Instruct Neuron | None | No | No |  AWS Neuron  | 


**Supported Mixtral models**  

| Model Name | Supported Data Formats for Quantization | Supports Speculative Decoding | Supports Fast Model Loading | Libraries Used for Compilation | 
| --- | --- | --- | --- | --- | 
| Mixtral-8x22B-Instruct-v0.1 |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  TensorRT-LLM  | 
| Mixtral-8x22B V1 |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  TensorRT-LLM  | 
| Mixtral 8x7B |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  TensorRT-LLM  | 
| Mixtral 8x7B Instruct |  INT4-AWQ INT8-SmoothQuant FP8  | Yes | Yes |  TensorRT-LLM  | 


**Supported Model Architectures and EAGLE Type**  

|  Model Architecture Name  |  EAGLE Type  | 
| --- | --- | 
|  LlamaForCausalLM  |  EAGLE 3  | 
|  Qwen3ForCausalLM  |  EAGLE 3  | 
|  Qwen3NextForCausalLM  |  EAGLE 2  | 
|  Qwen3MoeForCausalLM   |  EAGLE 3  | 
|  Qwen2ForCausalLM  |  EAGLE 3  | 
|  GptOssForCausalLM  |  EAGLE 3  | 

# Options for evaluating your machine learning model in Amazon SageMaker AI
<a name="how-it-works-model-validation"></a>

After training a model, evaluate it to determine whether its performance and accuracy enable you to achieve your business goals. You might generate multiple models using different methods and evaluate each. For example, you could apply different business rules for each model, and then apply various measures to determine each model's suitability. You might consider whether your model needs to be more sensitive than specific (or vice versa). 

You can evaluate your model using historical data (offline) or live data:
+ **Offline testing**—Use historical, not live, data to send requests to the model for inferences. 

  Deploy your trained model to an alpha endpoint, and use historical data to send inference requests to it. To send the requests, use a Jupyter notebook in your Amazon SageMaker AI notebook instance and either the AWS SDK for Python (Boto) or the high-level Python library provided by SageMaker AI.
+ **Online testing with live data**—SageMaker AI supports A/B testing for models in production by using production variants. Production variants are models that use the same inference code and are deployed on the same SageMaker AI endpoint. You configure the production variants so that a small portion of the live traffic goes to the model that you want to validate. For example, you might choose to send 10% of the traffic to a model variant for evaluation. After you are satisfied with the model's performance, you can route 100% traffic to the updated model. For an example of testing models in production, see [Testing models with production variants](model-ab-testing.md).
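
The traffic split between production variants is controlled by the `InitialVariantWeight` of each variant in the endpoint configuration. The following is a minimal sketch of such a configuration; the model names, variant names, and instance type are hypothetical placeholders, and the `create_endpoint_config` call is shown commented out because it requires an AWS account:

```python
# A minimal sketch of an endpoint config for A/B testing with production
# variants. Each variant's traffic share is its weight divided by the sum
# of all variant weights. Model names here are hypothetical placeholders.
production_variants = [
    {
        "VariantName": "current-model",
        "ModelName": "my-model-v1",       # hypothetical model name
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 9.0,      # receives 90% of traffic
    },
    {
        "VariantName": "candidate-model",
        "ModelName": "my-model-v2",       # hypothetical model name
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,      # receives 10% of traffic
    },
]

# Traffic share for each variant = weight / sum of weights
total_weight = sum(v["InitialVariantWeight"] for v in production_variants)
traffic_shares = {
    v["VariantName"]: v["InitialVariantWeight"] / total_weight
    for v in production_variants
}

# Commented out because it calls AWS:
# sagemaker_client.create_endpoint_config(
#     EndpointConfigName="ab-test-config",
#     ProductionVariants=production_variants,
# )
```

To shift all traffic to the new model later, you can update the variant weights with `UpdateEndpointWeightsAndCapacities` rather than recreating the endpoint.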

For more information, see articles and books about how to evaluate models, for example, [Evaluating Machine Learning Models](http://www.oreilly.com/data/free/evaluating-machine-learning-models.csp). 

Options for offline model evaluation include:
+ **Validating using a holdout set**—Machine learning practitioners often set aside a part of the data as a "holdout set." They don’t use this data for model training.

  With this approach, you evaluate how well your model provides inferences on the holdout set. You then assess how effectively the model generalizes what it learned in the initial training, as opposed to using model memory. This approach to validation gives you an idea of how often the model is able to infer the correct answer. 

   

  In some ways, this approach is similar to teaching elementary school students. First, you provide them with a set of examples to learn, and then test their ability to generalize from their learning. With homework and tests, you pose problems that were not included in the initial learning and determine whether they are able to generalize effectively. Students with perfect memories could memorize the problems, instead of learning the rules.

   

  Typically, the holdout dataset is 20-30% of the data.

   
+ **k-fold validation**—In this validation approach, you split the example dataset into *k* parts. You treat each part as a holdout set for *k* training runs, and use the other *k*-1 parts as the training set for that run. You produce *k* models using a similar process, and aggregate the models to generate your final model. The value of *k* is typically in the range of 5-10.
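
The k-fold splitting procedure can be sketched in plain Python (in practice, a library such as scikit-learn provides this as `KFold`):

```python
# A minimal sketch of k-fold splitting. Each of the k parts serves once as
# the holdout set while the remaining k-1 parts form that run's training set.
def k_fold_splits(examples, k):
    """Yield (train, holdout) pairs for k-fold validation."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, holdout

dataset = list(range(10))
splits = list(k_fold_splits(dataset, k=5))
# 5 runs; in each, 8 examples train the model and 2 are held out
```

Each example appears in exactly one holdout set across the *k* runs, so every data point contributes to evaluation exactly once.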

# Amazon SageMaker Inference Recommender
<a name="inference-recommender"></a>

Amazon SageMaker Inference Recommender is a capability of Amazon SageMaker AI. It reduces the time required to get machine learning (ML) models in production by automating load testing and model tuning across SageMaker AI ML instances. You can use Inference Recommender to deploy your model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost. Inference Recommender helps you select the best instance type and configuration for your ML models and workloads. It considers factors like instance count, container parameters, model optimizations, max concurrency, and memory size.

Amazon SageMaker Inference Recommender charges you only for the instances used while your jobs are running.

## How it Works
<a name="inference-recommender-how-it-works"></a>

To use Amazon SageMaker Inference Recommender, you can either [create a SageMaker AI model](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) or register a model to the SageMaker Model Registry with your model artifacts. Use the AWS SDK for Python (Boto3) or the SageMaker AI console to run benchmarking jobs for different SageMaker AI endpoint configurations. Inference Recommender jobs help you collect and visualize metrics across performance and resource utilization to help you decide on which endpoint type and configuration to choose.

## How to Get Started
<a name="inference-recommender-get-started"></a>

If you are a first-time user of Amazon SageMaker Inference Recommender, we recommend that you do the following:

1. Read through the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md) section to make sure you have satisfied the requirements to use Amazon SageMaker Inference Recommender.

1. Read through the [Recommendation jobs with Amazon SageMaker Inference Recommender](inference-recommender-recommendation-jobs.md) section to launch your first Inference Recommender recommendation jobs.

1. Explore the introductory Amazon SageMaker Inference Recommender [Jupyter notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-inference-recommender/inference-recommender.ipynb) example, or review the example notebooks in the following section.

## Example notebooks
<a name="inference-recommender-notebooks"></a>

The following example Jupyter notebooks can help you with the workflows for multiple use cases in Inference Recommender:
+ If you want an introductory notebook that benchmarks a TensorFlow model, see the [SageMaker Inference Recommender TensorFlow](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/inference-recommender.ipynb) notebook.
+ If you want to benchmark a HuggingFace model, see the [SageMaker Inference Recommender for HuggingFace](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/huggingface-inference-recommender/huggingface-inference-recommender.ipynb) notebook.
+ If you want to benchmark an XGBoost model, see the [SageMaker Inference Recommender XGBoost](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/xgboost/xgboost-inference-recommender.ipynb) notebook.
+ If you want to review CloudWatch metrics for your Inference Recommender jobs, see the [SageMaker Inference Recommender CloudWatch metrics](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/tensorflow-cloudwatch/tf-cloudwatch-inference-recommender.ipynb) notebook.

# Prerequisites for using Amazon SageMaker Inference Recommender
<a name="inference-recommender-prerequisites"></a>

Before you can use Amazon SageMaker Inference Recommender, you must complete the prerequisite steps. As an example, we show how to use a PyTorch (v1.7.1) ResNet-18 pre-trained model for both types of Amazon SageMaker Inference Recommender recommendation jobs. The examples shown use the AWS SDK for Python (Boto3).

**Note**  
The following code examples use Python. Remove the `!` prefix character if you run any of the following code samples in your terminal or AWS CLI.
You can run the following examples with the Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized) kernel in an Amazon SageMaker Studio notebook. For more information about Studio, see [Amazon SageMaker Studio](studio-updated.md).

1. **Create an IAM role for Amazon SageMaker AI.**

   Create an IAM role for Amazon SageMaker AI that has the `AmazonSageMakerFullAccess` IAM managed policy attached.

1. **Set up your environment.**

   Import dependencies and create variables for your AWS Region, your SageMaker AI IAM role (from Step 1), and the SageMaker AI client.

   ```
   !pip install --upgrade pip awscli botocore boto3  --quiet
   from sagemaker import get_execution_role, Session, image_uris
   import boto3
   
   region = boto3.Session().region_name
   role = get_execution_role()
   sagemaker_client = boto3.client("sagemaker", region_name=region)
   sagemaker_session = Session()
   ```

1. **(Optional) Review existing models benchmarked by Inference Recommender.**

   Inference Recommender benchmarks models from popular model zoos. Inference Recommender supports your model even if it is not already benchmarked.

   Use `ListModelMetaData` to get a response object that lists the domain, framework, task, and model name of machine learning models found in common model zoos.

   You use the domain, framework, framework version, task, and model name in later steps to both select an inference Docker image and register your model with SageMaker Model Registry. The following demonstrates how to list model metadata with SDK for Python (Boto3): 

   ```
   list_model_metadata_response = sagemaker_client.list_model_metadata()
   ```

   The output includes model summaries (`ModelMetadataSummaries`) and response metadata (`ResponseMetadata`) similar to the following example:

   ```
   {
       'ModelMetadataSummaries': [
           {
               'Domain': 'NATURAL_LANGUAGE_PROCESSING',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'bert-base-cased',
               'Task': 'FILL_MASK'
           },
           {
               'Domain': 'NATURAL_LANGUAGE_PROCESSING',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'bert-base-uncased',
               'Task': 'FILL_MASK'
           },
           {
               'Domain': 'COMPUTER_VISION',
               'Framework': 'MXNET:1.8.0',
               'Model': 'resnet18v2-gluon',
               'Task': 'IMAGE_CLASSIFICATION'
           },
           {
               'Domain': 'COMPUTER_VISION',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'resnet152',
               'Task': 'IMAGE_CLASSIFICATION'
           }
       ],
       'ResponseMetadata': {
           'HTTPHeaders': {
               'content-length': '2345',
               'content-type': 'application/x-amz-json-1.1',
               'date': 'Tue, 19 Oct 2021 20:52:03 GMT',
               'x-amzn-requestid': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
           },
           'HTTPStatusCode': 200,
           'RequestId': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
           'RetryAttempts': 0
       }
   }
   ```

   For this demo, we use a PyTorch (v1.7.1) ResNet-18 model to perform image classification. The following Python code sample stores the framework, framework version, domain, and task into variables for later use:

   ```
   # ML framework details
   framework = 'pytorch'
   framework_version = '1.7.1'
   
   # ML model details
   ml_domain = 'COMPUTER_VISION'
   ml_task = 'IMAGE_CLASSIFICATION'
   ```

1. **Upload your machine learning model to Amazon S3.**

   Use this PyTorch (v1.7.1) ResNet-18 model if you do not have a pre-trained machine learning model:

   ```
   # Optional: Download a sample PyTorch model
   import torch
   from torchvision import models, transforms, datasets
   
   # Create an example input for tracing
   image = torch.zeros([1, 3, 256, 256], dtype=torch.float32)
   
   # Load a pretrained resnet18 model from TorchHub
   model = models.resnet18(pretrained=True)
   
   # Tell the model we are using it for evaluation (not training). Note this is required for Inferentia compilation.
   model.eval()
   model_trace = torch.jit.trace(model, image)
   
   # Save your traced model
   model_trace.save('model.pth')
   ```

   Download a sample inference script `inference.py`. Create a `code` directory and move the inference script to the `code` directory.

   ```
   # Download the inference script
   !wget https://aws-ml-blog-artifacts.s3.us-east-2.amazonaws.com/inference.py
   
   # move it into a code/ directory
   !mkdir code
   !mv inference.py code/
   ```

   Amazon SageMaker AI requires pre-trained machine learning models to be packaged as a compressed TAR file (`*.tar.gz`). Compress your model and inference script to satisfy this requirement:

   ```
   !tar -czf test.tar.gz model.pth code/inference.py
   ```

   When your endpoint is provisioned, the files in the archive are extracted to `/opt/ml/model/` on the endpoint.

   After you compress your model and model artifacts as a `.tar.gz` file, upload them to your Amazon S3 bucket. The following example demonstrates how to upload your model to Amazon S3 using the AWS CLI:

   ```
   !aws s3 cp test.tar.gz s3://{your-bucket}/models/
   ```

1. **Select a prebuilt Docker inference image or create your own Inference Docker Image.**

   SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker AI images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

   If none of the existing SageMaker AI containers meet your needs and you don't have an existing container of your own, create a new Docker image. See [Containers with custom inference code](your-algorithms-inference-main.md) for information about how to create your Docker image.

   The following demonstrates how to retrieve a PyTorch version 1.7.1 inference image using the SageMaker Python SDK:

   ```
   from sagemaker import image_uris
   
   ## Uncomment and replace with your own values if you did not define
   ## these variables in a previous step.
   #framework = 'pytorch'
   #framework_version = '1.7.1'
   
   # Note: you can use any CPU-based instance here, 
   # this is just to set the arch as CPU for the Docker image
   instance_type = 'ml.m5.2xlarge' 
   
   image_uri = image_uris.retrieve(framework, 
                                   region, 
                                   version=framework_version, 
                                   py_version='py3', 
                                   instance_type=instance_type, 
                                   image_scope='inference')
   ```

   For a list of available SageMaker AI Instances, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

1. **Create a sample payload archive.**

   Create an archive that contains individual files that the load testing tool can send to your SageMaker AI endpoints. Your inference code must be able to read the file formats from the sample payload.

   The following downloads a .jpg image that this example uses in a later step for the ResNet-18 model.

   ```
   !wget https://cdn.pixabay.com/photo/2020/12/18/05/56/flowers-5841251_1280.jpg
   ```

   Compress the sample payload as a tarball:

   ```
   !tar -cvzf payload.tar.gz flowers-5841251_1280.jpg
   ```

   Upload the sample payload to Amazon S3 and note the Amazon S3 URI:

   ```
   !aws s3 cp payload.tar.gz s3://{bucket}/models/
   ```

   You need the Amazon S3 URI in a later step, so store it in a variable:

   ```
   bucket_prefix='models'
   bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
   payload_s3_key = f"{bucket_prefix}/payload.tar.gz"
   sample_payload_url= f"s3://{bucket}/{payload_s3_key}"
   ```

1. **Prepare your model input for the recommendations job**

   For the last prerequisite, you have two options to prepare your model input. You can either register your model with SageMaker Model Registry, which you can use to catalog models for production, or you can create a SageMaker AI model and specify it in the `ContainerConfig` field when creating a recommendations job. The first option is best if you want to take advantage of the features that [Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) provides, such as managing model versions and automating model deployment. The second option is ideal if you want to get started quickly. For the first option, go to step 7. For the second option, skip step 7 and go to step 8.

1. **Option 1: Register your model in the model registry**

   With SageMaker Model Registry, you can catalog models for production, manage model versions, associate metadata (such as training metrics) with a model, manage the approval status of a model, deploy models to production, and automate model deployment with CI/CD.

   When you use SageMaker Model Registry to track and manage your models, they are represented as a versioned model package within model package groups. Unversioned model packages are not part of a model group. Model package groups hold multiple versions or iterations of a model. Though it is not required to create them for every model in the registry, they help organize various models that all have the same purpose and provide automatic versioning.

   To use Amazon SageMaker Inference Recommender, you must have a versioned model package. You can create a versioned model package programmatically with the AWS SDK for Python (Boto3) or with Amazon SageMaker Studio Classic. To create a versioned model package programmatically, first create a model package group with the `CreateModelPackageGroup` API, and then create a model package with the `CreateModelPackage` API. Calling `CreateModelPackage` creates a versioned model package.

   See [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md) for detailed instructions about how to programmatically and interactively create a model package group and how to create a versioned model package, respectively, with the AWS SDK for Python (Boto3) and Amazon SageMaker Studio Classic.

   The following code sample demonstrates how to create a versioned model package using the AWS SDK for Python (Boto3).
**Note**  
You do not need to approve the model package to create an Inference Recommender job.

   1. **Create a model package group**

      Create a model package group with the `CreateModelPackageGroup` API. Provide a name to the model package group for the `ModelPackageGroupName` and optionally provide a description of the model package in the `ModelPackageGroupDescription` field.

      ```
      model_package_group_name = '<INSERT>'
      model_package_group_description = '<INSERT>' 
      
      model_package_group_input_dict = {
       "ModelPackageGroupName" : model_package_group_name,
       "ModelPackageGroupDescription" : model_package_group_description,
      }
      
      model_package_group_response = sagemaker_client.create_model_package_group(**model_package_group_input_dict)
      ```

      See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [`CreateModelPackageGroup`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackageGroup.html).

      Create a model package by specifying a Docker image that runs your inference code and the Amazon S3 location of your model artifacts, and by providing values for `InferenceSpecification`. `InferenceSpecification` should contain information about inference jobs that can be run with models based on this model package, including the following:
      + The Amazon ECR paths of images that run your inference code.
      + (Optional) The instance types that the model package supports for transform jobs and real-time endpoints used for inference.
      + The input and output content formats that the model package supports for inference.

      In addition, you must specify the following parameters when you create a model package:
      + [Domain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Domain): The machine learning domain of your model package and its components. Common machine learning domains include computer vision and natural language processing.
      + [Task](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Task): The machine learning task your model package accomplishes. Common machine learning tasks include object detection and image classification. Specify "OTHER" if none of the tasks listed in the [API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) satisfy your use case. See the [Task](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Task) API field descriptions for a list of supported machine learning tasks.
      + [SamplePayloadUrl](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-SamplePayloadUrl): The Amazon Simple Storage Service (Amazon S3) path where the sample payload is stored. This path must point to a single GZIP compressed TAR archive (.tar.gz suffix).
      + [Framework](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelPackageContainerDefinition.html#sagemaker-Type-ModelPackageContainerDefinition-Framework): The machine learning framework of the model package container image.
      + [FrameworkVersion](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelPackageContainerDefinition.html#sagemaker-Type-ModelPackageContainerDefinition-FrameworkVersion): The framework version of the model package container image.

      If you provide an allow list of instance types to use to generate inferences in real-time for the [SupportedRealtimeInferenceInstanceTypes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InferenceSpecification.html#sagemaker-Type-InferenceSpecification-SupportedRealtimeInferenceInstanceTypes), Inference Recommender limits the search space for instance types during a `Default` job. Use this parameter if you have budget constraints or know there's a specific set of instance types that can support your model and container image.

      In a previous step, we downloaded a pre-trained ResNet18 model and stored it in an Amazon S3 bucket in a directory called `models`. We retrieved a PyTorch (v1.7.1) Deep Learning Container inference image and stored the URI in a variable called `image_uri`. Use those variables in the following code sample to define a dictionary used as input to the [`CreateModelPackage`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html) API.

      ```
      # Provide the Amazon S3 URI of your compressed tarfile
      # so that Model Registry knows where to find your model artifacts
      bucket_prefix='models'
      bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
      model_s3_key = f"{bucket_prefix}/test.tar.gz"
      model_url= f"s3://{bucket}/{model_s3_key}"
      
      # Similar open source model to the packaged model
      # The name of the ML model as standardized by common model zoos
      nearest_model_name = 'resnet18'
      
      # The supported MIME types for input and output data. In this example, 
      # we are using images as input.
      input_content_type='image/jpeg'
      
      
      # Optional - provide a description of your model.
      model_package_description = '<INSERT>'
      
      ## Uncomment if you did not store the domain and task in an earlier
      ## step 
      #ml_domain = 'COMPUTER_VISION'
      #ml_task = 'IMAGE_CLASSIFICATION'
      
      ## Uncomment if you did not store the framework and framework version
      ## in a previous step.
      #framework = 'PYTORCH'
      #framework_version = '1.7.1'
      
      # Optional: Used for optimizing your model using SageMaker Neo
      # PyTorch uses NCHW format for images
      data_input_configuration = "[[1,3,256,256]]"
      
      # Create a dictionary to use as input for creating a model package
      model_package_input_dict = {
              "ModelPackageGroupName" : model_package_group_name,
              "ModelPackageDescription" : model_package_description,
              "Domain": ml_domain,
              "Task": ml_task,
              "SamplePayloadUrl": sample_payload_url,
              "InferenceSpecification": {
                      "Containers": [
                          {
                              "Image": image_uri,
                              "ModelDataUrl": model_url,
                              "Framework": framework.upper(), 
                              "FrameworkVersion": framework_version,
                              "NearestModelName": nearest_model_name,
                              "ModelInput": {"DataInputConfig": data_input_configuration}
                          }
                          ],
                      "SupportedContentTypes": [input_content_type]
              }
          }
      ```

   1. **Create a model package**

      Use the `CreateModelPackage` API to create a model package. Pass the input dictionary defined in the previous step:

      ```
      model_package_response = sagemaker_client.create_model_package(**model_package_input_dict)
      ```

      You need the model package ARN to use Amazon SageMaker Inference Recommender. Note the ARN of the model package or store it in a variable:

      ```
      model_package_arn = model_package_response["ModelPackageArn"]
      
      print('ModelPackage Version ARN : {}'.format(model_package_arn))
      ```

1. **Option 2: Create a model and configure the `ContainerConfig` field**

   Use this option if you want to start an inference recommendations job and don't need to register your model in the Model Registry. In the following steps, you create a model in SageMaker AI and configure the `ContainerConfig` field as input for the recommendations job.

   1. **Create a model**

      Create a model with the `CreateModel` API. For an example that calls this method when deploying a model to SageMaker AI Hosting, see [Create a Model (AWS SDK for Python (Boto3))](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html#realtime-endpoints-deployment-create-model).

      In a previous step, we downloaded a pre-trained ResNet18 model and stored it in an Amazon S3 bucket in a directory called `models`. We retrieved a PyTorch (v1.7.1) Deep Learning Container inference image and stored the URI in a variable called `image_uri`. We use those variables in the following code example, where we define a dictionary used as input to the [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-ModelName) API.

      ```
      model_name = '<name_of_the_model>'
      # Role to give SageMaker permission to access AWS services.
      sagemaker_role = "arn:aws:iam::<account-id>:role/<role-name>"
      
      # Provide the Amazon S3 URI of your compressed tarfile
      # so that SageMaker AI knows where to find your model artifacts
      bucket_prefix='models'
      bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
      model_s3_key = f"{bucket_prefix}/test.tar.gz"
      model_url= f"s3://{bucket}/{model_s3_key}"
      
      #Create model
      create_model_response = sagemaker_client.create_model(
          ModelName = model_name,
          ExecutionRoleArn = sagemaker_role, 
          PrimaryContainer = {
              'Image': image_uri,
              'ModelDataUrl': model_url,
          })
      ```

   1. **Configure the `ContainerConfig` field**

      Next, you must configure the [ContainerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RecommendationJobInputConfig.html#sagemaker-Type-RecommendationJobInputConfig-ContainerConfig) field with the model you just created and specify the following parameters in it:
      + `Domain`: The machine learning domain of the model and its components, such as computer vision or natural language processing.
      + `Task`: The machine learning task that the model accomplishes, such as image classification or object detection.
      + `PayloadConfig`: The configuration for the payload for a recommendation job. For more information about the subfields, see [`RecommendationJobPayloadConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RecommendationJobPayloadConfig.html#sagemaker-Type-RecommendationJobPayloadConfig-SamplePayloadUrl).
      + `Framework`: The machine learning framework of the container image, such as PyTorch.
      + `FrameworkVersion`: The framework version of the container image.
      + (Optional) `SupportedInstanceTypes`: A list of the instance types that are used to generate inferences in real-time.

      If you use the `SupportedInstanceTypes` parameter, Inference Recommender limits the search space for instance types during a `Default` job. Use this parameter if you have budget constraints or know there's a specific set of instance types that can support your model and container image.

      In the following code example, we use the previously defined parameters, along with `NearestModelName`, to define a dictionary used as input to the `[CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html)` API.

      ```
      ## Uncomment if you did not store the domain and task in a previous step
      #ml_domain = 'COMPUTER_VISION'
      #ml_task = 'IMAGE_CLASSIFICATION'
      
      ## Uncomment if you did not store the framework and framework version in a previous step
      #framework = 'PYTORCH'
      #framework_version = '1.7.1'
      
      # The name of the ML model as standardized by common model zoos
      nearest_model_name = 'resnet18'
      
      # The supported MIME types for input and output data. In this example, 
      # we are using images as input
      input_content_type='image/jpeg'
      
      # Optional: Used for optimizing your model using SageMaker Neo
      # PyTorch uses NCHW format for images
      data_input_configuration = "[[1,3,256,256]]"
      
      # Create a dictionary to use as input for creating an inference recommendation job
      container_config = {
              "Domain": ml_domain,
              "Framework": framework.upper(), 
              "FrameworkVersion": framework_version,
              "NearestModelName": nearest_model_name,
              "PayloadConfig": { 
                  "SamplePayloadUrl": sample_payload_url,
                  "SupportedContentTypes": [ input_content_type ]
               },
              "DataInputConfig": data_input_configuration,
              "Task": ml_task,
              }
      ```
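As an optional extension of the preceding example, you could also include the `SupportedInstanceTypes` parameter described earlier to limit the search space for a `Default` job. The following is a minimal sketch; the instance types and S3 URL shown here are placeholders, not recommendations:

```
# Hypothetical example: the same container_config dictionary with the
# optional SupportedInstanceTypes key added to constrain benchmarking.
container_config = {
    "Domain": "COMPUTER_VISION",
    "Framework": "PYTORCH",
    "FrameworkVersion": "1.7.1",
    "NearestModelName": "resnet18",
    "PayloadConfig": {
        "SamplePayloadUrl": "s3://<your-bucket>/<payload-key>",  # placeholder
        "SupportedContentTypes": ["image/jpeg"],
    },
    "DataInputConfig": "[[1,3,256,256]]",
    "Task": "IMAGE_CLASSIFICATION",
    # Limit the Default job to instance types you know fit your budget and model
    "SupportedInstanceTypes": ["ml.c5.xlarge", "ml.m5.large"],
}
```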

# Recommendation jobs with Amazon SageMaker Inference Recommender
<a name="inference-recommender-recommendation-jobs"></a>

Amazon SageMaker Inference Recommender can make two types of recommendations:

1. Inference recommendations (`Default` job type) run a set of load tests on the recommended instance types. You can also run a load test for a serverless endpoint. You only need to provide a model package Amazon Resource Name (ARN) to launch this type of recommendation job. Inference recommendation jobs complete within 45 minutes.

1. Endpoint recommendations (`Advanced` job type) are based on a custom load test where you select your desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements. This job takes an average of 2 hours to complete depending on the job duration set and the total number of inference configurations tested.

Both types of recommendations use the same APIs to create, describe, and stop jobs. The output is a list of instance configuration recommendations with associated environment variables, cost, throughput, and latency metrics. Recommendation jobs also provide an initial instance count, which you can use to configure an autoscaling policy. To differentiate between the two types of jobs, when you’re creating a job through either the SageMaker AI console or the APIs, specify `Default` to create preliminary endpoint recommendations and `Advanced` for custom load testing and endpoint recommendations.

**Note**  
You do not need to do both types of recommendation jobs in your own workflow. You can do either independently of the other.

Inference Recommender can also provide you with a list of prospective instances: the top five instance types that are optimized for cost, throughput, and latency for model deployment, along with a confidence score. You can choose these instances when deploying your model. Inference Recommender automatically performs benchmarking against your model to provide the prospective instances. Since these are preliminary recommendations, we recommend that you run further instance recommendation jobs to get more accurate results. To view the prospective instances, go to your SageMaker AI model details page. For more information, see [Get instant prospective instances](inference-recommender-prospective.md).

**Topics**
+ [Get instant prospective instances](inference-recommender-prospective.md)
+ [Inference recommendations](inference-recommender-instance-recommendation.md)
+ [Get an inference recommendation for an existing endpoint](inference-recommender-existing-endpoint.md)
+ [Stop your inference recommendation](instance-recommendation-stop.md)
+ [Compiled recommendations with Neo](inference-recommender-neo-compilation.md)
+ [Recommendation results](inference-recommender-interpret-results.md)
+ [Get autoscaling policy recommendations](inference-recommender-autoscaling.md)
+ [Run a custom load test](inference-recommender-load-test.md)
+ [Stop your load test](load-test-stop.md)
+ [Troubleshoot Inference Recommender errors](inference-recommender-troubleshooting.md)

# Get instant prospective instances
<a name="inference-recommender-prospective"></a>

Inference Recommender can also provide you with a list of *prospective instances*, or instance types that might be suitable for your model, on your SageMaker AI model details page. Inference Recommender automatically performs preliminary benchmarking against your model for you to provide the top five prospective instances. Since these are preliminary recommendations, we recommend that you run further instance recommendation jobs to get more accurate results.

You can view a list of prospective instances for your model programmatically by using the [DescribeModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeModel.html) API or the SageMaker Python SDK, or interactively through the SageMaker AI console.

**Note**  
You won’t get prospective instances for models that you created in SageMaker AI before this feature became available.

To view the prospective instances for your model through the console, do the following:

1. Go to the SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Models**.

1. From the list of models, choose your model.

On the details page for your model, go to the **Prospective instances to deploy model** section. The following screenshot shows this section.

![\[Screenshot of the list of prospective instances on the model details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-prospective.png)


In this section, you can view the prospective instances that are optimized for cost, throughput, and latency for model deployment, along with additional information for each instance type such as the memory size, CPU and GPU count, and cost per hour.

If you decide that you want to benchmark a sample payload and run a full inference recommendation job for your model, you can start a default inference recommendation job from this page. To start a default job through the console, do the following:

1. On your model details page, in the **Prospective instances to deploy model** section, choose **Run Inference recommender job**.

1. In the dialog box that pops up, for **S3 bucket for benchmarking payload**, enter the Amazon S3 location where you’ve stored a sample payload for your model.

1. For **Payload content type**, enter the MIME types for your payload data.

1. (Optional) In the **Model compilation using SageMaker Neo** section, for the **Data input configuration**, enter a data shape in dictionary format.

1. Choose **Run job**.

Inference Recommender starts the job, and you can view the job and its results from the **Inference recommender** list page in the SageMaker AI console.

If you want to run an advanced job and perform custom load tests, or if you want to configure additional settings and parameters for your job, see [Run a custom load test](inference-recommender-load-test.md).

# Inference recommendations
<a name="inference-recommender-instance-recommendation"></a>

Inference recommendation jobs run a set of load tests on recommended instance types or a serverless endpoint. Inference recommendation jobs use performance metrics that are based on load tests using the sample data you provided during model version registration.

**Note**  
Before you create an Inference Recommender recommendation job, make sure you have satisfied the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md).

The following sections demonstrate how to use Amazon SageMaker Inference Recommender to create an inference recommendation based on your model type using the AWS SDK for Python (Boto3), the AWS CLI, Amazon SageMaker Studio Classic, or the SageMaker AI console.

**Topics**
+ [Create an inference recommendation](instance-recommendation-create.md)
+ [Get your inference recommendation job results](instance-recommendation-results.md)

# Create an inference recommendation
<a name="instance-recommendation-create"></a>

Create an inference recommendation programmatically using the AWS SDK for Python (Boto3) or the AWS CLI, or interactively using Studio Classic or the SageMaker AI console. Specify a job name for your inference recommendation, an AWS IAM role ARN, an input configuration, and either the model package ARN from when you registered your model with the model registry, or your model name and the `ContainerConfig` dictionary from when you created your model in the **Prerequisites** section.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html) API to start an inference recommendation job. Set the `JobType` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ A model package ARN or model name. Inference Recommender supports either one model package ARN or a model name as input. Specify one of the following:
  + The ARN of the versioned model package you created when you registered your model with SageMaker AI model registry. Define this for `ModelPackageVersionArn` in the `InputConfig` field.
  + The name of the model you created. Define this for `ModelName` in the `InputConfig` field. Also, provide the `ContainerConfig` dictionary, which includes the required fields that need to be provided with the model name. Define this for `ContainerConfig` in the `InputConfig` field. In the `ContainerConfig`, you can also optionally specify the `SupportedEndpointType` field as either `RealTime` or `Serverless`. If you specify this field, Inference Recommender returns recommendations for only that endpoint type. If you don't specify this field, Inference Recommender returns recommendations for both endpoint types.
+ A name for your Inference Recommender recommendation job for the `JobName` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.

Import the AWS SDK for Python (Boto3) package and create a SageMaker AI client object using the client class. If you followed the steps in the **Prerequisites** section, only specify one of the following:
+ Option 1: If you would like to create an inference recommendations job with a model package ARN, then store the model package group ARN in a variable named `model_package_arn`.
+ Option 2: If you would like to create an inference recommendations job with a model name and `ContainerConfig`, store the model name in a variable named `model_name` and the `ContainerConfig` dictionary in a variable named `container_config`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<INSERT>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide only one of model package ARN or model name, not both.
# Provide your model package ARN that was created when you registered your 
# model with Model Registry 
model_package_arn = '<INSERT>'
## Uncomment if you would like to create an inference recommendations job with a
## model name instead of a model package ARN, and comment out model_package_arn above
## Provide your model name
# model_name = '<INSERT>'
## Provide your container config 
# container_config = '<INSERT>'

# Provide a unique job name for SageMaker Inference Recommender job
job_name = '<INSERT>'

# Inference Recommender job type. Set to Default to get an initial recommendation
job_type = 'Default'

# Provide an IAM Role that gives SageMaker Inference Recommender permission to 
# access AWS services
role_arn = 'arn:aws:iam::<account>:role/*'

sagemaker_client.create_inference_recommendations_job(
    JobName = job_name,
    JobType = job_type,
    RoleArn = role_arn,
    # Provide only one of model package ARN or model name, not both. 
    # If you would like to create an inference recommendations job with a model name,
    # uncomment ModelName and ContainerConfig, and comment out ModelPackageVersionArn.
    InputConfig = {
        'ModelPackageVersionArn': model_package_arn
        # 'ModelName': model_name,
        # 'ContainerConfig': container_config
    }
)
```

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html).
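`CreateInferenceRecommendationsJob` starts the job asynchronously. As a minimal sketch (the helper function name and polling interval here are our own, not part of the API), you could poll the `DescribeInferenceRecommendationsJob` API until the job reaches a terminal status:

```
import time

def wait_for_recommendations(sagemaker_client, job_name, poll_seconds=60):
    """Poll the Inference Recommender job until it reaches a terminal status.

    Assumes sagemaker_client and job_name are defined as in the example above.
    """
    while True:
        response = sagemaker_client.describe_inference_recommendations_job(
            JobName=job_name
        )
        if response["Status"] in ("COMPLETED", "FAILED", "STOPPED"):
            return response
        time.sleep(poll_seconds)
```

Since a `Default` job completes within 45 minutes, a one-minute polling interval is usually sufficient.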

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` API to start an inference recommendation job. Set the `job-type` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ The Amazon Resource Name (ARN) of an IAM role that enables Amazon SageMaker Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ A model package ARN or model name. Inference Recommender supports either one model package ARN or a model name as input. Specify one of the following:
  + The ARN of the versioned model package you created when you registered your model with Model Registry. Define this for `ModelPackageVersionArn` in the `input-config` field.
  + The name of the model you created. Define this for `ModelName` in the `input-config` field. Also, provide the `ContainerConfig` dictionary which includes the required fields that need to be provided with the model name. Define this for `ContainerConfig` in the `input-config` field. In the `ContainerConfig`, you can also optionally specify the `SupportedEndpointType` field as either `RealTime` or `Serverless`. If you specify this field, Inference Recommender returns recommendations for only that endpoint type. If you don't specify this field, Inference Recommender returns recommendations for both endpoint types.
+ A name for your Inference Recommender recommendation job for the `job-name` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.

To create an inference recommendation job with a model package ARN, use the following example:

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job_name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/* \
    --input-config "{
        \"ModelPackageVersionArn\": \"arn:aws:sagemaker:<region>:<account>:model-package/<model-package-name>/<version>\"
        }"
```

To create an inference recommendation job with a model name and `ContainerConfig`, use the following example. The example uses the `SupportedEndpointType` field to specify that we only want real-time inference recommendations:

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job_name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/* \
    --input-config "{
        \"ModelName\": \"model-name\",
        \"ContainerConfig\" : {
                \"Domain\": \"COMPUTER_VISION\",
                \"Framework\": \"PYTORCH\",
                \"FrameworkVersion\": \"1.7.1\",
                \"NearestModelName\": \"resnet18\",
                \"PayloadConfig\": 
                    {
                        \"SamplePayloadUrl\": \"s3://{bucket}/{payload_s3_key}\", 
                        \"SupportedContentTypes\": [\"image/jpeg\"]
                    },
                \"SupportedEndpointType\": \"RealTime\",
                \"DataInputConfig\": \"[[1,3,256,256]]\",
                \"Task\": \"IMAGE_CLASSIFICATION\"
            }
        }"
```

------
#### [ Amazon SageMaker Studio Classic ]

Create an inference recommendation job in Studio Classic.

1. In your Studio Classic application, choose the home icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)).

1. In the left sidebar of Studio Classic, choose **Models**.

1. Choose **Model Registry** from the dropdown list to display models you have registered with the model registry.

   The left panel displays a list of model groups. The list includes all the model groups registered with the model registry in your account, including models registered outside of Studio Classic.

1. Select the name of your model group. When you select your model group, the right pane of Studio Classic displays column heads such as **Versions** and **Setting**.

   If you have one or more model packages within your model group, you see a list of those model packages within the **Versions** column.

1. Choose the **Inference recommender** column.

1. Choose an IAM role that grants Inference Recommender permission to access AWS services. You can create a role and attach the `AmazonSageMakerFullAccess` IAM managed policy to accomplish this. Or you can let Studio Classic create a role for you.

1. Choose **Get recommendations**.

   The inference recommendation can take up to 45 minutes.
**Warning**  
Do not close this tab. If you close this tab, you cancel the instance recommendation job.

------
#### [ SageMaker AI console ]

Create an instance recommendation job through the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose **Create job**.

1. For **Step 1: Model configuration**, do the following:

   1. For **Job type**, choose **Default recommender job**.

   1. If you’re using a model registered in the SageMaker AI model registry, then turn on the **Choose a model from the model registry** toggle and do the following:

      1. From the **Model group** dropdown list, choose the model group in SageMaker AI model registry where your model is located.

      1. From the **Model version** dropdown list, choose the desired version of your model.

   1. If you’re using a model that you’ve created in SageMaker AI, then turn off the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model name** field, enter the name of your SageMaker AI model.

   1. From the **IAM role** dropdown list, you can select an existing AWS IAM role that has the necessary permissions to create an instance recommendation job. Alternatively, if you don’t have an existing role, you can choose **Create a new role** to open the role creation pop-up, and SageMaker AI adds the necessary permissions to the new role that you create.

   1. For **S3 bucket for benchmarking payload**, enter the Amazon S3 path to your sample payload archive, which should contain sample payload files that Inference Recommender uses to benchmark your model on different instance types.

   1. For **Payload content type**, enter the MIME types of your sample payload data.

   1. (Optional) If you turned off the **Choose a model from the model registry** toggle and specified a SageMaker AI model, then for **Container configuration**, do the following:

      1. For the **Domain** dropdown list, select the machine learning domain of the model, such as computer vision, natural language processing, or machine learning.

      1. For the **Framework** dropdown list, select the framework of your container, such as TensorFlow or XGBoost.

      1. For **Framework version**, enter the framework version of your container image.

      1. For the **Nearest model name** dropdown list, select the pre-trained model that most closely matches your own.

      1. For the **Task** dropdown list, select the machine learning task that the model accomplishes, such as image classification or regression.

   1. (Optional) For **Model compilation using SageMaker Neo**, you can configure the recommendation job for a model that you’ve compiled using SageMaker Neo. For **Data input configuration**, enter the correct input data shape for your model in a format similar to `{'input':[1,1024,1024,3]}`.

   1. Choose **Next**.

1. For **Step 2: Instances and environment parameters**, do the following:

   1. (Optional) For **Select instances for benchmarking**, you can select up to 8 instance types that you want to benchmark. If you don’t select any instances, Inference Recommender considers all instance types.

   1. Choose **Next**.

1. For **Step 3: Job parameters**, do the following:

   1. (Optional) For the **Job name** field, enter a name for your instance recommendation job. When you create the job, SageMaker AI appends a timestamp to the end of this name.

   1. (Optional) For the **Job description** field, enter a description for the job.

   1. (Optional) For the **Encryption key** dropdown list, choose an AWS KMS key by name or enter its ARN to encrypt your data.

   1. (Optional) For **Max test duration (s)**, enter the maximum number of seconds you want each test to run for.

   1. (Optional) For **Max invocations per minute**, enter the maximum number of requests per minute the endpoint can reach before stopping the recommendation job. After reaching this limit, SageMaker AI ends the job.

   1. (Optional) For **P99 Model latency threshold (ms)**, enter the model latency percentile in milliseconds.

   1. Choose **Next**.

1. For **Step 4: Review job**, review your configurations and then choose **Submit**.

------

# Get your inference recommendation job results
<a name="instance-recommendation-results"></a>

Collect the results of your inference recommendation job programmatically with AWS SDK for Python (Boto3), the AWS CLI, Studio Classic, or the SageMaker AI console.

------
#### [ AWS SDK for Python (Boto3) ]

Once an inference recommendation is complete, you can use `DescribeInferenceRecommendationsJob` to get the job details and recommendations. Provide the job name that you used when you created the inference recommendation job.

```
job_name='<INSERT>'
response = sagemaker_client.describe_inference_recommendations_job(
                    JobName=job_name)
```

Print the response object. The previous code sample stored the response in a variable named `response`.

```
print(response)
```

This returns a JSON response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Default', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 20, 4, 57, 627000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 20, 25, 1, 997000, tzinfo=tzlocal()), 
    'InputConfig': {
                'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
                'JobDurationInSeconds': 0
                }, 
    'InferenceRecommendations': [{
            'Metrics': {
                'CostPerHour': 0.20399999618530273, 
                'CostPerInference': 5.246913588052848e-06, 
                'MaximumInvocations': 648, 
                'ModelLatency': 263596
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5.xlarge', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
         }, 
         {
            'Metrics': {
                'CostPerHour': 0.11500000208616257, 
                'CostPerInference': 2.92620870823157e-06, 
                'MaximumInvocations': 655, 
                'ModelLatency': 826019
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5d.large', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
            }, 
            {
                'Metrics': {
                    'CostPerHour': 0.11500000208616257, 
                    'CostPerInference': 3.3625731248321244e-06, 
                    'MaximumInvocations': 570, 
                    'ModelLatency': 1085446
                    }, 
                'EndpointConfiguration': {
                    'EndpointName': 'endpoint-name', 
                    'VariantName': 'variant-name', 
                    'InstanceType': 'ml.m5.large', 
                    'InitialInstanceCount': 1
                    }, 
                'ModelConfiguration': {
                    'Compiled': False, 
                    'EnvironmentParameters': []
                    }
            }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1685', 
            'date': 'Tue, 26 Oct 2021 20:31:10 GMT'
            }, 
        'RetryAttempts': 0
        }
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, job ARN, job type, status, and the creation and last modified times. 

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.
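As an illustration of monitoring with these names, the following sketch builds the parameters for a CloudWatch `GetMetricStatistics` request for the endpoint's `Invocations` metric. The helper function is hypothetical; only the namespace, metric, and dimension names are standard SageMaker AI CloudWatch values:

```
import datetime

def invocations_metric_params(endpoint_name, variant_name, minutes=60):
    """Build keyword arguments for a CloudWatch get_metric_statistics call
    covering the last `minutes` minutes of endpoint invocations."""
    now = datetime.datetime.utcnow()
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],
    }

# You could then pass the result to a boto3 CloudWatch client:
# cloudwatch_client.get_metric_statistics(**invocations_metric_params("endpoint-name", "variant-name"))
```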

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.
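As an illustration (not part of the API), you could use these metrics to choose among the returned configurations, for example by picking the recommendation with the lowest estimated cost per inference:

```
def cheapest_recommendation(response):
    """Return the recommendation with the lowest estimated CostPerInference
    from a DescribeInferenceRecommendationsJob response dictionary."""
    recommendations = response["InferenceRecommendations"]
    return min(recommendations, key=lambda r: r["Metrics"]["CostPerInference"])
```

Applied to the example response above, this would select the `ml.c5d.large` configuration, whose `CostPerInference` is the lowest of the three.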

The following example shows the `InferenceRecommendations` part of the response for an inference recommendations job configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the metrics returned for a serverless endpoint with the given `MemorySizeInMB` and when `MaxConcurrency = 1`. To increase the possible throughput of the endpoint, increase the value of `MaxConcurrency` linearly. For example, if the inference recommendation shows `MaxInvocations` as `1000`, then increasing `MaxConcurrency` to `2` would support 2000 `MaxInvocations`. Note that this is true only up to a certain point, which can vary based on your model and code. Serverless recommendations also measure the `ModelSetupTime` metric, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
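The linear scaling rule above can be sketched as a one-line estimate. This is a rough heuristic that holds only up to a model-dependent limit, as noted; the function name is our own:

```
def estimated_max_invocations(base_max_invocations, max_concurrency):
    """Estimate serverless throughput from a recommendation.

    base_max_invocations is the MaxInvocations metric reported for
    MaxConcurrency = 1; the estimate scales linearly only up to a
    point that varies with your model and code.
    """
    return base_max_invocations * max_concurrency
```

For example, with a reported `MaxInvocations` of 1000 and `MaxConcurrency` raised to 2, the estimate is 2000 invocations per minute.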

------
#### [ AWS CLI ]

Once an inference recommendation is complete, you can use `describe-inference-recommendations-job` to get the job details and recommended instance types. Provide the job name that you used when you created the inference recommendation job.

```
aws sagemaker describe-inference-recommendations-job\
    --job-name <job-name>\
    --region <aws-region>
```

The JSON response should resemble the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Default', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 20, 4, 57, 627000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 20, 25, 1, 997000, tzinfo=tzlocal()), 
    'InputConfig': {
                'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
                'JobDurationInSeconds': 0
                }, 
    'InferenceRecommendations': [{
            'Metrics': {
                'CostPerHour': 0.20399999618530273, 
                'CostPerInference': 5.246913588052848e-06, 
                'MaximumInvocations': 648, 
                'ModelLatency': 263596
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5.xlarge', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
         }, 
         {
            'Metrics': {
                'CostPerHour': 0.11500000208616257, 
                'CostPerInference': 2.92620870823157e-06, 
                'MaximumInvocations': 655, 
                'ModelLatency': 826019
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5d.large', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
            }, 
            {
                'Metrics': {
                    'CostPerHour': 0.11500000208616257, 
                    'CostPerInference': 3.3625731248321244e-06, 
                    'MaximumInvocations': 570, 
                    'ModelLatency': 1085446
                    }, 
                'EndpointConfiguration': {
                    'EndpointName': 'endpoint-name', 
                    'VariantName': 'variant-name', 
                    'InstanceType': 'ml.m5.large', 
                    'InitialInstanceCount': 1
                    }, 
                'ModelConfiguration': {
                    'Compiled': False, 
                    'EnvironmentParameters': []
                    }
            }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1685', 
            'date': 'Tue, 26 Oct 2021 20:31:10 GMT'
            }, 
        'RetryAttempts': 0
        }
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, job ARN, and the creation and last modified times. 

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch Events. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in milliseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.
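
As a sketch, you could also rank the returned recommendations programmatically before choosing one. The helper below and its name are illustrative, not part of the SageMaker API, and the sample response is trimmed from the example above.

```python
def rank_recommendations(response, metric="CostPerInference"):
    """Sort Inference Recommender results by a metric, lowest value first."""
    return sorted(response["InferenceRecommendations"],
                  key=lambda rec: rec["Metrics"][metric])

# Trimmed sample mirroring the example response above
sample = {
    "InferenceRecommendations": [
        {"Metrics": {"CostPerInference": 5.24e-06, "ModelLatency": 263596},
         "EndpointConfiguration": {"InstanceType": "ml.c5.xlarge"}},
        {"Metrics": {"CostPerInference": 2.92e-06, "ModelLatency": 826019},
         "EndpointConfiguration": {"InstanceType": "ml.c5d.large"}},
    ]
}

best = rank_recommendations(sample)[0]
print(best["EndpointConfiguration"]["InstanceType"])  # ml.c5d.large
```

Sorting on a different metric, such as `ModelLatency`, surfaces a different top candidate, which is useful when your goals weight latency over cost.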

The following example shows the `InferenceRecommendations` part of the response for an inference recommendations job configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the metrics returned for a serverless endpoint with the given `MemorySizeInMB` and when `MaxConcurrency = 1`. To increase the possible throughput on the endpoint, increase the value of `MaxConcurrency` linearly. For example, if the inference recommendation shows `MaxInvocations` as `1000`, then increasing `MaxConcurrency` to `2` would support 2000 `MaxInvocations`. Note that this holds only up to a certain point, which can vary based on your model and code. Serverless recommendations also include the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).

------
#### [ Amazon SageMaker Studio Classic ]

The inference recommendations populate in a new **Inference recommendations** tab within Studio Classic. It can take up to 45 minutes for the results to show up. This tab contains **Results** and **Details** column headings.

The **Details** column provides information about the inference recommendation job, such as the name of the inference recommendation, when the job was created (**Creation time**), and more. It also provides **Settings** information, such as the maximum number of invocations that occurred per minute and information about the Amazon Resource Names used.

The **Results** column provides a **Deployment goals** and **SageMaker AI recommendations** window in which you can adjust the order in which the results are displayed based on deployment importance. There are three dropdown menus that you can use to set the level of importance of **Cost**, **Latency**, and **Throughput** for your use case. For each goal (cost, latency, and throughput), you can set the level of importance: **Lowest Importance**, **Low Importance**, **Moderate importance**, **High importance**, or **Highest importance**. 

Based on your selections of importance for each goal, Inference Recommender displays its top recommendation in the **SageMaker recommendation** field on the right of the panel, along with the estimated cost per hour and cost per inference. It also provides information about the expected model latency, maximum number of invocations, and the number of instances. For serverless recommendations, you can see the ideal values for the maximum concurrency and endpoint memory size.

In addition to the top recommendation displayed, you can also see the same information displayed for all instances that Inference Recommender tested in the **All runs** section.

------
#### [ SageMaker AI console ]

You can view your instance recommendation jobs in the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose the name of your inference recommendation job.

On the details page for your job, you can view the **Inference recommendations**, which are the instance types SageMaker AI recommends for your model, as shown in the following screenshot.

![\[Screenshot of the inference recommendations list on the job details page in the SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-instant-recs.png)


In this section, you can compare the instance types by various factors such as **Model latency**, **Cost per hour**, **Cost per inference**, and **Invocations per minute**.

On this page, you can also view the configurations you specified for your job. In the **Monitor** section, you can view the Amazon CloudWatch metrics that were logged for each instance type. To learn more about interpreting these metrics, see [Interpret results](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html).

------

For more information about interpreting the results of your recommendation job, see [Recommendation results](inference-recommender-interpret-results.md).

# Get an inference recommendation for an existing endpoint
<a name="inference-recommender-existing-endpoint"></a>

Inference recommendation jobs run a set of load tests on recommended instance types and an existing endpoint. Inference recommendation jobs use performance metrics that are based on load tests using the sample data you provided during model version registration.

You can benchmark and get inference recommendations for an existing SageMaker AI Inference endpoint to help you improve the performance of your endpoint. The procedure of getting recommendations for an existing SageMaker AI Inference endpoint is similar to the procedure for [getting inference recommendations](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-instance-recommendation.html) without an endpoint. There are several feature exclusions to take note of when benchmarking an existing endpoint:
+ You can only use one existing endpoint per Inference Recommender job.
+ You can only have one variant on your endpoint.
+ You can’t use an endpoint that enables autoscaling.
+ This functionality is only supported for [Real-Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).
+ This functionality doesn’t support [Real-Time Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html).
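
Before submitting a job, you might verify that an existing endpoint satisfies the single-variant restriction above. The following sketch assumes Boto3 and valid AWS credentials; the helper names are illustrative, not part of the SageMaker API.

```python
def has_single_variant(endpoint_desc):
    """An endpoint is eligible only if it hosts exactly one production variant."""
    return len(endpoint_desc.get("ProductionVariants", [])) == 1

def check_endpoint_eligibility(endpoint_name, region):
    """Fetch the endpoint description and run the single-variant check
    (requires boto3 and AWS credentials)."""
    import boto3
    sm = boto3.client("sagemaker", region_name=region)
    return has_single_variant(sm.describe_endpoint(EndpointName=endpoint_name))
```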

**Warning**  
We strongly recommend that you don't run an Inference Recommender job on a production endpoint that handles live traffic. The synthetic load during benchmarking can affect your production endpoint and cause throttling or provide inaccurate benchmark results. We recommend that you use a non-production or developer endpoint for comparison purposes. 

The following sections demonstrate how to use Amazon SageMaker Inference Recommender to create an inference recommendation for an existing endpoint based on your model type using the AWS SDK for Python (Boto3) and the AWS CLI.

**Note**  
Before you create an Inference Recommender recommendation job, make sure you have satisfied the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md).

## Prerequisites
<a name="inference-recommender-existing-endpoint-prerequisites"></a>

If you don’t already have a SageMaker AI Inference endpoint, you can either [get an inference recommendation](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-instance-recommendation.html) without an endpoint, or you can create a Real-Time Inference endpoint by following the instructions in [Create your endpoint and deploy your model](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html).

## Create an inference recommendation job for an existing endpoint
<a name="inference-recommender-existing-endpoint-create"></a>

Create an inference recommendation programmatically using AWS SDK for Python (Boto3), or the AWS CLI. Specify a job name for your inference recommendation, the name of an existing SageMaker AI Inference endpoint, an AWS IAM role ARN, an input configuration, and your model package ARN from when you registered your model with the model registry.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html) API to get an inference recommendation. Set the `JobType` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ Provide a name for your Inference Recommender recommendation job for the `JobName` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ The ARN of the versioned model package you created when you registered your model with the model registry. Define this for `ModelPackageVersionArn` in the `InputConfig` field.
+ Provide the name of an existing SageMaker AI Inference endpoint that you want to benchmark in Inference Recommender for `Endpoints` in the `InputConfig` field.

Import the AWS SDK for Python (Boto3) package and create a SageMaker AI client object using the client class. If you followed the steps in the **Prerequisites** section, the model package ARN was stored in a variable named `model_package_arn`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<region>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide your model package ARN that was created when you registered your 
# model with Model Registry 
model_package_arn = '<model-package-arn>'

# Provide a unique job name for SageMaker Inference Recommender job
job_name = '<job-name>'

# Inference Recommender job type. Set to Default to get an initial recommendation
job_type = 'Default'

# Provide an IAM Role that gives SageMaker Inference Recommender permission to 
# access AWS services
role_arn = 'arn:aws:iam::<account>:role/*'

# Provide the name of the existing endpoint that you want to benchmark in Inference Recommender
endpoint_name = '<existing-endpoint-name>'

sagemaker_client.create_inference_recommendations_job(
    JobName = job_name,
    JobType = job_type,
    RoleArn = role_arn,
    InputConfig = {
        'ModelPackageVersionArn': model_package_arn,
        'Endpoints': [{'EndpointName': endpoint_name}]
    }
)
```

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html).
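
The job runs asynchronously, so after calling `create_inference_recommendations_job` you might poll for a terminal status before reading the results. The following is a minimal polling sketch; the helper name and poll interval are illustrative.

```python
import time

def wait_for_job(sagemaker_client, job_name, poll_seconds=60):
    """Poll describe_inference_recommendations_job until the job reaches
    a terminal status, then return the full job description."""
    terminal_statuses = {"COMPLETED", "FAILED", "STOPPED"}
    while True:
        desc = sagemaker_client.describe_inference_recommendations_job(JobName=job_name)
        if desc["Status"] in terminal_statuses:
            return desc
        time.sleep(poll_seconds)

# Usage (with the client and job name created earlier):
# job = wait_for_job(sagemaker_client, job_name)
```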

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` API to get an inference recommendation for an existing endpoint. Set the `job-type` field to `Default` for inference recommendation jobs. In addition, provide the following:
+ Provide a name for your Inference Recommender recommendation job for the `job-name` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Amazon SageMaker Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ The ARN of the versioned model package you created when you registered your model with Model Registry. Define this for `ModelPackageVersionArn` in the `input-config` field.
+ Provide the name of an existing SageMaker AI Inference endpoint that you want to benchmark in Inference Recommender for `Endpoints` in the `input-config` field.

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job-name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/* \
    --input-config "{
        \"ModelPackageVersionArn\": \"arn:aws:sagemaker:<region>:<account>:model-package/<resource-id>\",
        \"Endpoints\": [{\"EndpointName\": \"<endpoint-name>\"}]
        }"
```

------

## Get your inference recommendation job results
<a name="inference-recommender-existing-endpoint-results"></a>

You can collect the results of your inference recommendation job programmatically with the same procedure for standard inference recommendation jobs. For more information, see [Get your inference recommendation job results](instance-recommendation-results.md).

When you get inference recommendation job results for an existing endpoint, you should receive a JSON response similar to the following:

```
{
    "JobName": "job-name",
    "JobType": "Default",
    "JobArn": "arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id",
    "RoleArn": "iam-role-arn",
    "Status": "COMPLETED",
    "CreationTime": 1664922919.2,
    "LastModifiedTime": 1664924208.291,
    "InputConfig": {
        "ModelPackageVersionArn": "arn:aws:sagemaker:region:account-id:model-package/resource-id",
        "Endpoints": [
            {
                "EndpointName": "endpoint-name"
            }
        ]
    },
    "InferenceRecommendations": [
        {
            "Metrics": {
                "CostPerHour": 0.7360000014305115,
                "CostPerInference": 7.456940238625975e-06,
                "MaxInvocations": 1645,
                "ModelLatency": 171
            },
            "EndpointConfiguration": {
                "EndpointName": "sm-endpoint-name",
                "VariantName": "variant-name",
                "InstanceType": "ml.g4dn.xlarge",
                "InitialInstanceCount": 1
            },
            "ModelConfiguration": {
                "EnvironmentParameters": [
                    {
                        "Key": "TS_DEFAULT_WORKERS_PER_MODEL",
                        "ValueType": "string",
                        "Value": "4"
                    }
                ]
            }
        }
    ],
    "EndpointPerformances": [
        {
            "Metrics": {
                "MaxInvocations": 184,
                "ModelLatency": 1312
            },
            "EndpointConfiguration": {
                "EndpointName": "endpoint-name"
            }
        }
    ]
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, role ARN, and creation and latest modification times.

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in milliseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.

The `EndpointPerformances` nested dictionary contains the name of your existing endpoint on which the recommendation job was run (`EndpointName`) and the performance metrics for your endpoint (`MaxInvocations` and `ModelLatency`).
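
One way to use this response is to compare the benchmarked performance of your existing endpoint against the top recommendation. The following sketch uses a sample trimmed from the response above; the helper name is illustrative, not part of the SageMaker API.

```python
def compare_to_existing(response):
    """Summarize how the highest-throughput recommendation compares
    to the benchmarked existing endpoint."""
    current = response["EndpointPerformances"][0]["Metrics"]
    best = max(response["InferenceRecommendations"],
               key=lambda rec: rec["Metrics"]["MaxInvocations"])
    return {
        "current_max_invocations": current["MaxInvocations"],
        "recommended_instance": best["EndpointConfiguration"]["InstanceType"],
        "recommended_max_invocations": best["Metrics"]["MaxInvocations"],
    }

# Trimmed sample mirroring the response above
sample = {
    "InferenceRecommendations": [
        {"Metrics": {"MaxInvocations": 1645, "ModelLatency": 171},
         "EndpointConfiguration": {"InstanceType": "ml.g4dn.xlarge"}},
    ],
    "EndpointPerformances": [
        {"Metrics": {"MaxInvocations": 184, "ModelLatency": 1312},
         "EndpointConfiguration": {"EndpointName": "endpoint-name"}},
    ],
}

print(compare_to_existing(sample))
```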

# Stop your inference recommendation
<a name="instance-recommendation-stop"></a>

You might want to stop a job that is currently running if you began a job by mistake or no longer need to run the job. Stop your Inference Recommender inference recommendation jobs programmatically with the `StopInferenceRecommendationsJob` API or with Studio Classic.

------
#### [ AWS SDK for Python (Boto3) ]

Specify the name of the inference recommendation job for the `JobName` field:

```
sagemaker_client.stop_inference_recommendations_job(
                                    JobName='<INSERT>'
                                    )
```

------
#### [ AWS CLI ]

Specify the job name of the inference recommendation job for the `job-name` flag:

```
aws sagemaker stop-inference-recommendations-job --job-name <job-name>
```

------
#### [ Amazon SageMaker Studio Classic ]

Close the tab in which you initiated the inference recommendation to stop your Inference Recommender inference recommendation.

------
#### [ SageMaker AI console ]

To stop your instance recommendation job through the SageMaker AI console, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, select your instance recommendation job.

1. Choose **Stop job**.

1. In the dialog box that pops up, choose **Confirm**.

After you stop your job, its **Status** changes to **Stopping**.

------

# Compiled recommendations with Neo
<a name="inference-recommender-neo-compilation"></a>

In Inference Recommender, you can compile your model with Neo and get endpoint recommendations for your compiled model. [SageMaker Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html) is a service that can optimize your model for a target hardware platform (that is, a specific instance type or environment). Optimizing a model with Neo might improve the performance of your hosted model.

For Neo-supported frameworks and containers, Inference Recommender automatically suggests Neo-optimized recommendations. To be eligible for Neo compilation, your input must meet the following prerequisites:
+ You are using a SageMaker AI owned [DLC](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html) or XGBoost container.
+ You are using a framework version supported by Neo. For the framework versions supported by Neo, see [Cloud Instances](neo-supported-cloud.md#neo-supported-cloud-instances) in the SageMaker Neo documentation.
+ Neo requires that you provide a correct input data shape for your model. You can specify this data shape as the [`DataInputConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelInput.html#sagemaker-Type-ModelInput-DataInputConfig) in the [`InferenceSpecification`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-InferenceSpecification) when you create a model package. For information about the correct data shapes for each framework, see [Prepare Model for Compilation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-compilation-preparing-model.html) in the SageMaker Neo documentation.

  The following example shows how to specify the `DataInputConfig` field in the `InferenceSpecification`, where `data_input_configuration` is a variable that contains the data shape in dictionary format (for example, `{'input':[1,1024,1024,3]}`).

  ```
  "InferenceSpecification": {
          "Containers": [
              {
                  "Image": dlc_uri,
                  "Framework": framework.upper(),
                  "FrameworkVersion": framework_version,
                  "NearestModelName": model_name,
                  "ModelInput": {"DataInputConfig": data_input_configuration},
              }
          ],
          "SupportedContentTypes": input_mime_types,  # required, must be non-null
          "SupportedResponseMIMETypes": [],
          "SupportedRealtimeInferenceInstanceTypes": supported_realtime_inference_types,  # optional
      }
  ```
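
`DataInputConfig` is passed to the API as a string. Assuming you start from a Python dictionary like the one in the preceding example, one common approach (shown here as a sketch) is to serialize it with `json.dumps`:

```python
import json

# The data shape from the preceding example, serialized for the
# string-typed DataInputConfig field
data_input_configuration = json.dumps({"input": [1, 1024, 1024, 3]})
print(data_input_configuration)
```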

If these conditions are met in your request, then Inference Recommender runs scenarios for both compiled and uncompiled versions of your model, giving you multiple recommendation combinations to choose from. You can compare the configurations for compiled and uncompiled versions of the same inference recommendation and determine which one best suits your use case. The recommendations are ranked by cost per inference.

To get the Neo compilation recommendations, you don’t have to do any additional configuration besides making sure that your input meets the preceding requirements. Inference Recommender automatically runs Neo compilation on your model if your input meets the requirements, and you receive a response that includes Neo recommendations.

If you run into errors during your Neo compilation, see [Troubleshoot Neo Compilation Errors](neo-troubleshooting-compilation.md).

The following table is an example of a response you might get from an Inference Recommender job that includes recommendations for compiled models. If the `InferenceSpecificationName` field is `None`, then the recommendation is an uncompiled model. The last row, in which the value for the **InferenceSpecificationName** field is `neo-00011122-2333-4445-5566-677788899900`, is for a model compiled with Neo. The value in the field is the name of the Neo job used to compile and optimize your model.


| EndpointName | InstanceType | InitialInstanceCount | EnvironmentParameters | CostPerHour | CostPerInference | MaxInvocations | ModelLatency | InferenceSpecificationName | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sm-epc-example-000111222 | ml.c5.9xlarge | 1 | [] | 1.836 | 9.15E-07 | 33456 | 7 | None | 
| sm-epc-example-111222333 | ml.c5.2xlarge | 1 | [] | 0.408 | 2.11E-07 | 32211 | 21 | None | 
| sm-epc-example-222333444 | ml.c5.xlarge | 1 | [] | 0.204 | 1.86E-07 | 18276 | 92 | None | 
| sm-epc-example-333444555 | ml.c5.xlarge | 1 | [] | 0.204 | 1.60E-07 | 21286 | 42 | neo-00011122-2333-4445-5566-677788899900 | 

## Get started
<a name="inference-recommender-neo-compilation-get-started"></a>

The general steps for creating an Inference Recommender job that includes Neo-optimized recommendations are as follows:
+ Prepare your ML model for compilation. For more information, see [Prepare Model for Compilation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-compilation-preparing-model.html) in the Neo documentation.
+ Package your model in a model archive (`.tar.gz` file).
+ Create a sample payload archive.
+ Register your model in SageMaker Model Registry.
+ Create an Inference Recommender job.
+ View the results of the Inference Recommender job and choose a configuration.
+ Debug compilation failures, if any. For more information, see [Troubleshoot Neo Compilation Errors](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html).

For an example that demonstrates the previous workflow and how to get Neo-optimized recommendations using XGBoost, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/xgboost/xgboost-inference-recommender.ipynb). For an example that shows how to get Neo-optimized recommendations using TensorFlow, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/inference-recommender.ipynb).

# Recommendation results
<a name="inference-recommender-interpret-results"></a>

Each Inference Recommender job result includes `InstanceType`, `InitialInstanceCount`, and `EnvironmentParameters`, which are tuned environment variable parameters for your container to improve its latency and throughput. The results also include performance and cost metrics such as `MaxInvocations`, `ModelLatency`, `CostPerHour`, `CostPerInference`, `CpuUtilization`, and `MemoryUtilization`.

The following table provides descriptions of these metrics. The metrics can help you narrow down your search for the endpoint configuration that best suits your use case. For example, if your priority is overall price performance with an emphasis on throughput, then you should focus on `CostPerInference`. 


| Metric | Description | Use case | 
| --- | --- | --- | 
|  `ModelLatency`  |  The interval of time taken by a model to respond as viewed from SageMaker AI. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. Units: Milliseconds  | Latency sensitive workloads such as ad serving and medical diagnosis | 
|  `MaximumInvocations`  |  The maximum number of `InvokeEndpoint` requests sent to a model endpoint in a minute. Units: None  | Throughput-focused workloads such as video processing or batch inference | 
|  `CostPerHour`  |  The estimated cost per hour for your real-time endpoint. Units: US Dollars  | Cost sensitive workloads with no latency deadlines | 
|  `CostPerInference`  |  The estimated cost per inference call for your real-time endpoint. Units: US Dollars  | Maximize overall price performance with a focus on throughput | 
|  `CpuUtilization`  |  The expected CPU utilization at maximum invocations per minute for the endpoint instance. Units: Percent  | Understand instance health during benchmarking by having visibility into core CPU utilization of the instance | 
|  `MemoryUtilization`  |  The expected memory utilization at maximum invocations per minute for the endpoint instance. Units: Percent  | Understand instance health during benchmarking by having visibility into core memory utilization of the instance | 

In some cases you might want to explore other [SageMaker AI Endpoint Invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation) such as `CPUUtilization`. Every Inference Recommender job result includes the names of endpoints spun up during the load test. You can use CloudWatch to review the logs for these endpoints even after they’ve been deleted.
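
As a sketch of pulling those endpoint metrics yourself with Boto3, the helper below builds the parameters for CloudWatch's `get_metric_statistics` call. The function names and the one-hour window are illustrative assumptions, not part of the Inference Recommender API.

```python
def invocation_metric_query(endpoint_name, variant_name, start_time, end_time):
    """Build get_metric_statistics parameters for an endpoint's Invocations metric."""
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": start_time,
        "EndTime": end_time,
        "Period": 60,            # one datapoint per minute
        "Statistics": ["Sum"],
    }

def fetch_invocation_stats(endpoint_name, variant_name, region):
    """Query the last hour of invocation counts (requires boto3 and AWS credentials)."""
    import datetime
    import boto3
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=1)
    return cloudwatch.get_metric_statistics(
        **invocation_metric_query(endpoint_name, variant_name, start, end)
    )
```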

The following image is an example of CloudWatch metrics and charts you can review for a single endpoint from your recommendation result. This recommendation result is from a Default job. The way to interpret the scalar values from the recommendation results is that they are based on the time point when the Invocations graph first begins to level out. For example, the `ModelLatency` value reported is at the beginning of the plateau around `03:00:31`.

![\[Charts for CloudWatch metrics.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference-recommender-cw-metrics.png)


For full descriptions of the CloudWatch metrics used in the preceding charts, see [SageMaker AI Endpoint Invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation).

You can also see performance metrics like `ClientInvocations` and `NumberOfUsers` published by Inference Recommender in the `/aws/sagemaker/InferenceRecommendationsJobs` namespace. For a full list of metrics and descriptions published by Inference Recommender, see [SageMaker Inference Recommender jobs metrics](monitoring-cloudwatch.md#cloudwatch-metrics-inference-recommender).

See the [Amazon SageMaker Inference Recommender - CloudWatch Metrics](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/tensorflow-cloudwatch/tf-cloudwatch-inference-recommender.ipynb) Jupyter notebook in the [amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) Github repository for an example of how to use the AWS SDK for Python (Boto3) to explore CloudWatch metrics for your endpoints.

# Get autoscaling policy recommendations
<a name="inference-recommender-autoscaling"></a>

With Amazon SageMaker Inference Recommender, you can get recommendations for autoscaling policies for your SageMaker AI endpoint based on your anticipated traffic pattern. If you’ve already completed an inference recommendation job, you can provide the details of the job to get a recommendation for an autoscaling policy that you can apply to your endpoint.

Inference Recommender benchmarks different values for each metric to determine the ideal autoscaling configuration for your endpoint. The autoscaling recommendation returns a recommended autoscaling policy for each metric that was defined in your inference recommendation job. You can save the policies and apply them to your endpoint with the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API.

To get started, review the following prerequisites.

## Prerequisites
<a name="inference-recommender-autoscaling-prereqs"></a>

Before you begin, you must have completed a successful inference recommendation job. In the following section, you can provide either an inference recommendation ID or the name of a SageMaker AI endpoint that was benchmarked during an inference recommendation job.

To retrieve your recommendation job ID or endpoint name, you can either view the details of your inference recommendation job in the SageMaker AI console, or you can use the `RecommendationId` or `EndpointName` fields returned by the [DescribeInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeInferenceRecommendationsJob.html) API.
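For example, the following sketch extracts the `RecommendationId` and `EndpointName` pairs from a `DescribeInferenceRecommendationsJob` response with the AWS SDK for Python (Boto3); the job name is a placeholder:

```python
def recommendation_identifiers(describe_response):
    """Pull (RecommendationId, EndpointName) pairs out of a
    DescribeInferenceRecommendationsJob response."""
    pairs = []
    for rec in describe_response.get("InferenceRecommendations", []):
        pairs.append((
            rec.get("RecommendationId"),
            rec.get("EndpointConfiguration", {}).get("EndpointName"),
        ))
    return pairs

# Usage with boto3:
# import boto3
# sm = boto3.client("sagemaker", region_name="<region>")
# response = sm.describe_inference_recommendations_job(JobName="<job-name>")
# for rec_id, endpoint in recommendation_identifiers(response):
#     print(rec_id, endpoint)
```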

## Create an autoscaling configuration recommendation
<a name="inference-recommender-autoscaling-create"></a>

To create an autoscaling recommendation policy, you can use the AWS SDK for Python (Boto3).

The following example shows the fields for the [GetScalingConfigurationRecommendation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_GetScalingConfigurationRecommendation.html) API. Use the following fields when you call the API:
+ `InferenceRecommendationsJobName` – Enter the name of your inference recommendation job.
+ `RecommendationId` – Enter the ID of an inference recommendation from a recommendation job. This is optional if you’ve specified the `EndpointName` field.
+ `EndpointName` – Enter the name of an endpoint that was benchmarked during an inference recommendation job. This is optional if you’ve specified the `RecommendationId` field.
+ `TargetCpuUtilizationPerCore` – (Optional) Enter the percentage of utilization that you want an instance on your endpoint to reach before autoscaling. If you don’t specify this field, the default value is 50%.
+ `ScalingPolicyObjective` – (Optional) An object where you specify your anticipated traffic pattern.
  + `MinInvocationsPerMinute` – (Optional) The minimum number of expected requests to your endpoint per minute.
  + `MaxInvocationsPerMinute` – (Optional) The maximum number of expected requests to your endpoint per minute.

```
{
    "InferenceRecommendationsJobName": "string", // Required
    "RecommendationId": "string", // Optional, provide one of RecommendationId or EndpointName
    "EndpointName": "string", // Optional, provide one of RecommendationId or EndpointName
    "TargetCpuUtilizationPerCore": number, // Optional
    "ScalingPolicyObjective": { // Optional
        "MinInvocationsPerMinute": number,
        "MaxInvocationsPerMinute": number
    }
}
```
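A minimal AWS SDK for Python (Boto3) sketch of assembling this request follows, assuming you provide exactly one of `RecommendationId` or `EndpointName` as described above; the job and endpoint names in the usage comment are placeholders:

```python
def scaling_recommendation_request(job_name, recommendation_id=None,
                                   endpoint_name=None,
                                   target_cpu_utilization=50,
                                   min_invocations=None, max_invocations=None):
    """Assemble keyword arguments for GetScalingConfigurationRecommendation.
    Provide exactly one of recommendation_id or endpoint_name."""
    if (recommendation_id is None) == (endpoint_name is None):
        raise ValueError("Provide exactly one of recommendation_id or endpoint_name")
    request = {
        "InferenceRecommendationsJobName": job_name,
        "TargetCpuUtilizationPerCore": target_cpu_utilization,
    }
    if recommendation_id:
        request["RecommendationId"] = recommendation_id
    else:
        request["EndpointName"] = endpoint_name
    if min_invocations is not None or max_invocations is not None:
        objective = {}
        if min_invocations is not None:
            objective["MinInvocationsPerMinute"] = min_invocations
        if max_invocations is not None:
            objective["MaxInvocationsPerMinute"] = max_invocations
        request["ScalingPolicyObjective"] = objective
    return request

# Usage with boto3:
# import boto3
# sm = boto3.client("sagemaker", region_name="<region>")
# response = sm.get_scaling_configuration_recommendation(
#     **scaling_recommendation_request("<job-name>", endpoint_name="<endpoint-name>",
#                                      min_invocations=100, max_invocations=1000))
```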

After submitting your request, you’ll receive a response with autoscaling policies defined for each metric. See the following section for information about interpreting the response.

## Review your autoscaling configuration recommendation results
<a name="inference-recommender-autoscaling-review"></a>

The following example shows the response from the [GetScalingConfigurationRecommendation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_GetScalingConfigurationRecommendation.html) API:

```
{
    "InferenceRecommendationsJobName": "string", 
    "RecommendationId": "string", // One of RecommendationId or EndpointName is shown
    "EndpointName": "string", 
    "TargetUtilizationPercentage": Integer,
    "ScalingPolicyObjective": { 
        "MinInvocationsPerMinute": Integer, 
        "MaxInvocationsPerMinute": Integer
    },
    "Metric": {
        "ModelLatency": Integer,
        "InvocationsPerInstance": Integer
    },
    "DynamicScalingConfiguration": {
        "MinCapacity": number,
        "MaxCapacity": number, 
        "ScaleInCooldown": number,
        "ScaleOutCooldown": number,
        "ScalingPolicies": [
            {
                "TargetTracking": {
                    "MetricSpecification": {
                        "Predefined": {
                            "PredefinedMetricType": "string"
                         },
                        "Customized": {
                            "MetricName": "string",
                            "Namespace": "string",
                            "Statistic": "string"
                         }
                    },
                    "TargetValue": Double
                } 
            }
        ]
    }
}
```

The `InferenceRecommendationsJobName`, `RecommendationId` or `EndpointName`, `TargetCpuUtilizationPerCore`, and `ScalingPolicyObjective` fields are copied from your initial request.

The `Metric` object lists the metrics that were benchmarked in your inference recommendation job, along with a calculation of the values each metric would have if the instance utilization were the same as the `TargetCpuUtilizationPerCore` value. This is useful for anticipating the performance metrics on your endpoint when it scales in and out with the recommended autoscaling policy. For example, consider a case where your instance utilization was 50% in your inference recommendation job and your `InvocationsPerInstance` value was originally `4`. If you specify a `TargetCpuUtilizationPerCore` value of 100% in your autoscaling recommendation request, then the `InvocationsPerInstance` metric value returned in the response is `8`, because each instance is anticipated to handle twice as many invocations at twice the utilization.
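This proportional relationship between target utilization and the anticipated metric values can be sketched as follows (a minimal illustration, not part of the API):

```python
def anticipated_metric_value(benchmarked_value, benchmarked_utilization,
                             target_utilization):
    """Scale a per-instance metric from the utilization observed during
    benchmarking to the requested target utilization."""
    return benchmarked_value * (target_utilization / benchmarked_utilization)

# 4 invocations per instance benchmarked at 50% utilization scales to
# 8 anticipated invocations per instance at a 100% target.
print(anticipated_metric_value(4, 50, 100))  # → 8.0
```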

The `DynamicScalingConfiguration` object returns the values that you should specify for the [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html#autoscaling-PutScalingPolicy-request-TargetTrackingScalingPolicyConfiguration) when you call the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API. This includes the recommended minimum and maximum capacity values, the recommended scale in and scale out cooldown times, and the `ScalingPolicies` object, which contains the recommended `TargetValue` you should specify for each metric.
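As a sketch of that last step, the following shows one way to map a `ScalingPolicies` entry into a `PutScalingPolicy` call with the AWS SDK for Python (Boto3). The policy name is a hypothetical choice, and the endpoint and variant names in the usage comment are placeholders:

```python
def put_scaling_policy_kwargs(endpoint_name, variant_name, policy,
                              scale_in_cooldown=None, scale_out_cooldown=None):
    """Map one ScalingPolicies entry from the recommendation response into
    arguments for Application Auto Scaling's PutScalingPolicy API."""
    tracking = policy["TargetTracking"]
    config = {"TargetValue": tracking["TargetValue"]}
    metric_spec = tracking["MetricSpecification"]
    if "Predefined" in metric_spec:
        config["PredefinedMetricSpecification"] = {
            "PredefinedMetricType": metric_spec["Predefined"]["PredefinedMetricType"]
        }
    elif "Customized" in metric_spec:
        config["CustomizedMetricSpecification"] = metric_spec["Customized"]
    if scale_in_cooldown is not None:
        config["ScaleInCooldown"] = scale_in_cooldown
    if scale_out_cooldown is not None:
        config["ScaleOutCooldown"] = scale_out_cooldown
    return {
        "PolicyName": f"recommended-policy-{variant_name}",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": config,
    }

# Usage with boto3:
# import boto3
# aas = boto3.client("application-autoscaling", region_name="<region>")
# cfg = response["DynamicScalingConfiguration"]
# aas.register_scalable_target(
#     ServiceNamespace="sagemaker",
#     ResourceId="endpoint/<endpoint-name>/variant/<variant-name>",
#     ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#     MinCapacity=cfg["MinCapacity"], MaxCapacity=cfg["MaxCapacity"])
# for policy in cfg["ScalingPolicies"]:
#     aas.put_scaling_policy(**put_scaling_policy_kwargs(
#         "<endpoint-name>", "<variant-name>", policy,
#         cfg["ScaleInCooldown"], cfg["ScaleOutCooldown"]))
```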

# Run a custom load test
<a name="inference-recommender-load-test"></a>

Amazon SageMaker Inference Recommender load tests conduct extensive benchmarks based on production requirements for latency and throughput, custom traffic patterns, and either serverless endpoints or real-time instances (up to 10) that you select.

The following sections demonstrate how to create, describe, and stop a load test programmatically using the AWS SDK for Python (Boto3) and the AWS CLI, or interactively using Amazon SageMaker Studio Classic or the SageMaker AI console.

## Create a load test job
<a name="load-test-create"></a>

Create a load test programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using Studio Classic or the SageMaker AI console. As with Inference Recommender inference recommendations, specify a job name for your load test, an AWS IAM role ARN, an input configuration, and your model package ARN from when you registered your model with the model registry. Load tests require that you also specify a traffic pattern and stopping conditions.

------
#### [ AWS SDK for Python (Boto3) ]

Use the `CreateInferenceRecommendationsJob` API to create an Inference Recommender load test. Specify `Advanced` for the `JobType` field and provide: 
+ A job name for your load test (`JobName`). The job name must be unique within your AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ An endpoint configuration dictionary (`InputConfig`) where you specify the following:
  + For `TrafficPattern`, specify either the phases or stairs traffic pattern. With the phases traffic pattern, new users spawn every minute at the rate you specify. With the stairs traffic pattern, new users spawn at timed intervals (or *steps*) at a rate you specify. Choose one of the following:
    + For `TrafficType`, specify `PHASES`. Then, for the `Phases` array, specify the `InitialNumberOfUsers` (how many concurrent users to start with, with a minimum of 1 and a maximum of 3), `SpawnRate` (the number of users to be spawned in a minute for a specific phase of load testing, with a minimum of 0 and maximum of 3), and `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600).
    + For `TrafficType`, specify `STAIRS`. Then, for the `Stairs` array, specify the `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600), `NumberOfSteps` (how many intervals are used during the phase), and `UsersPerStep` (how many users are added during each interval). Note that the length of each step is the value of `DurationInSeconds / NumberOfSteps`. For example, if your `DurationInSeconds` is `600` and you specify `5` steps, then each step is 120 seconds long.
**Note**  
A user is defined as a system-generated actor that runs in a loop and invokes requests to an endpoint as part of Inference Recommender. For a typical XGBoost container running on an `ml.c5.large` instance, endpoints can reach 30,000 invocations per minute (500 tps) with just 15-20 users.
  + For `ResourceLimit`, specify `MaxNumberOfTests` (the maximum number of benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10) and `MaxParallelOfTests` (the maximum number of parallel benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10).
  + For `EndpointConfigurations`, you can specify one of the following:
    + The `InstanceType` field, where you specify the instance type on which you want to run your load tests.
    + The `ServerlessConfig`, in which you specify your ideal values for `MaxConcurrency` and `MemorySizeInMB` for a serverless endpoint. For more information, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
+ A stopping conditions dictionary (`StoppingConditions`), where if any of the conditions are met, the Inference Recommender job stops. For this example, specify the following fields in the dictionary:
  + For `MaxInvocations`, specify the maximum number of requests per minute expected for the endpoint, with a minimum of 1 and a maximum of 30,000.
  + For `ModelLatencyThresholds`, specify `Percentile` (the model latency percentile threshold) and `ValueInMilliseconds` (the model latency percentile value in milliseconds).
  + (Optional) For `FlatInvocations`, you can specify whether to continue the load test when the TPS (invocations per minute) rate flattens. A flattened TPS rate usually means that the endpoint has reached capacity. However, you might want to continue monitoring the endpoint under full capacity conditions. To continue the load test when this happens, specify this value as `Continue`. Otherwise, the default value is `Stop`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region="<INSERT>"
sagemaker_client=boto3.client('sagemaker', region_name=aws_region)
                
# Provide a name to your recommendation based on load testing
load_test_job_name="<INSERT>"

# Provide the name of the sagemaker instance type
instance_type="<INSERT>"

# Provide the IAM Role that gives SageMaker permission to access AWS services 
role_arn='arn:aws:iam::<account>:role/*'

# Provide your model package ARN that was created when you registered your 
# model with Model Registry
model_package_arn='arn:aws:sagemaker:<region>:<account>:model-package/*'

sagemaker_client.create_inference_recommendations_job(
                        JobName=load_test_job_name,
                        JobType="Advanced",
                        RoleArn=role_arn,
                        InputConfig={
                            'ModelPackageVersionArn': model_package_arn,
                            "JobDurationInSeconds": 7200,
                            'TrafficPattern' : {
                                # Replace PHASES with STAIRS to use the stairs traffic pattern
                                'TrafficType': 'PHASES',
                                'Phases': [
                                    {
                                        'InitialNumberOfUsers': 1,
                                        'SpawnRate': 1,
                                        'DurationInSeconds': 120
                                    },
                                    {
                                        'InitialNumberOfUsers': 1,
                                        'SpawnRate': 1,
                                        'DurationInSeconds': 120
                                    }
                                ]
                                # Uncomment this section and comment out the Phases object above to use the stairs traffic pattern
                                # 'Stairs' : {
                                #   'DurationInSeconds': 240,
                                #   'NumberOfSteps': 2,
                                #   'UsersPerStep': 2
                                # }
                            },
                            'ResourceLimit': {
                                        'MaxNumberOfTests': 10,
                                        'MaxParallelOfTests': 3
                                },
                            "EndpointConfigurations" : [{
                                        'InstanceType': 'ml.c5.xlarge'
                                    },
                                    {
                                        'InstanceType': 'ml.m5.xlarge'
                                    },
                                    {
                                        'InstanceType': 'ml.r5.xlarge'
                                    }]
                                    # Uncomment the ServerlessConfig and comment out the InstanceType field if you want recommendations for a serverless endpoint
                                    # "ServerlessConfig": {
                                    #     "MaxConcurrency": value, 
                                    #     "MemorySizeInMB": value 
                                    # }
                        },
                        StoppingConditions={
                            'MaxInvocations': 1000,
                            'ModelLatencyThresholds':[{
                                'Percentile': 'P95', 
                                'ValueInMilliseconds': 100
                            }],
                            # Change 'Stop' to 'Continue' to let the load test continue if invocations flatten 
                            'FlatInvocations': 'Stop'
                        }
                )
```

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to `CreateInferenceRecommendationsJob`.

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` API to create an Inference Recommender load test. Specify `Advanced` for the `JobType` field and provide: 
+ A job name for your load test (`job-name`). The job name must be unique within your AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ An endpoint configuration dictionary (`input-config`) where you specify the following:
  + For `TrafficPattern`, specify either the phases or stairs traffic pattern. With the phases traffic pattern, new users spawn every minute at the rate you specify. With the stairs traffic pattern, new users spawn at timed intervals (or *steps*) at a rate you specify. Choose one of the following:
    + For `TrafficType`, specify `PHASES`. Then, for the `Phases` array, specify the `InitialNumberOfUsers` (how many concurrent users to start with, with a minimum of 1 and a maximum of 3), `SpawnRate` (the number of users to be spawned in a minute for a specific phase of load testing, with a minimum of 0 and maximum of 3), and `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600).
    + For `TrafficType`, specify `STAIRS`. Then, for the `Stairs` array, specify the `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600), `NumberOfSteps` (how many intervals are used during the phase), and `UsersPerStep` (how many users are added during each interval). Note that the length of each step is the value of `DurationInSeconds / NumberOfSteps`. For example, if your `DurationInSeconds` is `600` and you specify `5` steps, then each step is 120 seconds long.
**Note**  
A user is defined as a system-generated actor that runs in a loop and invokes requests to an endpoint as part of Inference Recommender. For a typical XGBoost container running on an `ml.c5.large` instance, endpoints can reach 30,000 invocations per minute (500 tps) with just 15-20 users.
  + For `ResourceLimit`, specify `MaxNumberOfTests` (the maximum number of benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10) and `MaxParallelOfTests` (the maximum number of parallel benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10).
  + For `EndpointConfigurations`, you can specify one of the following:
    + The `InstanceType` field, where you specify the instance type on which you want to run your load tests.
    + The `ServerlessConfig`, in which you specify your ideal values for `MaxConcurrency` and `MemorySizeInMB` for a serverless endpoint.
+ A stopping conditions dictionary (`stopping-conditions`), where if any of the conditions are met, the Inference Recommender job stops. For this example, specify the following fields in the dictionary:
  + For `MaxInvocations`, specify the maximum number of requests per minute expected for the endpoint, with a minimum of 1 and a maximum of 30,000.
  + For `ModelLatencyThresholds`, specify `Percentile` (the model latency percentile threshold) and `ValueInMilliseconds` (the model latency percentile value in milliseconds).
  + (Optional) For `FlatInvocations`, you can specify whether to continue the load test when the TPS (invocations per minute) rate flattens. A flattened TPS rate usually means that the endpoint has reached capacity. However, you might want to continue monitoring the endpoint under full capacity conditions. To continue the load test when this happens, specify this value as `Continue`. Otherwise, the default value is `Stop`.

```
aws sagemaker create-inference-recommendations-job\
    --region <region>\
    --job-name <job-name>\
    --job-type ADVANCED\
    --role-arn arn:aws:iam::<account>:role/*\
    --input-config \"{
        \"ModelPackageVersionArn\": \"arn:aws:sagemaker:<region>:<account>:model-package/*\",
        \"JobDurationInSeconds\": 7200,                                
        \"TrafficPattern\" : {
                # Replace PHASES with STAIRS to use the stairs traffic pattern
                \"TrafficType\": \"PHASES\",
                \"Phases\": [
                    {
                        \"InitialNumberOfUsers\": 1,
                        \"SpawnRate\": 60,
                        \"DurationInSeconds\": 300
                    }
                ]
                # Uncomment this section and comment out the Phases object above to use the stairs traffic pattern
                # 'Stairs' : {
                #   'DurationInSeconds': 240,
                #   'NumberOfSteps': 2,
                #   'UsersPerStep': 2
                # }
            },
            \"ResourceLimit\": {
                \"MaxNumberOfTests\": 10,
                \"MaxParallelOfTests\": 3
            },
            \"EndpointConfigurations\" : [
                {
                    \"InstanceType\": \"ml.c5.xlarge\"
                },
                {
                    \"InstanceType\": \"ml.m5.xlarge\"
                },
                {
                    \"InstanceType\": \"ml.r5.xlarge\"
                }
                # Use the ServerlessConfig and leave out the InstanceType fields if you want recommendations for a serverless endpoint
                # \"ServerlessConfig\": {
                #     \"MaxConcurrency\": value, 
                #     \"MemorySizeInMB\": value 
                # }
            ]
        }\"
    --stopping-conditions \"{
        \"MaxInvocations\": 1000,
        \"ModelLatencyThresholds\":[
                {
                    \"Percentile\": \"P95\", 
                    \"ValueInMilliseconds\": 100
                }
        ],
        # Change 'Stop' to 'Continue' to let the load test continue if invocations flatten 
        \"FlatInvocations\": \"Stop\"
    }\"
```

------
#### [ Amazon SageMaker Studio Classic ]

Create a load test with Studio Classic.

1. In your Studio Classic application, choose the home icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)).

1. In the left sidebar of Studio Classic, choose **Deployments**.

1. Choose **Inference recommender** from the dropdown list.

1. Choose **Create inference recommender job**. A new tab titled **Create inference recommender job** opens.

1. Select the name of your model group from the dropdown **Model group** field. The list includes all the model groups registered with the model registry in your account, including models registered outside of Studio Classic.

1. Select a model version from the dropdown **Model version** field.

1. Choose **Continue**.

1. Provide a name for the job in the **Name** field.

1. (Optional) Provide a description of your job in the **Description** field.

1. Choose an IAM role that grants Inference Recommender permission to access AWS services. You can create a role and attach the `AmazonSageMakerFullAccess` IAM managed policy to accomplish this, or you can let Studio Classic create a role for you.

1. Choose **Stopping Conditions** to expand the available input fields. Provide a set of conditions for stopping a deployment recommendation. 

   1. Specify the maximum number of requests per minute expected for the endpoint in the **Max Invocations Per Minute** field.

   1. Specify the model latency threshold in microseconds in the **Model Latency Threshold** field. The **Model Latency Threshold** depicts the interval of time taken by a model to respond as viewed from Inference Recommender. The interval includes the local communication time taken to send the request and to fetch the response from the model container and the time taken to complete the inference in the container.

1. Choose **Traffic Pattern** to expand the available input fields.

   1. Set the initial number of virtual users by specifying an integer in the **Initial Number of Users** field.

   1. Provide an integer number for the **Spawn Rate** field. The spawn rate sets the number of users created per second.

   1. Set the duration for the phase in seconds by specifying an integer in the **Duration** field.

   1. (Optional) Add additional traffic patterns. To do so, choose **Add**.

1. Choose the **Additional** setting to reveal the **Max test duration** field. Specify, in seconds, the maximum time a test can take during a job. New jobs are not scheduled after the defined duration. This helps ensure jobs that are in progress are not stopped and that you only view completed jobs.

1. Choose **Continue**.

1. Choose **Selected Instances**.

1. In the **Instances for benchmarking** field, choose **Add instances to test**. Select up to 10 instances for Inference Recommender to use for load testing.

1. Choose **Additional settings**.

   1. Provide an integer that sets an upper limit on the number of tests a job can make for the **Max number of tests** field. Note that each endpoint configuration results in a new load test.

   1. Provide an integer for the **Max parallel tests** field. This setting defines an upper limit on the number of load tests that can run in parallel.

1. Choose **Submit**.

   The load test can take up to 2 hours.
**Warning**  
Do not close this tab. If you close this tab, you cancel the Inference Recommender load test job.

------
#### [ SageMaker AI console ]

Create a custom load test through the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose **Create job**.

1. For **Step 1: Model configuration**, do the following:

   1. For **Job type**, choose **Advanced recommender job**.

   1. If you’re using a model registered in the SageMaker AI model registry, then turn on the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model group** dropdown list, choose the model group in SageMaker AI model registry where your model is.

      1. For the **Model version** dropdown list, choose the desired version of your model.

   1. If you’re using a model that you’ve created in SageMaker AI, then turn off the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model name** field, enter the name of your SageMaker AI model.

   1. For **IAM role**, you can select an existing AWS IAM role that has the necessary permissions to create an instance recommendation job. Alternatively, if you don’t have an existing role, you can choose **Create a new role** to open the role creation pop-up, and SageMaker AI adds the necessary permissions to the new role that you create.

   1. For **S3 bucket for benchmarking payload**, enter the Amazon S3 path to your sample payload archive, which should contain sample payload files that Inference Recommender uses to benchmark your model on different instance types.

   1. For **Payload content type**, enter the MIME types of your sample payload data.

   1. For **Traffic pattern**, configure phases for the load test by doing the following:

      1. For **Initial number of users**, specify how many concurrent users you want to start with (with a minimum of 1 and a maximum of 3).

      1. For **Spawn rate**, specify the number of users to be spawned in a minute for the phase (with a minimum of 0 and a maximum of 3).

      1. For **Duration (seconds)**, specify how long the traffic phase should be in seconds (with a minimum of 120 and a maximum of 3600).

   1. (Optional) If you turned off the **Choose a model from the model registry** toggle and specified a SageMaker AI model, then for **Container configuration**, do the following:

      1. For the **Domain** dropdown list, select the machine learning domain of the model, such as computer vision, natural language processing, or machine learning.

      1. For the **Framework** dropdown list, select the framework of your container, such as TensorFlow or XGBoost.

      1. For **Framework version**, enter the framework version of your container image.

      1. For the **Nearest model name** dropdown list, select the pre-trained model that most closely matches your own.

      1. For the **Task** dropdown list, select the machine learning task that the model accomplishes, such as image classification or regression.

   1. (Optional) For **Model compilation using SageMaker Neo**, you can configure the recommendation job for a model that you’ve compiled using SageMaker Neo. For **Data input configuration**, enter the correct input data shape for your model in a format similar to `{'input':[1,1024,1024,3]}`.

   1. Choose **Next**.

1. For **Step 2: Instances and environment parameters**, do the following:

   1. For **Select instances for benchmarking**, select up to 8 instance types that you want to benchmark against.

   1. (Optional) For **Environment parameter ranges**, you can specify environment parameters that help optimize your model. Specify the parameters as **Key** and **Value** pairs.

   1. Choose **Next**.

1. For **Step 3: Job parameters**, do the following:

   1. (Optional) For the **Job name** field, enter a name for your instance recommendation job. When you create the job, SageMaker AI appends a timestamp to the end of this name.

   1. (Optional) For the **Job description** field, enter a description for the job.

   1. (Optional) For the **Encryption key** dropdown list, choose an AWS KMS key by name or enter its ARN to encrypt your data.

   1. (Optional) For **Max number of tests**, enter the maximum number of tests that you want to run during the recommendation job.

   1. (Optional) For **Max parallel tests**, enter the maximum number of parallel tests that you want to run during the recommendation job.

   1. For **Max test duration (s)**, enter the maximum number of seconds you want each test to run for.

   1. For **Max invocations per minute**, enter the maximum number of requests per minute the endpoint can reach before stopping the recommendation job. After reaching this limit, SageMaker AI ends the job.

   1. For **P99 Model latency threshold (ms)**, enter the model latency percentile in milliseconds.

   1. Choose **Next**.

1. For **Step 4: Review job**, review your configurations and then choose **Submit**.

------

## Get your load test results
<a name="load-test-describe"></a>

Once your load tests are done, you can programmatically collect metrics across all of them with the AWS SDK for Python (Boto3), the AWS CLI, Studio Classic, or the SageMaker AI console.
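Load test jobs can run for a while, so you may want to poll the job until it reaches a terminal status before collecting metrics. The following is a minimal sketch, assuming the terminal status values `COMPLETED`, `FAILED`, and `STOPPED` from the API reference:

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}

def wait_for_job(describe_fn, job_name, poll_seconds=60, max_polls=240):
    """Poll DescribeInferenceRecommendationsJob until the job reaches a
    terminal status. describe_fn is, for example,
    sagemaker_client.describe_inference_recommendations_job."""
    for _ in range(max_polls):
        status = describe_fn(JobName=job_name)["Status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name} did not finish after {max_polls} polls")
```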

------
#### [ AWS SDK for Python (Boto3) ]

Collect metrics with the `DescribeInferenceRecommendationsJob` API. Specify the job name of the load test for the `JobName` field:

```
load_test_response = sagemaker_client.describe_inference_recommendations_job(
    JobName=load_test_job_name
)
```

Print the response object.

```
print(load_test_response)
```

This returns a JSON response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Advanced', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 19, 38, 30, 957000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 19, 46, 31, 399000, tzinfo=tzlocal()), 
    'InputConfig': {
        'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
        'JobDurationInSeconds': 7200, 
        'TrafficPattern': {
            'TrafficType': 'PHASES'
            }, 
        'ResourceLimit': {
            'MaxNumberOfTests': 100, 
            'MaxParallelOfTests': 100
            }, 
        'EndpointConfigurations': [{
            'InstanceType': 'ml.c5d.xlarge'
            }]
        }, 
    'StoppingConditions': {
        'MaxInvocations': 1000, 
        'ModelLatencyThresholds': [{
            'Percentile': 'P95', 
            'ValueInMilliseconds': 100}
            ]}, 
    'InferenceRecommendations': [{
        'Metrics': {
            'CostPerHour': 0.6899999976158142, 
            'CostPerInference': 1.0332434612791985e-05, 
            'MaximumInvocations': 1113, 
            'ModelLatency': 100000
            }, 
        'EndpointConfiguration': {
            'EndpointName': 'endpoint-name', 
            'VariantName': 'variant-name', 
            'InstanceType': 'ml.c5d.xlarge', 
            'InitialInstanceCount': 3
            }, 
        'ModelConfiguration': {
            'Compiled': False, 
            'EnvironmentParameters': []
            }
        }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1199', 
            'date': 'Tue, 26 Oct 2021 19:57:42 GMT'
            }, 
        'RetryAttempts': 0}
    }
```

The first few lines provide information about the load test job itself, including the job name, description, type, ARN, status, and the creation and last modified times.

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `EndpointConfiguration` nested dictionary also contains the instance count (`InitialInstanceCount`) recommendation. This is the number of instances that you should provision in the endpoint to meet the `MaxInvocations` specified in the `StoppingConditions`. For example, if the `InstanceType` is `ml.m5.large` and the `InitialInstanceCount` is `2`, then you should provision 2 `ml.m5.large` instances for your endpoint so that it can handle the TPS specified in the `MaxInvocations` stopping condition.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) for your real-time endpoint, the maximum number of `InvokeEndpoint` requests sent to the endpoint, and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the model container and the time taken to complete the inference in the container.
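For example, the following sketch sorts the recommendations by estimated cost per hour. The helper name and the trimmed sample response are illustrative stand-ins for the full output shown above:

```python
def cheapest_recommendations(response):
    """Return the InferenceRecommendations list sorted by estimated cost per hour."""
    return sorted(
        response['InferenceRecommendations'],
        key=lambda rec: rec['Metrics']['CostPerHour'],
    )

# Hypothetical response shaped like the DescribeInferenceRecommendationsJob output above:
sample = {
    'InferenceRecommendations': [
        {'Metrics': {'CostPerHour': 1.15, 'ModelLatency': 80000},
         'EndpointConfiguration': {'InstanceType': 'ml.c5.2xlarge', 'InitialInstanceCount': 1}},
        {'Metrics': {'CostPerHour': 0.69, 'ModelLatency': 100000},
         'EndpointConfiguration': {'InstanceType': 'ml.c5d.xlarge', 'InitialInstanceCount': 3}},
    ]
}

for rec in cheapest_recommendations(sample):
    cfg, metrics = rec['EndpointConfiguration'], rec['Metrics']
    print(f"{cfg['InstanceType']} x{cfg['InitialInstanceCount']}: "
          f"${metrics['CostPerHour']}/hr, {metrics['ModelLatency']} us latency")
```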

The following example shows the `InferenceRecommendations` part of the response for a load test job that was configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the values you specified for `MaxConcurrency` and `MemorySizeInMB` when setting up the load test. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
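Because a recommendation job can run for up to the configured `JobDurationInSeconds`, you may want to poll `DescribeInferenceRecommendationsJob` until the job reaches a terminal status before reading the results. The following is a minimal sketch; the helper name and polling interval are illustrative:

```python
import time

def wait_for_recommendations_job(sagemaker_client, job_name, poll_seconds=60):
    """Poll DescribeInferenceRecommendationsJob until the job reaches a
    terminal status, then return the full response.

    sagemaker_client is a Boto3 SageMaker client, for example
    boto3.client('sagemaker', region_name='us-west-2').
    """
    terminal_statuses = {'COMPLETED', 'FAILED', 'STOPPED'}
    while True:
        response = sagemaker_client.describe_inference_recommendations_job(
            JobName=job_name
        )
        if response['Status'] in terminal_statuses:
            return response
        time.sleep(poll_seconds)
```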

------
#### [ AWS CLI ]

Collect metrics with the `describe-inference-recommendations-job` API. Specify the job name of the load test for the `job-name` flag:

```
aws sagemaker describe-inference-recommendations-job --job-name <job-name>
```

This returns a response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing Serverless Inference recommendations, see the example after this one).

```
{
    "JobName": "job-name",
    "JobDescription": "job-description",
    "JobType": "Advanced",
    "JobArn": "arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id",
    "Status": "COMPLETED",
    "CreationTime": "2021-10-26T19:38:30.957000+00:00",
    "LastModifiedTime": "2021-10-26T19:46:31.399000+00:00",
    "InputConfig": {
        "ModelPackageVersionArn": "arn:aws:sagemaker:region:account-id:model-package/resource-id",
        "JobDurationInSeconds": 7200,
        "TrafficPattern": {
            "TrafficType": "PHASES"
        },
        "ResourceLimit": {
            "MaxNumberOfTests": 100,
            "MaxParallelOfTests": 100
        },
        "EndpointConfigurations": [{
            "InstanceType": "ml.c5d.xlarge"
        }]
    },
    "StoppingConditions": {
        "MaxInvocations": 1000,
        "ModelLatencyThresholds": [{
            "Percentile": "P95",
            "ValueInMilliseconds": 100
        }]
    },
    "InferenceRecommendations": [{
        "Metrics": {
            "CostPerHour": 0.6899999976158142,
            "CostPerInference": 1.0332434612791985e-05,
            "MaximumInvocations": 1113,
            "ModelLatency": 100000
        },
        "EndpointConfiguration": {
            "EndpointName": "endpoint-name",
            "VariantName": "variant-name",
            "InstanceType": "ml.c5d.xlarge",
            "InitialInstanceCount": 3
        },
        "ModelConfiguration": {
            "Compiled": false,
            "EnvironmentParameters": []
        }
    }]
}
```

The first few lines provide information about the load test job itself, including the job name, description, type, ARN, status, and the creation and last modified times.

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) for your real-time endpoint, the maximum number of `InvokeEndpoint` requests sent to the endpoint, and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the model container and the time taken to complete the inference in the container.

The following example shows the `InferenceRecommendations` part of the response for a load test job that was configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the values you specified for `MaxConcurrency` and `MemorySizeInMB` when setting up the load test. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).

------
#### [ Amazon SageMaker Studio Classic ]

The recommendations populate in a new tab called **Inference recommendations** within Studio Classic. It can take up to 2 hours for the results to show up. This tab contains **Results** and **Details** columns.

The **Details** column provides information about the load test job, such as the name given to the load test job, when the job was created (**Creation time**), and more. It also contains **Settings** information, such as the maximum number of invocations that occurred per minute and information about the Amazon Resource Names used.

The **Results** column provides **Deployment goals** and **SageMaker AI recommendations** windows in which you can adjust the order in which results are displayed based on deployment importance. For each of the **Cost**, **Latency**, and **Throughput** goals, use the corresponding dropdown menu to set the goal's level of importance for your use case: **Lowest Importance**, **Low Importance**, **Moderate importance**, **High importance**, or **Highest importance**.

Based on your selections of importance for each goal, Inference Recommender displays its top recommendation in the **SageMaker recommendation** field on the right of the panel, along with the estimated cost per hour and inference request. It also provides information about the expected model latency, maximum number of invocations, and the number of instances.

In addition to the top recommendation displayed, you can also see the same information displayed for all instances that Inference Recommender tested in the **All runs** section.

------
#### [ SageMaker AI console ]

You can view your custom load test job results in the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose the name of your inference recommendation job.

On the details page for your job, you can view the **Inference recommendations**, which are the instance types SageMaker AI recommends for your model, as shown in the following screenshot.

![\[Screenshot of the inference recommendations list on the job details page in the SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-instant-recs.png)


In this section, you can compare the instance types by various factors such as **Model latency**, **Cost per hour**, **Cost per inference**, and **Invocations per minute**.

On this page, you can also view the configurations you specified for your job. In the **Monitor** section, you can view the Amazon CloudWatch metrics that were logged for each instance type. To learn more about interpreting these metrics, see [Interpret results](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html).

------

# Stop your load test
<a name="load-test-stop"></a>

You might want to stop a job that is currently running if you began a job by mistake or no longer need to run the job. Stop your load test jobs programmatically with the `StopInferenceRecommendationsJob` API, or through Studio Classic or the SageMaker AI console.

------
#### [ AWS SDK for Python (Boto3) ]

Specify the job name of the load test for the `JobName` field:

```
sagemaker_client.stop_inference_recommendations_job(
    JobName='<INSERT>'
)
```

------
#### [ AWS CLI ]

Specify the job name of the load test for the `job-name` flag:

```
aws sagemaker stop-inference-recommendations-job --job-name <job-name>
```

------
#### [ Amazon SageMaker Studio Classic ]

Close the tab where you initiated your custom load job to stop your Inference Recommender load test.

------
#### [ SageMaker AI console ]

To stop your load test job through the SageMaker AI console, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, select your load test job.

1. Choose **Stop job**.

1. In the dialog box that pops up, choose **Confirm**.

After you stop the job, its **Status** changes to **Stopping**.

------

# Troubleshoot Inference Recommender errors
<a name="inference-recommender-troubleshooting"></a>

This section contains information about how to understand and prevent common errors, the error messages they generate, and guidance on how to resolve these errors.

## How to troubleshoot
<a name="inference-recommender-troubleshooting-how-to"></a>

You can attempt to resolve your error by going through the following steps:
+ Check if you've covered all the prerequisites to use Inference Recommender. See the [Inference Recommender Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-prerequisites.html).
+ Check that you are able to deploy your model from Model Registry to an endpoint and that it can process your payloads without errors. See [Deploy a Model from the Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-deploy.html).
+ When you kick off an Inference Recommender job, you should see endpoints being created in the console and you can review the CloudWatch logs.

## Common errors
<a name="inference-recommender-troubleshooting-common"></a>

Review the following table for common Inference Recommender errors and their solutions.


| Error | Solution | 
| --- | --- | 
|  Specify `Domain` in the Model Package version 1. `Domain` is a mandatory parameter for the job.  |  Make sure you provide the ML domain or `OTHER` if unknown.  | 
|  Provided role ARN cannot be assumed and an `AWSSecurityTokenServiceException` error occurred.  |  Make sure the execution role provided has the necessary permissions specified in the prerequisites.  | 
|  Specify `Framework` in the Model Package version 1.`Framework` is a mandatory parameter for the job.  |  Make sure you provide the ML Framework or `OTHER` if unknown.  | 
|  Users at the end of prev phase is 0 while initial users of current phase is 1.  |  Users here refers to the virtual users or threads used to send requests. Each phase starts with A users and ends with B users such that B > A. Between sequential phases x1 and x2, we require that 0 <= abs(x2.A - x1.B) <= 3.  | 
|  Total traffic duration (across all phases) should not be more than the job duration.  |  The total duration of all your phases cannot exceed the job duration.  | 
|  Burstable instance type ml.t2.medium is not allowed.  |  Inference Recommender doesn't support load testing on t2 instance family because burstable instances do not provide consistent performance.  | 
|  ResourceLimitExceeded when calling CreateEndpoint operation  |  You have exceeded a SageMaker AI resource limit. For example, Inference Recommender might be unable to provision endpoints for benchmarking if the account has reached the endpoint quota. For more information about SageMaker AI limits and quotas, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).  | 
|  ModelError when calling InvokeEndpoint operation  |  A model error can happen for the following reasons: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-troubleshooting.html)  | 
|  PayloadError when calling InvokeEndpoint operation  |  A payload error can happen for following reasons: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-troubleshooting.html)  | 

## Check CloudWatch
<a name="inference-recommender-troubleshooting-check-cw"></a>

When you kick off an Inference Recommender job, you should see endpoints being created in the console. Select one of the endpoints and view the CloudWatch logs to monitor for any 4xx/5xx errors. If you have a successful Inference Recommender job, you will be able to see the endpoint names as part of the results. Even if your Inference Recommender job is unsuccessful, you can still check the CloudWatch logs for the deleted endpoints by following the steps below:

1. Open the Amazon CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Select the Region in which you created the Inference Recommender job from the **Region** dropdown list in the top right.

1. In the navigation pane of CloudWatch, choose **Logs**, and then select **Log groups**.

1. Search for the log group called `/aws/sagemaker/Endpoints/sm-epc-*`. Select the log group based on your most recent Inference Recommender job.

You can also troubleshoot your job by checking the Inference Recommender CloudWatch logs. The Inference Recommender logs, which are published in the `/aws/sagemaker/InferenceRecommendationsJobs` CloudWatch log group, give a high level view on the progress of the job in the `<jobName>/execution` log stream. You can find detailed information on each of the endpoint configurations being tested in the `<jobName>/Endpoint/<endpointName>` log stream.

**Overview of the Inference Recommender log streams**
+ `<jobName>/execution` contains overall job information such as endpoint configurations scheduled for benchmarking, compilation job skip reason, and validation failure reason.
+ `<jobName>/Endpoint/<endpointName>` contains information such as resource creation progress, test configuration, load test stop reason, and resource cleanup status.
+ `<jobName>/CompilationJob/<compilationJobName>` contains information on compilation jobs created by Inference Recommender, such as the compilation job configuration and compilation job status.

**Create an alarm for Inference Recommender error messages**

Inference Recommender outputs log statements for errors that might be helpful while troubleshooting. With a CloudWatch log group and a metric filter, you can look for terms and patterns in this log data as the data is sent to CloudWatch. Then, you can create a CloudWatch alarm based on the log group-metric filter. For more information, see [Create a CloudWatch alarm based on a log group-metric filter](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_alarm_log_group_metric_filter.html).
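As a sketch, you could create such a metric filter with the CloudWatch Logs `PutMetricFilter` API. The filter pattern, filter name, and metric namespace below are illustrative choices, not required values:

```python
def create_inference_recommender_error_filter(logs_client,
                                              metric_namespace='InferenceRecommender'):
    """Create a metric filter that counts ERROR lines in the Inference
    Recommender log group, so you can alarm on the resulting metric.

    logs_client is a Boto3 CloudWatch Logs client, for example
    boto3.client('logs', region_name='us-west-2'). The filter name and
    namespace are hypothetical placeholders.
    """
    return logs_client.put_metric_filter(
        logGroupName='/aws/sagemaker/InferenceRecommendationsJobs',
        filterName='InferenceRecommenderErrors',
        filterPattern='ERROR',
        metricTransformations=[{
            'metricName': 'ErrorCount',
            'metricNamespace': metric_namespace,
            'metricValue': '1',  # count 1 for every matching log line
        }],
    )
```

You can then create a CloudWatch alarm on the `ErrorCount` metric in that namespace.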

## Check benchmarks
<a name="inference-recommender-troubleshooting-check-benchmarks"></a>

When you kick off an Inference Recommender job, Inference Recommender creates several benchmarks to evaluate the performance of your model on different instance types. You can use the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API to view the details for all the benchmarks. If you have a failed benchmark, you can see the failure reasons as part of the results.

To use the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API, provide the following values:
+ For `JobName`, provide the name of the Inference Recommender job.
+ For `StepType`, use `BENCHMARK` to return details about the job's benchmarks.
+ For `Status`, use `FAILED` to return details about only the failed benchmarks. For a list of the other status types, see the `Status` field in the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<region>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide the job name for the SageMaker Inference Recommender job
job_name = '<job-name>'

# Filter for benchmarks
step_type = 'BENCHMARK' 

# Filter for benchmarks that have a FAILED status
status = 'FAILED'

response = sagemaker_client.list_inference_recommendations_job_steps(
    JobName = job_name,
    StepType = step_type,
    Status = status
)
```

You can print the response object to view the results. The preceding code example stored the response in a variable called `response`:

```
print(response)
```
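For example, the following sketch pulls each failed benchmark's instance type and failure reason out of the response. The field access assumes the `InferenceBenchmark` structure returned by `ListInferenceRecommendationsJobSteps`, and the sample response is hypothetical:

```python
def summarize_failed_benchmarks(response):
    """Return (instance type, failure reason) pairs from a
    ListInferenceRecommendationsJobSteps response filtered on failed benchmarks."""
    failures = []
    for step in response.get('Steps', []):
        benchmark = step.get('InferenceBenchmark', {})
        instance_type = benchmark.get('EndpointConfiguration', {}).get(
            'InstanceType', 'unknown')
        failures.append(
            (instance_type, benchmark.get('FailureReason', 'no reason reported')))
    return failures

# Hypothetical response for illustration:
sample_steps = {
    'Steps': [{
        'StepType': 'BENCHMARK',
        'JobName': 'job-name',
        'Status': 'FAILED',
        'InferenceBenchmark': {
            'EndpointConfiguration': {'InstanceType': 'ml.c5.xlarge'},
            'FailureReason': 'ResourceLimitExceeded',
        },
    }]
}

for instance_type, reason in summarize_failed_benchmarks(sample_steps):
    print(f"{instance_type}: {reason}")
```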

# Real-time inference
<a name="realtime-endpoints"></a>

 Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker AI hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling (see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md)). 

**Topics**
+ [Deploy models for real-time inference](realtime-endpoints-deploy-models.md)
+ [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md)
+ [Endpoints](realtime-endpoints-manage.md)
+ [Hosting options](realtime-endpoints-options.md)
+ [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md)
+ [Instance storage volumes](host-instance-storage.md)
+ [Validation of models in production](model-validation.md)
+ [Online explainability with SageMaker Clarify](clarify-online-explainability.md)
+ [Fine-tune models with adapter inference components](realtime-endpoints-adapt.md)

# Deploy models for real-time inference
<a name="realtime-endpoints-deploy-models"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

There are several options to deploy a model using SageMaker AI hosting services. You can interactively deploy a model with SageMaker Studio. Or, you can programmatically deploy a model using an AWS SDK, such as the SageMaker Python SDK or the SDK for Python (Boto3). You can also deploy by using the AWS CLI.

## Before you begin
<a name="deploy-prereqs"></a>

Before you deploy a SageMaker AI model, locate and make note of the following:
+ The AWS Region where your Amazon S3 bucket is located
+ The Amazon S3 URI path where the model artifacts are stored
+ The IAM role for SageMaker AI
+ The Docker Amazon ECR URI registry path for the custom image that contains the inference code, or the framework and version of a built-in Docker image that is supported by AWS

 For a list of AWS services available in each AWS Region, see [Region Maps and Edge Networks](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/). See [Creating IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) for information on how to create an IAM role.

**Important**  
The Amazon S3 bucket where the model artifacts are stored must be in the same AWS Region as the model that you are creating.

## Shared resource utilization with multiple models
<a name="deployed-shared-utilization"></a>

You can deploy one or more models to an endpoint with Amazon SageMaker AI. When multiple models share an endpoint, they jointly utilize the resources that are hosted there, such as the ML compute instances, CPUs, and accelerators. The most flexible way to deploy multiple models to an endpoint is to define each model as an *inference component*.

### Inference components
<a name="inference-components"></a>

An inference component is a SageMaker AI hosting object that you can use to deploy a model to an endpoint. In the inference component settings, you specify the model, the endpoint, and how the model utilizes the resources that the endpoint hosts. To specify the model, you can specify a SageMaker AI Model object, or you can directly specify the model artifacts and image.

In the settings, you can optimize resource utilization by tailoring how the required CPU cores, accelerators, and memory are allocated to the model. You can deploy multiple inference components to an endpoint, where each inference component contains one model and the resource utilization needs for that model. 

After you deploy an inference component, you can directly invoke the associated model when you use the InvokeEndpoint action in the SageMaker API.
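For example, the following sketch invokes one model on a shared endpoint by passing `InferenceComponentName` to `InvokeEndpoint`. The endpoint name, component name, and CSV content type are illustrative assumptions:

```python
def invoke_inference_component(runtime_client, endpoint_name,
                               component_name, payload):
    """Invoke one model on a multi-model endpoint by naming its
    inference component.

    runtime_client is a Boto3 'sagemaker-runtime' client, for example
    boto3.client('sagemaker-runtime', region_name='us-west-2'). The
    text/csv content type assumes a CSV-serving container.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=component_name,  # routes to one specific model
        ContentType='text/csv',
        Body=payload,
    )
    return response['Body'].read()
```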

Inference components provide the following benefits:

**Flexibility**  
The inference component decouples the details of hosting the model from the endpoint itself. This provides more flexibility and control over how models are hosted and served with an endpoint. You can host multiple models on the same infrastructure, and you can add or remove models from an endpoint as needed. You can update each model independently.

**Scalability**  
You can specify how many copies of each model to host, and you can set a minimum number of copies to ensure that the model loads in the quantity that you require to serve requests. You can scale any inference component copy down to zero, which makes room for another copy to scale up. 

SageMaker AI packages your models as inference components when you deploy them by using:
+ SageMaker Studio Classic.
+ The SageMaker Python SDK to deploy a Model object (where you set the endpoint type to `EndpointType.INFERENCE_COMPONENT_BASED`).
+ The AWS SDK for Python (Boto3) to define `InferenceComponent` objects that you deploy to an endpoint.
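As a sketch of the Boto3 path, the following defines an `InferenceComponent` on an existing endpoint. The variant name, CPU, memory, and copy-count values are illustrative placeholders, not recommendations:

```python
def create_inference_component(sagemaker_client, component_name,
                               endpoint_name, model_name):
    """Define an inference component that hosts one model on an endpoint.

    sagemaker_client is a Boto3 SageMaker client. The resource numbers
    (CPU cores, memory, copy count) and the 'AllTraffic' variant name are
    hypothetical values you would tailor to your model.
    """
    return sagemaker_client.create_inference_component(
        InferenceComponentName=component_name,
        EndpointName=endpoint_name,
        VariantName='AllTraffic',
        Specification={
            'ModelName': model_name,
            'ComputeResourceRequirements': {
                'NumberOfCpuCoresRequired': 1.0,
                'MinMemoryRequiredInMb': 1024,
            },
        },
        RuntimeConfig={'CopyCount': 1},  # minimum copies to keep loaded
    )
```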

## Deploy models with SageMaker Studio
<a name="deploy-models-studio"></a>

Complete the following steps to create and deploy your model interactively through SageMaker Studio. For more information about Studio, see the [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) documentation. For more walkthroughs of various deployment scenarios, see the blog [Package and deploy classical ML models and LLMs easily with Amazon SageMaker AI – Part 2](https://aws.amazon.com/blogs/machine-learning/package-and-deploy-classical-ml-and-llms-easily-with-amazon-sagemaker-part-2-interactive-user-experiences-in-sagemaker-studio/).

### Prepare your artifacts and permissions
<a name="studio-prereqs"></a>

Complete this section before creating a model in SageMaker Studio.

You have two options for bringing your artifacts and creating a model in Studio:

1. You can bring a pre-packaged `tar.gz` archive, which should include your model artifacts, any custom inference code, and any dependencies listed in a `requirements.txt` file.

1. SageMaker AI can package your artifacts for you. You only have to bring your raw model artifacts and any dependencies in a `requirements.txt` file, and SageMaker AI can provide default inference code for you (or you can override the default code with your own custom inference code). SageMaker AI supports this option for the following frameworks: PyTorch, XGBoost.

In addition to bringing your model, your AWS Identity and Access Management (IAM) role, and a Docker container (or desired framework and version for which SageMaker AI has a pre-built container), you must also grant permissions to create and deploy models through SageMaker AI Studio.

You should have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy attached to your IAM role so that you can access SageMaker AI and other relevant services. To see the prices of the instance types in Studio, you also must attach the [AWSPriceListServiceFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSPriceListServiceFullAccess.html) policy (or if you don’t want to attach the whole policy, more specifically, the `pricing:GetProducts` action).

If you choose to upload your model artifacts when creating a model (or upload a sample payload file for inference recommendations), then you must create an Amazon S3 bucket. The bucket name must be prefixed by the word `sagemaker`. Alternate capitalizations are also acceptable: `Sagemaker` or `SageMaker`.

We recommend that you use the bucket naming convention `sagemaker-{Region}-{accountID}`. This bucket is used to store the artifacts that you upload.
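The naming convention is simple string composition. As a minimal sketch (the Region and account ID below are placeholder values), you can build the name like this:

```python
# Build the recommended bucket name from the Region and account ID.
# Placeholder values -- replace them with your own Region and account ID.
region = "us-west-2"
account_id = "111122223333"

bucket_name = f"sagemaker-{region}-{account_id}"
print(bucket_name)  # sagemaker-us-west-2-111122223333
```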

After creating the bucket, attach the following CORS (cross-origin resource sharing) policy to the bucket:

```
[
    {
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["ETag"],
        "AllowedMethods": ["PUT", "POST"],
        "AllowedOrigins": ["https://*.sagemaker.aws"]
    }
]
```

You can attach a CORS policy to an Amazon S3 bucket by using any of the following methods:
+ Through the [Edit cross-origin resource sharing (CORS)](https://s3.console.aws.amazon.com/s3/bucket/bucket-name/property/cors/edit) page in the Amazon S3 console
+ Using the Amazon S3 API [PutBucketCors](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketCors.html)
+ Using the put-bucket-cors AWS CLI command:

  ```
  aws s3api put-bucket-cors --bucket="..." --cors-configuration="..."
  ```
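If you prefer the SDK for Python (Boto3) over the CLI, the same policy can be applied with the `put_bucket_cors` method. This is a sketch: the bucket name is a placeholder, and the call itself requires AWS credentials with permission for `s3:PutBucketCORS`.

```python
# The CORS rules from the policy above, expressed as a Boto3 configuration.
cors_configuration = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "ExposeHeaders": ["ETag"],
            "AllowedMethods": ["PUT", "POST"],
            "AllowedOrigins": ["https://*.sagemaker.aws"],
        }
    ]
}

def apply_cors(bucket_name):
    """Attach the CORS configuration to the given bucket."""
    import boto3  # requires the boto3 package and configured credentials
    s3_client = boto3.client("s3")
    s3_client.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=cors_configuration,
    )
```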

### Create a deployable model
<a name="studio-create-model"></a>

In this step, you create a deployable version of your model in SageMaker AI by providing your artifacts along with additional specifications, such as your desired container and framework, any custom inference code, and network settings.

Create a deployable model in SageMaker Studio by doing the following:

1. Open the SageMaker Studio application.

1. In the left navigation pane, choose **Models**.

1. Choose the **Deployable models** tab.

1. On the **Deployable models** page, choose **Create**.

1. On the **Create deployable model** page, for the **Model name** field, enter a name for the model.

There are several more sections for you to fill out on the **Create deployable model** page.

The **Container definition** section looks like the following screenshot:

![\[Screenshot of the Container definition section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-container-definition.png)


**For the **Container definition** section, do the following:**

1. For **Container type**, select **Pre-built container** if you'd like to use a SageMaker AI managed container, or select **Bring your own container** if you have your own container.

1. If you selected **Pre-built container**, select the **Container framework**, **Framework version**, and **Hardware type** that you'd like to use.

1. If you selected **Bring your own container**, enter an Amazon ECR path for **ECR path to container image**.

Then, fill out the **Artifacts** section, which looks like the following screenshot:

![\[Screenshot of the Artifacts section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-artifacts-section.png)


**For the **Artifacts** section, do the following:**

1. If you're using one of the frameworks that SageMaker AI supports for packaging model artifacts (PyTorch or XGBoost), then for **Artifacts**, you can choose the **Upload artifacts** option. With this option, you specify your raw model artifacts, any custom inference code you have, and your `requirements.txt` file, and SageMaker AI handles packaging the archive for you. Do the following:

   1. For **Artifacts**, select **Upload artifacts** to continue providing your files. Otherwise, if you already have a `tar.gz` archive that contains your model files, inference code, and `requirements.txt` file, then select **Input S3 URI to pre-packaged artifacts**.

   1. If you chose to upload your artifacts, then for **S3 bucket**, enter the Amazon S3 path to a bucket where you'd like SageMaker AI to store your artifacts after packaging them for you. Then, complete the following steps.

   1. For **Upload model artifacts**, upload your model files.

   1. For **Inference code**, select **Use default inference code** if you'd like to use default code that SageMaker AI provides for serving inference. Otherwise, select **Upload customized inference code** to use your own inference code.

   1. For **Upload requirements.txt**, upload a text file that lists any dependencies that you want to install at runtime.

1. If you're not using a framework that SageMaker AI supports for packaging model artifacts, then Studio shows you the **Pre-packaged artifacts** option, and you must provide all of your artifacts already packaged as a `tar.gz` archive. Do the following:

   1. For **Pre-packaged artifacts**, select **Input S3 URI for pre-packaged model artifacts** if you have your `tar.gz` archive already uploaded to Amazon S3. Select **Upload pre-packaged model artifacts** if you want to directly upload your archive to SageMaker AI.

   1. If you selected **Input S3 URI for pre-packaged model artifacts**, enter the Amazon S3 path to your archive for **S3 URI**. Otherwise, select and upload the archive from your local machine.
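For reference, a `requirements.txt` file is a plain-text list of pip-installable dependencies, one per line, optionally pinned to versions. The packages below are illustrative only; list whatever your inference code imports:

```
numpy==1.26.4
pandas==2.1.4
scikit-learn==1.3.2
```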

The next section is **Security**, which looks like the following screenshot:

![\[Screenshot of the Security section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-security-section.png)


**For the **Security** section, do the following:**

1. For **IAM role**, enter the ARN for an IAM role.

1. (Optional) For **Virtual Private Cloud (VPC)**, you can select an Amazon VPC for storing your model configuration and artifacts.

1. (Optional) Turn on the **Network isolation** toggle if you want to restrict your container's internet access.

Finally, you can optionally fill out the **Advanced options** section, which looks like the following screenshot:

![\[Screenshot of the Advanced options section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-advanced-options.png)


**(Optional) For the **Advanced options** section, do the following:**

1. Turn on the **Customized instance recommendations** toggle if you want to run an Amazon SageMaker Inference Recommender job on your model after its creation. Inference Recommender is a feature that provides you with recommended instance types for optimizing inference performance and cost. You can view these instance recommendations when preparing to deploy your model.

1. For **Add environment variables**, enter any environment variables for your container as key-value pairs.

1. For **Tags**, enter any tags as key-value pairs.

1. After finishing your model and container configuration, choose **Create deployable model**.

You should now have a model in SageMaker Studio that is ready for deployment.

### Deploy your model
<a name="studio-deploy"></a>

Finally, you deploy the model you configured in the previous step to an HTTPS endpoint. You can deploy either a single model or multiple models to the endpoint.

**Model and endpoint compatibility**  
Before you can deploy a model to an endpoint, the model and endpoint must be compatible by having the same values for the following settings:
+ The IAM role
+ The Amazon VPC, including its subnets and security groups
+ The network isolation setting (enabled or disabled)

Studio prevents you from deploying models to incompatible endpoints in the following ways:
+ If you attempt to deploy a model to a new endpoint, SageMaker AI configures the endpoint with initial settings that are compatible. If you break the compatibility by changing these settings, Studio shows an alert and prevents your deployment.
+ If you attempt to deploy to an existing endpoint, and that endpoint is incompatible, Studio shows an alert and prevents your deployment.
+ If you attempt to add multiple models to a deployment, Studio prevents you from deploying models that are incompatible with each other.

When Studio shows the alert about model and endpoint incompatibility, you can choose **View details** in the alert to see which settings are incompatible.

One way to deploy a model is by doing the following in Studio:

1. Open the SageMaker Studio application.

1. In the left navigation pane, choose **Models**.

1. On the **Models** page, select one or more models from the list of SageMaker AI models.

1. Choose **Deploy**.

1. For **Endpoint name**, open the dropdown menu. You can either select an existing endpoint or you can create a new endpoint to which you deploy the model.

1. For **Instance type**, select the instance type that you want to use for the endpoint. If you previously ran an Inference Recommender job for the model, your recommended instance types appear in the list under the title **Recommended**. Otherwise, you'll see a few **Prospective instances** that might be suitable for your model.
**Instance type compatibility for JumpStart**  
If you're deploying a JumpStart model, Studio only shows instance types that the model supports.

1. For **Initial instance count**, enter the initial number of instances that you'd like to provision for your endpoint.

1. For **Maximum instance count**, specify the maximum number of instances that the endpoint can provision when it scales up to accommodate an increase in traffic.

1. If the model you're deploying is one of the most used JumpStart LLMs from the model hub, then the **Alternate configurations** option appears after the instance type and instance count fields.

   For the most popular JumpStart LLMs, AWS has pre-benchmarked instance types to optimize for either cost or performance. This data can help you decide which instance type to use for deploying your LLM. Choose **Alternate configurations** to open a dialog box that contains the pre-benchmarked data. The dialog box looks like the following screenshot:  
![\[Screenshot of the Alternate configurations box\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-jumpstart-alternate-configurations.png)

   In the **Alternate configurations** box, do the following:

   1. Select an instance type. You can choose **Cost per hour** or **Best performance** to see instance types that optimize either cost or performance for the specified model. You can also choose **Other supported instances** to see a list of other instance types that are compatible with the JumpStart model. Note that selecting an instance type here overrides any previous instance selection specified in Step 6.

   1. (Optional) Turn on the **Customize the selected configuration** toggle to specify **Max total tokens** (the maximum number of tokens that you want to allow, which is the sum of your input tokens and the model's generated output), **Max input token length** (the maximum number of tokens you want to allow for the input of each request), and **Max concurrent requests** (the maximum number of requests that the model can process at a time).

   1. Choose **Select** to confirm your instance type and configuration settings.

1. The **Model** field should already be populated with the name of the model or models that you're deploying. You can choose **Add model** to add more models to the deployment. For each model that you add, fill out the following fields:

   1. For **Number of CPU cores**, enter the CPU cores that you'd like to dedicate for the model's usage.

   1. For **Min number of copies**, enter the minimum number of model copies that you want to have hosted on the endpoint at any given time.

   1. For **Min CPU memory (MB)**, enter the minimum amount of memory (in MB) that the model requires.

   1. For **Max CPU memory (MB)**, enter the maximum amount of memory (in MB) that you'd like to allow the model to use.

1. (Optional) For the **Advanced options**, do the following:

   1. For **IAM role**, use either the default SageMaker AI IAM execution role, or specify your own role that has the permissions you need. Note that this IAM role must be the same as the role that you specified when creating the deployable model.

   1. For **Virtual Private Cloud (VPC)**, you can specify a VPC in which you want to host your endpoint.

   1. For **Encryption KMS key**, select an AWS KMS key to encrypt data on the storage volume attached to the ML compute instance that hosts the endpoint.

   1. Turn on the **Enable network isolation** toggle to restrict your container's internet access.

   1. For **Timeout configuration**, enter values for the **Model data download timeout (seconds)** and **Container startup health check timeout (seconds)** fields. These values determine the maximum amount of time that SageMaker AI allows for downloading the model to the container and starting up the container, respectively.

   1. For **Tags**, enter any tags as key-value pairs.
**Note**  
SageMaker AI configures the IAM role, VPC, and network isolation settings with initial values that are compatible with the model that you're deploying. If you break the compatibility by changing these settings, Studio shows an alert and prevents your deployment.

After configuring your options, the page should look like the following screenshot.

![\[Screenshot of the Deploy model page in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-deploy-realtime-model-2.png)


After configuring your deployment, choose **Deploy** to create the endpoint and deploy your model.

## Deploy models with the Python SDKs
<a name="deploy-models-python"></a>

Using the SageMaker Python SDK, you can build your model in two ways. The first is to create a model object from the `Model` class: you specify the model package or inference code (depending on your model server), scripts to handle serialization and deserialization of data between the client and server, and any dependencies to be uploaded to Amazon S3. The second is to use the `ModelBuilder` class: you provide model artifacts or inference code, and `ModelBuilder` automatically captures your dependencies, infers the needed serialization and deserialization functions, and packages everything to create your `Model` object. For more information about `ModelBuilder`, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md).

The following section describes both methods to create your model and deploy your model object.

### Set up
<a name="python-setup"></a>

The following examples prepare for the model deployment process. They import the necessary libraries and define the S3 URL that locates the model artifacts.

------
#### [ SageMaker Python SDK ]

**Example import statements**  
The following example imports modules from the SageMaker Python SDK, the SDK for Python (Boto3), and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
from datetime import datetime
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
from sagemaker.predictor import Predictor
from sagemaker.enums import EndpointType
from sagemaker.model import Model
from sagemaker.session import Session
```

------
#### [ boto3 inference components ]

**Example import statements**  
The following example imports modules from the SDK for Python (Boto3) and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
import botocore
import sys
import time

# Create a low-level SageMaker AI service client, which the examples
# that follow use as sagemaker_client
sagemaker_client = boto3.client("sagemaker")
```

------
#### [ boto3 models (without inference components) ]

**Example import statements**  
The following example imports modules from the SDK for Python (Boto3) and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
import botocore
import datetime
from time import gmtime, strftime

# Create a low-level SageMaker AI service client, which the examples
# that follow use as sagemaker_client
sagemaker_client = boto3.client("sagemaker")
```

------

**Example model artifact URL**  
The following code builds an example Amazon S3 URL. The URL locates the model artifacts for a pre-trained model in an Amazon S3 bucket.  

```
# Create a variable with the model S3 URL

# The name of your S3 bucket:
s3_bucket = "amzn-s3-demo-bucket"
# The directory within your S3 bucket your model is stored in:
bucket_prefix = "sagemaker/model/path"
# The file name of your model artifact:
model_filename = "my-model-artifact.tar.gz"
# Relative S3 path:
model_s3_key = f"{bucket_prefix}/{model_filename}"
# Combine the bucket name and the relative S3 path to create the S3 model URL:
model_url = f"s3://{s3_bucket}/{model_s3_key}"
```
The full Amazon S3 URL is stored in the variable `model_url`, which is used in the examples that follow. 

### Overview
<a name="python-overview"></a>

There are multiple ways that you can deploy models with the SageMaker Python SDK or the SDK for Python (Boto3). The following sections summarize the steps that you complete for several possible approaches. These steps are demonstrated by the examples that follow.

------
#### [ SageMaker Python SDK ]

Using the SageMaker Python SDK, you can build your model in either of the following ways:
+ **Create a model object from the `Model` class** – You must specify the model package or inference code (depending on your model server), scripts to handle serialization and deserialization of data between the client and server, and any dependencies to be uploaded to Amazon S3 for consumption. 
+ **Create a model object from the `ModelBuilder` class** – You provide model artifacts or inference code, and `ModelBuilder` automatically captures your dependencies, infers the needed serialization and deserialization functions, and packages your dependencies to create your `Model` object.

  For more information about `ModelBuilder`, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md). You can also see the blog [Package and deploy classical ML models and LLMs easily with SageMaker AI – Part 1](https://aws.amazon.com/blogs/machine-learning/package-and-deploy-classical-ml-and-llms-easily-with-amazon-sagemaker-part-1-pysdk-improvements/) for more information.

The examples that follow describe both methods to create your model and deploy your model object. To deploy a model in these ways, you complete the following steps:

1. Define the endpoint resources to allocate to the model with a `ResourceRequirements` object.

1. Create a model object from the `Model` or `ModelBuilder` classes. The `ResourceRequirements` object is specified in the model settings.

1. Deploy the model to an endpoint by using the `deploy` method of the `Model` object.

------
#### [ boto3 inference components ]

The examples that follow demonstrate how to assign a model to an inference component and then deploy the inference component to an endpoint. To deploy a model in this way, you complete the following steps:

1. (Optional) Create a SageMaker AI model object by using the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) method.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#create-endpoint-config) method.

1. Create your endpoint by using the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method, and in your request, provide the endpoint configuration that you created.

1. Create an inference component by using the `create_inference_component` method. In the settings, you specify a model by doing either of the following:
   + Specifying a SageMaker AI model object
   + Specifying the model image URI and S3 URL

   You also allocate endpoint resources to the model. By creating the inference component, you deploy the model to the endpoint. You can deploy multiple models to an endpoint by creating multiple inference components — one for each model.

------
#### [ boto3 models (without inference components) ]

The examples that follow demonstrate how to create a model object and then deploy the model to an endpoint. To deploy a model in this way, you complete the following steps:

1. Create a SageMaker AI model by using the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) method.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#create-endpoint-config) method. In the endpoint configuration, you assign the model object to a production variant.

1. Create your endpoint by using the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method. In your request, provide the endpoint configuration that you created.

   When you create the endpoint, SageMaker AI provisions the endpoint resources, and it deploys the model to the endpoint.

------

### Configure
<a name="python-configure"></a>

The following examples configure the resources that you require to deploy a model to an endpoint.

------
#### [ SageMaker Python SDK ]

The following example assigns endpoint resources to a model with a `ResourceRequirements` object. These resources include CPU cores, accelerators, and memory. Then, the example creates a model object from the `Model` class. Alternatively, you can create a model object by instantiating the [ModelBuilder](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) class and running `build`; this method is also shown in the example. `ModelBuilder` provides a unified interface for model packaging, and in this instance, it prepares a model for a large model deployment. The example uses `ModelBuilder` to construct a Hugging Face model (you can also pass a JumpStart model). After you build the model, you can specify resource requirements in the model object. In the next step, you use this object to deploy the model to an endpoint.

```
resources = ResourceRequirements(
    requests = {
        "num_cpus": 2,  # Number of CPU cores required
        "num_accelerators": 1,  # Number of accelerators required
        "memory": 8192,  # Minimum memory required in MB (required)
        "copies": 1,
    },
    limits = {},
)

now = datetime.now()
dt_string = now.strftime("%d-%m-%Y-%H-%M-%S")
model_name = "my-sm-model-" + dt_string

# Build your model with the Model class
model = Model(
    name = model_name,
    image_uri = "image-uri",
    model_data = model_url,
    role = "arn:aws:iam::111122223333:role/service-role/role-name",
    resources = resources,
    predictor_cls = Predictor,
)

# Alternate mechanism using ModelBuilder.
# Uncomment the following section to use ModelBuilder:
#
# from sagemaker.serve.builder.model_builder import ModelBuilder
# from sagemaker.serve.builder.schema_builder import SchemaBuilder
# from sagemaker.utils import unique_name_from_base
#
# model_builder = ModelBuilder(
#     model="<HuggingFace-ID>",  # For example, "meta-llama/Llama-2-7b-hf"
#     schema_builder=SchemaBuilder(sample_input, sample_output),
#     env_vars={"HUGGING_FACE_HUB_TOKEN": "<HuggingFace_token>"},
# )
#
# # Build your Model object
# model = model_builder.build()
#
# # Create a unique name from the string 'mb-inference-component'
# model.model_name = unique_name_from_base("mb-inference-component")
#
# # Assign resources to your model
# model.resources = resources
```

------
#### [ boto3 inference components ]

The following example configures an endpoint with the `create_endpoint_config` method. You assign this configuration to an endpoint when you create it. In the configuration, you define one or more production variants. For each variant, you can choose the instance type that you want Amazon SageMaker AI to provision, and you can enable managed instance scaling.

```
endpoint_config_name = "endpoint-config-name"
endpoint_name = "endpoint-name"
inference_component_name = "inference-component-name"
variant_name = "variant-name"

sagemaker_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = "arn:aws:iam::111122223333:role/service-role/role-name",
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,
            },
        }
    ],
)
```

------
#### [ boto3 models (without inference components) ]

**Example model definition**  
The following example defines a SageMaker AI model with the `create_model` method in the AWS SDK for Python (Boto3).  

```
model_name = "model-name"

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = "arn:aws:iam::111122223333:role/service-role/role-name",
    PrimaryContainer = {
        "Image": "image-uri",
        "ModelDataUrl": model_url,
    }
)
```
This example specifies the following:  
+ `ModelName`: A name for your model (in this example it is stored as a string variable called `model_name`).
+ `ExecutionRoleArn`: The Amazon Resource Name (ARN) of the IAM role that Amazon SageMaker AI can assume to access model artifacts and Docker images for deployment on ML compute instances or for batch transform jobs.
+ `PrimaryContainer`: The location of the primary Docker image containing inference code, associated artifacts, and custom environment maps that the inference code uses when the model is deployed for predictions.

**Example endpoint configuration**  
The following example configures an endpoint with the `create_endpoint_config` method. Amazon SageMaker AI uses this configuration to deploy models. In the configuration, you identify one or more models, created with the `create_model` method, to deploy, and you specify the resources that you want Amazon SageMaker AI to provision.  

```
endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName = "endpoint-config-name", 
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint:
    ProductionVariants = [
        {
            "VariantName": "variant-name", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ]
)
```
This example specifies the following keys for the `ProductionVariants` field:  
+ `VariantName`: The name of the production variant.
+ `ModelName`: The name of the model that you want to host. This is the name that you specified when creating the model.
+ `InstanceType`: The compute instance type. See the `InstanceType` field in [ProductionVariant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html) and [SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/) for a list of supported compute instance types and pricing for each instance type.

------

### Deploy
<a name="python-deploy"></a>

The following examples deploy a model to an endpoint.

------
#### [ SageMaker Python SDK ]

The following example deploys the model to a real-time, HTTPS endpoint with the `deploy` method of the model object. If you specify a value for the `resources` argument for both model creation and deployment, the resources you specify for deployment take precedence.

```
predictor = model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.p4d.24xlarge", 
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    resources = resources,
)
```

For the `instance_type` field, the example specifies the name of the Amazon EC2 instance type for the model. For the `initial_instance_count` field, it specifies the initial number of instances to run the endpoint on.

The following code sample demonstrates another case where you deploy a model to an endpoint and then deploy another model to the same endpoint. In this case, you must supply the same endpoint name to the `deploy` methods of both models.

```
# Deploy the model to an inference-component-based endpoint
falcon_predictor = falcon_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.p4d.24xlarge", 
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    endpoint_name = "<endpoint_name>",
    resources = resources,
)

# Deploy another model to the same inference-component-based endpoint
llama2_predictor = llama2_model.deploy( # resources already set inside llama2_model
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    endpoint_name = "<endpoint_name>"  # same endpoint name as for falcon model
)
```

------
#### [ boto3 inference components ]

Once you have an endpoint configuration, use the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account.

The following example creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources.

```
sagemaker_client.create_endpoint(
    EndpointName = endpoint_name,
    EndpointConfigName = endpoint_config_name,
)
```

After you've created an endpoint, you can deploy one or more models to it by creating inference components. The following example creates one with the `create_inference_component` method.

```
sagemaker_client.create_inference_component(
    InferenceComponentName = inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification = {
        "Container": {
            "Image": "image-uri",
            "ArtifactUrl": model_url,
        },
        "ComputeResourceRequirements": {
            "NumberOfCpuCoresRequired": 1, 
            "MinMemoryRequiredInMb": 1024
        }
    },
    RuntimeConfig = {"CopyCount": 2}
)
```
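After the inference component reaches the `InService` status, you send requests through the SageMaker AI Runtime client, naming the inference component in the request. The sketch below assumes a container that serves JSON; the endpoint and component names are placeholders:

```python
import json

def build_invoke_args(endpoint_name, inference_component_name, payload):
    """Build the request arguments for invoking a specific inference component."""
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": inference_component_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }

def invoke_component(endpoint_name, inference_component_name, payload):
    """Invoke the inference component (requires AWS credentials)."""
    import boto3  # requires the boto3 package and configured credentials
    runtime_client = boto3.client("sagemaker-runtime")
    response = runtime_client.invoke_endpoint(
        **build_invoke_args(endpoint_name, inference_component_name, payload)
    )
    return response["Body"].read()
```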

------
#### [ boto3 models (without inference components) ]

**Example deployment**  

Provide the endpoint configuration to SageMaker AI. The service launches the ML compute instances and deploys the model or models as specified in the configuration.

Once you have your model and endpoint configuration, use the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account.

The following example creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources and deploy models.

```
create_endpoint_response = sagemaker_client.create_endpoint(
    # The endpoint name must be unique within an AWS Region in your AWS account:
    EndpointName = "endpoint-name",
    # The name of the endpoint configuration associated with this endpoint:
    EndpointConfigName = "endpoint-config-name")
```

------

## Deploy models with the AWS CLI
<a name="deploy-models-cli"></a>

You can deploy a model to an endpoint by using the AWS CLI.

### Overview
<a name="deploy-models-cli-overview"></a>

When you deploy a model with the AWS CLI, you can deploy it with or without using an inference component. The following sections summarize the commands that you run for both approaches. These commands are demonstrated by the examples that follow.

------
#### [ With inference components ]

To deploy a model with an inference component, do the following:

1. (Optional) Create a model with the [create-model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.

1. Specify the settings for your endpoint by creating an endpoint configuration. To create one, you run the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.

1. Create your endpoint by using the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command. In the command body, specify the endpoint configuration that you created.

1. Create an inference component by using the `create-inference-component` command. In the settings, you specify a model by doing either of the following:
   + Specifying a SageMaker AI model object
   + Specifying the model image URI and S3 URL

   You also allocate endpoint resources to the model. By creating the inference component, you deploy the model to the endpoint. You can deploy multiple models to an endpoint by creating multiple inference components — one for each model.

------
#### [ Without inference components ]

To deploy a model without using an inference component, do the following:

1. Create a SageMaker AI model by using the [create-model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command. In the endpoint configuration, you assign the model object to a production variant.

1. Create your endpoint by using the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command. In your command body, specify the endpoint configuration that you created.

   When you create the endpoint, SageMaker AI provisions the endpoint resources, and it deploys the model to the endpoint.

------

### Configure
<a name="cli-configure-endpoint"></a>

The following examples configure the resources that you require to deploy a model to an endpoint.

------
#### [ With inference components ]

**Example create-endpoint-config command**  
The following example creates an endpoint configuration with the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.  

```
aws sagemaker create-endpoint-config \
--endpoint-config-name endpoint-config-name \
--execution-role-arn arn:aws:iam::111122223333:role/service-role/role-name \
--production-variants file://production-variants.json
```
In this example, the file `production-variants.json` defines a production variant with the following JSON:  

```
[
    {
        "VariantName": "variant-name",
        "ModelName": "model-name",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1
    }
]
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointConfigArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint-config/endpoint-config-name"
}
```

------
#### [ Without inference components ]

**Example create-model command**  
The following example creates a model with the [create-model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.  

```
aws sagemaker create-model \
--model-name model-name \
--execution-role-arn arn:aws:iam::111122223333:role/service-role/role-name \
--primary-container "{ \"Image\": \"image-uri\", \"ModelDataUrl\": \"model-s3-url\"}"
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "ModelArn": "arn:aws:sagemaker:us-west-2:111122223333:model/model-name"
}
```

**Example create-endpoint-config command**  
The following example creates an endpoint configuration with the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.  

```
aws sagemaker create-endpoint-config \
--endpoint-config-name endpoint-config-name \
--production-variants file://production-variants.json
```
In this example, the file `production-variants.json` defines a production variant with the following JSON:  

```
[
    {
        "VariantName": "variant-name",
        "ModelName": "model-name",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1
    }
]
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointConfigArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint-config/endpoint-config-name"
}
```

------

### Deploy
<a name="cli-deploy"></a>

The following examples deploy a model to an endpoint.

------
#### [ With inference components ]

**Example create-endpoint command**  
The following example creates an endpoint with the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command.  

```
aws sagemaker create-endpoint \
--endpoint-name endpoint-name \
--endpoint-config-name endpoint-config-name
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/endpoint-name"
}
```

**Example create-inference-component command**  
The following example creates an inference component with the [create-inference-component](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-inference-component.html) command.  

```
aws sagemaker create-inference-component \
--inference-component-name inference-component-name \
--endpoint-name endpoint-name \
--variant-name variant-name \
--specification file://specification.json \
--runtime-config "{\"CopyCount\": 2}"
```
In this example, the file `specification.json` defines the container and compute resources with the following JSON:  

```
{
    "Container": {
        "Image": "image-uri",
        "ArtifactUrl": "model-s3-url"
    },
    "ComputeResourceRequirements": {
        "NumberOfCpuCoresRequired": 1,
        "MinMemoryRequiredInMb": 1024
    }
}
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "InferenceComponentArn": "arn:aws:sagemaker:us-west-2:111122223333:inference-component/inference-component-name"
}
```

------
#### [ Without inference components ]

**Example create-endpoint command**  
The following example creates an endpoint with the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command.  

```
aws sagemaker create-endpoint \
--endpoint-name endpoint-name \
--endpoint-config-name endpoint-config-name
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/endpoint-name"
}
```

------

# Invoke models for real-time inference
<a name="realtime-endpoints-test-endpoints"></a>

After you use Amazon SageMaker AI to deploy a model to an endpoint, you can interact with the model by sending inference requests to it. To send an inference request to a model, you invoke the endpoint that hosts it. You can invoke your endpoints using Amazon SageMaker Studio, the AWS SDKs, or the AWS CLI.

## Invoke Your Model Using Amazon SageMaker Studio
<a name="realtime-endpoints-test-endpoints-studio"></a>

After you deploy your model to an endpoint, you can view the endpoint through Amazon SageMaker Studio and test your endpoint by sending single inference requests.

**Note**  
SageMaker AI only supports endpoint testing in Studio for real-time endpoints.

**To send a test inference request to your endpoint**

1. Launch Amazon SageMaker Studio.

1. In the navigation pane on the left, choose **Deployments**.

1. From the dropdown, choose **Endpoints**.

1. Find your endpoint by name, and choose the name in the table. The endpoint names listed in the **Endpoints** panel are defined when you deploy a model. The Studio workspace opens the **Endpoint** page in a new tab.

1. Choose the **Test inference** tab.

1. For **Testing Options**, select one of the following:

   1. Select **Test the sample request** to immediately send a request to your endpoint. Use the **JSON editor** to provide sample data in JSON format, and choose **Send Request** to submit the request to your endpoint. After submitting your request, Studio shows the inference output in a card to the right of the JSON editor.

   1. Select **Use Python SDK example code** to view the code for sending a request to the endpoint. Then, copy the code example from the **Example inference request** section and run the code from your testing environment.

The top of the card shows the type of request that was sent to the endpoint (only JSON is accepted). The card shows the following fields:
+ **Status** – displays one of the following status types:
  + `Success` – The request succeeded.
  + `Failed` – The request failed. A response appears under **Failure Reason**.
  + `Pending` – While the inference request is pending, the status shows a spinning, circular icon.
+ **Execution Length** – How long the invocation took (end time minus the start time) in milliseconds.
+ **Request Time** – How many minutes have passed since the request was sent.
+ **Result Time** – How many minutes have passed since the result was returned.

## Invoke Your Model by Using the AWS SDK for Python (Boto3)
<a name="realtime-endpoints-test-endpoints-api"></a>

If you want to invoke a model endpoint in your application code, you can use one of the AWS SDKs, including the AWS SDK for Python (Boto3). To invoke your endpoint with this SDK, you use one of the following Python methods:
+ `invoke_endpoint` – Sends an inference request to a model endpoint and returns the response that the model generates. 

  This method returns the inference payload as one response after the model finishes generating it. For more information, see [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html) in the *AWS SDK for Python (Boto3) API Reference*.
+ `invoke_endpoint_with_response_stream` – Sends an inference request to a model endpoint and streams the response incrementally while the model generates it. 

  With this method, your application receives parts of the response as soon as the parts become available. For more information, see [invoke_endpoint_with_response_stream](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) in the *AWS SDK for Python (Boto3) API Reference*.

  Use this method only to invoke models that support inference streaming.

Before you can use these methods in your application code, you must initialize a SageMaker AI Runtime client, and you must specify the name of your endpoint. The following example sets up the client and endpoint for the rest of the examples that follow:

```
import boto3

sagemaker_runtime = boto3.client(
    "sagemaker-runtime", region_name='aws_region')

endpoint_name='endpoint-name'
```

### Invoke to Get an Inference Response
<a name="test-invoke-endpoint"></a>

The following example uses the `invoke_endpoint` method to invoke an endpoint with the AWS SDK for Python (Boto3):

```
# Gets inference from the model hosted at the specified endpoint:
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name, 
    Body=bytes('{"features": ["This is great!"]}', 'utf-8')
    )

# Decodes and prints the response body:
print(response['Body'].read().decode('utf-8'))
```

This example provides input data in the `Body` field for SageMaker AI to pass to the model. This data must be in the same format that was used for training. The example assigns the response to the `response` variable.

The `response` variable provides access to the HTTP status, the invoked production variant, and other fields. The following snippet prints the HTTP status code from the response metadata:

```
print(response["ResponseMetadata"]["HTTPStatusCode"])
```

### Invoke to Stream an Inference Response
<a name="test-invoke-endpoint-with-response-stream"></a>

If you deployed a model that supports inference streaming, you can invoke the model to receive its inference payload as a stream of parts. The model delivers these parts incrementally as the model generates them. When an application receives an inference stream, the application doesn't need to wait for the model to generate the whole response payload. Instead, the application immediately receives parts of the response as they become available. 

By consuming an inference stream in your application, you can create interactions where your users perceive the inference to be fast because they get the first part immediately. You can implement streaming to support fast interactive experiences, such as chatbots, virtual assistants, and music generators. For example, you could create a chatbot that incrementally shows the text generated by a large language model (LLM).

To get an inference stream, you can use the `invoke_endpoint_with_response_stream` method. In the response body, the SDK provides an `EventStream` object, which gives the inference as a series of `PayloadPart` objects.

**Example Inference Stream**  
The following example is a stream of `PayloadPart` objects:  

```
{'PayloadPart': {'Bytes': b'{"outputs": [" a"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" challenging"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" problem"]}\n'}}
. . .
```
In each payload part, the `Bytes` field provides a portion of the inference response from the model. This portion can be any content type that a model generates, such as text, image, or audio data. In this example, the portions are JSON objects that contain generated text from an LLM.  
Usually, the payload part contains a discrete chunk of data from the model. In this example, the discrete chunks are whole JSON objects. Occasionally, the streaming response splits the chunks over multiple payload parts, or it combines multiple chunks into one payload part. The following example shows a chunk of data in JSON format that's split over two payload parts:  

```
{'PayloadPart': {'Bytes': b'{"outputs": '}}
{'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
```
When you write application code that processes an inference stream, include logic that handles these occasional splits and combinations of data. As one strategy, you could write code that concatenates the contents of `Bytes` while your application receives the payload parts. By concatenating the example JSON data here, you would combine the data into a newline-delimited JSON body. Then, your code could process the stream by parsing the whole JSON object on each line.  
The following example shows the newline-delimited JSON that you would create when you concatenate the example contents of `Bytes`:  

```
{"outputs": [" a"]}
{"outputs": [" challenging"]}
{"outputs": [" problem"]}
. . .
```
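The concatenate-then-split strategy can be sketched as a small helper. This is a simplified, non-streaming illustration of the idea (the payload parts here are hypothetical); it assumes each complete JSON chunk ends with a newline:

```python
import json

def reassemble(payload_parts):
    """Concatenate the Bytes of each payload part, then parse one
    JSON object per newline-terminated line."""
    buffer = b"".join(payload_parts)
    # Split on newlines; skip the empty piece after the final '\n'.
    return [json.loads(line) for line in buffer.split(b"\n") if line]

# A chunk split over two payload parts reassembles into one JSON object:
parts = [b'{"outputs": ', b'[" problem"]}\n']
print(reassemble(parts))  # [{'outputs': [' problem']}]
```

The `SmrInferenceStream` class in the next example applies the same logic incrementally, buffering parts as they arrive instead of waiting for the whole stream.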

**Example Code to Process an Inference Stream**  

The following example Python class, `SmrInferenceStream`, demonstrates how you can process an inference stream that sends text data in JSON format:

```
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
            for line in self._readlines():
                resp = json.loads(line)
                part = resp.get("outputs")[0]
                # Returns parts incrementally:
                yield part
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            self.read_pos += len(line)
            yield line[:-1]
```

This example processes the inference stream by doing the following:
+ Initializes a SageMaker AI Runtime client and sets the name of a model endpoint. Before you can get an inference stream, the model that the endpoint hosts must support inference streaming.
+ In the example `stream_inference` method, receives a request body and passes it to the `invoke_endpoint_with_response_stream` method of the SDK.
+ Iterates over each event in the `EventStream` object that the SDK returns.
+ From each event, gets the contents of the `Bytes` object in the `PayloadPart` object.
+ In the example `_write` method, writes to a buffer to concatenate the contents of the `Bytes` objects. The combined contents form a newline-delimited JSON body.
+ Uses the example `_readlines` method to get an iterable series of JSON objects.
+ In each JSON object, gets a piece of the inference.
+ With the `yield` expression, returns the pieces incrementally.

The following example creates and uses a `SmrInferenceStream` object:

```
request_body = {"inputs": ["Large model inference is"],
                "parameters": {"max_new_tokens": 100,
                               "enable_sampling": "true"}}
smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')
```

This example passes a request body to the `stream_inference` method. It iterates over the response to print each piece that the inference stream returns.

The example assumes that the model at the specified endpoint is an LLM that generates text. The output from this example is a body of generated text that prints incrementally:

```
a challenging problem in machine learning. The goal is to . . .
```

## Invoke Your Model by Using the AWS CLI
<a name="realtime-endpoints-test-endpoints-cli"></a>

You can invoke your model endpoint by running commands with the AWS Command Line Interface (AWS CLI). The AWS CLI supports standard inference requests with the `invoke-endpoint` command, and it supports asynchronous inference requests with the `invoke-endpoint-async` command.

**Note**  
The AWS CLI doesn't support streaming inference requests.

The following example uses the `invoke-endpoint` command to send an inference request to a model endpoint:

```
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name endpoint_name \
    --body fileb://$file_name \
    output_file.txt
```

For the `--endpoint-name` parameter, provide the endpoint name that you specified when you created the endpoint. For the `--body` parameter, provide input data for SageMaker AI to pass to the model. The data must be in the same format that was used for training. This example shows how to send binary data to your endpoint.

For more information on when to use `file://` over `fileb://` when passing the contents of a file to a parameter of the AWS CLI, see [Best Practices for Local File Parameters](https://aws.amazon.com/blogs/developer/best-practices-for-local-file-parameters/).
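Before you run the command, you can generate the request file from Python. The following sketch assumes a JSON payload; the feature data and file name are hypothetical:

```python
import json

# Hypothetical JSON payload in the same format the model was trained on:
payload = {"features": ["This is great!"]}

# Write the payload as bytes so the CLI can send it with fileb://
file_name = "payload.json"
with open(file_name, "wb") as f:
    f.write(json.dumps(payload).encode("utf-8"))

# The file can then be passed to the CLI:
#   aws sagemaker-runtime invoke-endpoint \
#       --endpoint-name endpoint_name \
#       --body fileb://payload.json \
#       output_file.txt
```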

For more information, and to see additional parameters that you can pass, see [invoke-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker-runtime/invoke-endpoint.html) in the *AWS CLI Command Reference*.

If the `invoke-endpoint` command succeeds, it returns a response such as the following:

```
{
    "ContentType": "<content_type>; charset=utf-8",
    "InvokedProductionVariant": "<Variant>"
}
```

If the command doesn't succeed, check whether the input payload is in the correct format.

View the output of the invocation by checking the output file (`output_file.txt` in this example):

```
more output_file.txt
```

## Invoke Your Model by Using the AWS SDK for Python
<a name="realtime-endpoints-test-endpoints-sdk"></a>

### Invoke to Bidirectionally Stream an Inference Request and Response
<a name="realtime-endpoints-test-endpoints-sdk-overview"></a>

If you want to invoke a model endpoint that supports bidirectional streaming in your application code, you can use the [new experimental SDK for Python](https://github.com/awslabs/aws-sdk-python), which supports bidirectional streaming over HTTP/2. This SDK enables real-time, two-way communication between your client application and the SageMaker endpoint, allowing you to send inference requests incrementally while simultaneously receiving streaming responses as the model generates them. This is particularly useful for interactive applications where both the client and server need to exchange data continuously over a persistent connection.

**Note**  
The new experimental SDK is different from the standard Boto3 SDK and supports persistent bidirectional connections for data exchange. If you use the experimental Python SDK for any non-experimental use case, we strongly advise pinning to a specific version of the SDK.

To invoke your endpoint with bidirectional streaming, use the `invoke_endpoint_with_bidirectional_stream` method. This method establishes a persistent connection that allows you to stream multiple payload chunks to your model while receiving responses in real-time as the model processes data. The connection remains open until you explicitly close the input stream or the endpoint closes the connection, supporting up to 30 minutes of connection time.

### Prerequisites
<a name="realtime-endpoints-test-endpoints-sdk-prereq"></a>

Before you can use bidirectional streaming in your application code, you must:

1. Install the experimental SageMaker Runtime HTTP/2 SDK

1. Set up AWS credentials for your SageMaker Runtime client

1. Deploy a model that supports bidirectional streaming to a SageMaker endpoint

### Set up the bidirectional streaming client
<a name="realtime-endpoints-test-endpoints-sdk-setup-client"></a>

The following example shows how to initialize the required components for bidirectional streaming:

```
from sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
from sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
from smithy_aws_core.identity import EnvironmentCredentialsResolver
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme

# Configuration
AWS_REGION = "us-west-2"
BIDI_ENDPOINT = f"https://runtime.sagemaker.{AWS_REGION}.amazonaws.com:8443"
ENDPOINT_NAME = "your-endpoint-name"

# Initialize the client configuration
config = Config(
    endpoint_uri=BIDI_ENDPOINT,
    region=AWS_REGION,
    aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
    auth_scheme_resolver=HTTPAuthSchemeResolver(),
    auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")}
)

# Create the SageMaker Runtime HTTP/2 client
client = SageMakerRuntimeHTTP2Client(config=config)
```

### Complete bidirectional streaming client
<a name="realtime-endpoints-test-endpoints-sdk-complete-client"></a>

The following example demonstrates how to create a bidirectional streaming client that sends multiple text payloads to a SageMaker endpoint and processes responses in real-time:

```
import asyncio
import logging
from sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
from sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
from sagemaker_runtime_http2.models import (
    InvokeEndpointWithBidirectionalStreamInput, 
    RequestStreamEventPayloadPart, 
    RequestPayloadPart
)
from smithy_aws_core.identity import EnvironmentCredentialsResolver
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SageMakerBidirectionalClient:
    
    def __init__(self, endpoint_name, region="us-west-2"):
        self.endpoint_name = endpoint_name
        self.region = region
        self.client = None
        self.stream = None
        self.response_task = None
        self.is_active = False
        
    def _initialize_client(self):
        bidi_endpoint = f"runtime.sagemaker.{self.region}.amazonaws.com:8443"
        config = Config(
            endpoint_uri=bidi_endpoint,
            region=self.region,
            aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
            auth_scheme_resolver=HTTPAuthSchemeResolver(),
            auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")}
        )
        self.client = SageMakerRuntimeHTTP2Client(config=config)
    
    async def start_session(self):
        """Establish a bidirectional streaming connection with the endpoint."""
        if not self.client:
            self._initialize_client()
            
        logger.info(f"Starting session with endpoint: {self.endpoint_name}")
        self.stream = await self.client.invoke_endpoint_with_bidirectional_stream(
            InvokeEndpointWithBidirectionalStreamInput(endpoint_name=self.endpoint_name)
        )
        self.is_active = True
        
        # Start processing responses concurrently
        self.response_task = asyncio.create_task(self._process_responses())
    
    async def send_message(self, message):
        """Send a single message to the endpoint."""
        if not self.is_active:
            raise RuntimeError("Session not active. Call start_session() first.")
            
        logger.info(f"Sending message: {message}")
        payload = RequestPayloadPart(bytes_=message.encode('utf-8'))
        event = RequestStreamEventPayloadPart(value=payload)
        await self.stream.input_stream.send(event)
    
    async def send_multiple_messages(self, messages, delay=1.0):
        """Send multiple messages with a delay between each."""
        for message in messages:
            await self.send_message(message)
            await asyncio.sleep(delay)
    
    async def end_session(self):
        """Close the bidirectional streaming connection."""
        if not self.is_active:
            return
            
        await self.stream.input_stream.close()
        self.is_active = False
        logger.info("Stream closed")
        
        # Cancel the response processing task
        if self.response_task and not self.response_task.done():
            self.response_task.cancel()
    
    async def _process_responses(self):
        """Process incoming responses from the endpoint."""
        try:
            output = await self.stream.await_output()
            output_stream = output[1]
            
            while self.is_active:
                result = await output_stream.receive()
                
                if result is None:
                    logger.info("No more responses")
                    break
                
                if result.value and result.value.bytes_:
                    response_data = result.value.bytes_.decode('utf-8')
                    logger.info(f"Received: {response_data}")
                    
        except Exception as e:
            logger.error(f"Error processing responses: {e}")

# Example usage
async def run_bidirectional_client():
    client = SageMakerBidirectionalClient(endpoint_name="your-endpoint-name")
    
    try:
        # Start the session
        await client.start_session()
        
        # Send multiple messages
        messages = [
            "I need help with", 
            "my account balance", 
            "I can help with that", 
            "and recent charges"
        ]
        await client.send_multiple_messages(messages)
        
        # Wait for responses to be processed
        await asyncio.sleep(2)
        
        # End the session
        await client.end_session()
        logger.info("Session ended successfully")
        
    except Exception as e:
        logger.error(f"Client error: {e}")
        await client.end_session()

if __name__ == "__main__":
    asyncio.run(run_bidirectional_client())
```

The client initializes the SageMaker Runtime HTTP/2 client with the regional endpoint URI on port 8443, which is required for bidirectional streaming connections. The `start_session()` method calls `invoke_endpoint_with_bidirectional_stream()` to establish the persistent connection and creates an asynchronous task to process incoming responses concurrently.

The `send_message()` method wraps payload data in the appropriate request objects and sends them through the input stream, while the `_process_responses()` method continuously listens for and processes responses from the endpoint as they arrive. This bidirectional approach enables real-time interaction where both sending requests and receiving responses happen simultaneously over the same connection.

# Endpoints
<a name="realtime-endpoints-manage"></a>

After deploying your model to an endpoint, you might want to view and manage the endpoint. With SageMaker AI, you can view the status and details of your endpoint, check metrics and logs to monitor your endpoint’s performance, update the models deployed to your endpoint, and more.

The following sections show how you can manage endpoints within Amazon SageMaker Studio or within the AWS Management Console.

The following page describes how to interactively view and make changes to your endpoints using the Amazon SageMaker AI console or SageMaker Studio.

**Topics**
+ [View endpoint details in SageMaker Studio](manage-endpoints-studio.md)
+ [View endpoint details in the SageMaker AI console](manage-endpoints-console.md)

# View endpoint details in SageMaker Studio
<a name="manage-endpoints-studio"></a>

In Amazon SageMaker Studio, you can view and manage your SageMaker AI Hosting endpoints. To learn more about Studio, see [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html).

To find the list of your endpoints in SageMaker Studio, do the following:

1. Open the Studio application.

1. In the left navigation pane, choose **Deployments**.

1. From the dropdown menu, choose **Endpoints**.

The **Endpoints** page opens, which lists all of your SageMaker AI Hosting endpoints. From this page, you can see the endpoints and their **Status**. You can also create a new endpoint, edit an existing endpoint, or delete an endpoint.

To see the details for a specific endpoint, choose an endpoint from the list. On the endpoint’s details page, you get an overview like the following screenshot.

![\[Screenshot of an endpoint's main page showing a summary of the endpoint details in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-endpoint-details-page.png)


Each endpoint details page contains the following tabs of information:

# View Variants (or Models)
<a name="manage-endpoints-studio-variants"></a>

The **Variants** tab (also called the **Models** tab if your endpoint has multiple models deployed) shows you the list of [model variants](https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html) or models currently deployed to your endpoint. The following screenshot shows you what the overview and **Models** section looks like for an endpoint with multiple models deployed.

![\[Screenshot of an endpoint's main page showing multiple models deployed.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-goldfinch-multi-model-endpoint.png)


You can add or edit the settings for each variant or model. You can also select a variant and enable a default auto-scaling policy, which you can edit later in the **Auto-scaling** tab.

# View settings
<a name="manage-endpoints-studio-settings"></a>

On the **Settings** tab, you can view the endpoint’s associated AWS IAM role, the AWS KMS key used for encryption (if applicable), the name of your VPC, and the network isolation settings.

# Test inference
<a name="manage-endpoints-studio-test"></a>

On the **Test inference** tab, you can send a test inference request to a deployed model. This is useful if you’d like to verify that your endpoint responds to requests as expected.

To test inference, do the following:

1. On the model's **Test inference** tab, choose one of the following options:

   1. Select **Enter the request body** if you’d like to test the endpoint and receive a response through the Studio interface.

   1. Select **Copy example code (Python)** if you’d like to copy an AWS SDK for Python (Boto3) example that you can use to invoke your endpoint from a local environment and receive a response programmatically.

1. For **Model**, select the model that you want to test on the endpoint.

1. If you chose the Studio interface testing method, then you can also choose your desired **Content type** for the response from the dropdown.

After configuring your request, you can either choose **Send request** (to receive a response through the Studio interface) or **Copy** to copy the Python example.

If you receive a response through the Studio interface, it’ll look like the following screenshot.

![\[Screenshot of a successful inference test request on an endpoint in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/endpoint-test-inference.png)
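If you instead copy the Python example, it resembles the following sketch, which invokes the endpoint through the SageMaker Runtime `invoke_endpoint` API. The endpoint name, content type, payload, and helper function names here are placeholders for illustration, not the values that Studio generates for you.

```python
def build_invoke_request(endpoint_name, content_type, payload):
    """Assemble the parameters for a SageMaker Runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": content_type,
        "Body": payload,
    }

def send_test_request(endpoint_name, content_type, payload):
    """Send the request. Requires the boto3 package and AWS credentials."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        **build_invoke_request(endpoint_name, content_type, payload)
    )
    # The response body is a streaming object; read it to get the prediction
    return response["Body"].read()
```

For example, `send_test_request("my-endpoint", "text/csv", b"0.5,1.2,3.4")` would return the raw response body from the model hosted on the endpoint.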


# Auto-scaling
<a name="manage-endpoints-studio-autoscaling"></a>

On the **Auto-scaling** tab, you can view any auto-scaling policies configured for the models hosted on your endpoint. The following screenshot shows you the **Auto-scaling** tab.

![\[Screenshot of the Auto-scaling tab, showing one active policy.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-endpoint-autoscaling.png)


You can choose **Edit auto-scaling** to change any of the policies and turn on or turn off the default auto-scaling policy.

To learn more about auto-scaling for real-time endpoints, see [Automatically Scale Amazon SageMaker AI Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html). If you’re not sure how to configure an auto-scaling policy for your endpoint, you can use an [Inference Recommender autoscaling recommendations job](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-autoscaling.html) to get recommendations for an auto-scaling policy.
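Such a policy can also be configured programmatically. The following sketch (illustrative helper names; placeholder cooldown values) uses the Application Auto Scaling API through the AWS SDK for Python (Boto3) to register an endpoint variant as a scalable target and attach a target-tracking policy on the `SageMakerVariantInvocationsPerInstance` metric.

```python
def scaling_resource_id(endpoint_name, variant_name):
    # Application Auto Scaling identifies a variant by this resource ID format
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

def build_target_tracking_policy(target_invocations_per_instance):
    # Scale out when average invocations per instance exceed the target value
    return {
        "TargetValue": float(target_invocations_per_instance),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,  # placeholder: seconds to wait after scaling out
        "ScaleInCooldown": 300,   # placeholder: seconds to wait after scaling in
    }

def attach_autoscaling(endpoint_name, variant_name, min_capacity, max_capacity, target):
    """Register the variant and attach the policy. Requires boto3 and AWS credentials."""
    import boto3
    client = boto3.client("application-autoscaling")
    resource_id = scaling_resource_id(endpoint_name, variant_name)
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=build_target_tracking_policy(target),
    )
```

Policies created this way appear on the **Auto-scaling** tab alongside any policies you configure in Studio.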

# View endpoint details in the SageMaker AI console
<a name="manage-endpoints-console"></a>

To view your endpoints in the SageMaker AI console, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**.

1. From the dropdown list, choose **Endpoints**.

1. On the **Endpoints** page, choose your endpoint.

The endpoint details page should open, showing you a summary of your endpoint and metrics that have been collected for your endpoint.

The following sections describe the tabs on the endpoints details page.

# Endpoints monitoring
<a name="manage-endpoints-console-monitoring"></a>

After creating a SageMaker AI Hosting endpoint, you can monitor your endpoint using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. Using these metrics, you can access historical information and gain a better perspective on how your endpoint is performing. For more information, see the *[Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/)*.

From the **Monitoring** tab on the endpoint details page, you can view CloudWatch metrics data that has been collected from your endpoint.

The **Monitoring** tab includes the following sections:
+ **Operational metrics**: View metrics that track the utilization of your endpoint’s resources, such as CPU Utilization and Memory Utilization.
+ **Invocation metrics**: View metrics that track the number, health, and status of `InvokeEndpoint` requests coming to your endpoint, such as Invocation Model Errors and Model Latency.
+ **Health metrics**: View metrics that track your endpoint’s overall health, such as Invocation Failures and Notification Failures.

For detailed descriptions of each metric, see [Monitor SageMaker AI with CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

The following screenshot shows the **Operational metrics** section for a serverless endpoint.

![\[Screenshot of metrics graphs in the operational metrics section of the endpoint details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hosting-operational-metrics.png)




You can adjust the **Period** and **Statistic** that you want to track for the metrics in a given section, as well as the length of time for which you want to view metrics data. You can also add and remove metric widgets from the view for each section by choosing **Add widget**. In the **Add widget** dialog box, you can select and deselect the metrics that you want to see.

The metrics that are available may depend on your endpoint type. For example, serverless endpoints have some metrics that aren’t available for real-time endpoints. For more specific metrics information by endpoint type, see the following pages:
+ [Monitor a serverless endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-monitoring.html)
+ [Monitor an asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-monitor.html)
+ [CW Metrics for Multi-Model Endpoint Deployments](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoint-cloudwatch-metrics.html)
+ [Inference Pipeline Logs and Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-logs-metrics.html)
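You can also retrieve the same metrics programmatically. The following sketch (illustrative helper names; placeholder arguments) assembles a CloudWatch `get_metric_statistics` query for an endpoint invocation metric.

```python
from datetime import datetime, timedelta, timezone

def build_metric_query(endpoint_name, variant_name, metric_name, hours=3, period=300):
    """Assemble get_metric_statistics parameters for an endpoint invocation metric.

    Invocation metrics such as Invocations and ModelLatency are published in the
    AWS/SageMaker namespace; instance utilization metrics use a different
    namespace (/aws/sagemaker/Endpoints).
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,            # seconds per data point
        "Statistics": ["Average"],
    }

def fetch_metric(query):
    """Run the query. Requires boto3 and AWS credentials."""
    import boto3
    return boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
```

For example, `fetch_metric(build_metric_query("my-endpoint", "AllTraffic", "ModelLatency"))` would return the datapoints behind the graphs on the **Monitoring** tab.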

# Settings
<a name="manage-endpoints-console-settings"></a>

You can choose the **Settings** tab to view additional information about your endpoint, such as the data capture settings, the endpoint configuration, and tags.

# Create and view alarms
<a name="manage-endpoints-console-alarms"></a>

From the **Alarms** tab on your endpoint details page, you can view and create simple static threshold metric alarms, where you specify a threshold value for a metric. If the metric breaches the threshold value, the alarm goes into the `ALARM` state. For more information about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).

In the **Endpoint summary** section, you can view the **Alarms** field, which tells you how many alarms are currently active on your endpoint.

To view which alarms are in the `ALARM` state, choose the **Alarms** tab. The **Alarms** tab shows you a full list of your endpoint alarms, along with details about their status and conditions. The following screenshot shows a list of alarms in this section that have been configured for an endpoint.

![\[Screenshot of the alarms tab on the endpoint details page which shows a list of CloudWatch alarms.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hosting-alarms-tab.png)


An alarm’s status can be `In alarm`, `OK`, or `Insufficient data` if there isn’t enough metrics data being collected.

To create a new alarm for your endpoint, do the following:

1. In the **Alarms** tab, choose **Create alarm**.

1. The **Create alarm** page opens. For **Alarm name**, enter a name for the alarm.

1. (Optional) Enter a description for the alarm.

1. For **Metric**, choose the CloudWatch metric that you want the alarm to track.

1. For **Variant name**, choose the endpoint model variant that you want to monitor.

1. For **Statistic**, choose one of the available statistics for the metric you selected.

1. For **Period**, choose the time period to use for calculating each statistical value. For example, if you choose the Average statistic and a 5 minute period, each data point monitored by the alarm is the average of the metric’s data points at 5 minute intervals.

1. For **Evaluation periods**, enter the number of data points that you want the alarm to consider when evaluating whether to enter the alarm state or not.

1. For **Condition**, choose the condition that you want to use for your alarm threshold.

1. For **Threshold value**, enter the desired value for your threshold.

1. (Optional) For **Notification**, you can choose **Add notification** to create or specify an Amazon SNS topic that receives a notification when your alarm state changes.

1. Choose **Create alarm**.

After creating your alarm, you can return to the **Alarms** tab to view its status at any time. From this section, you can also select the alarm and either **Edit** or **Delete** it.
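The console steps above map to the CloudWatch `put_metric_alarm` API. The following sketch (illustrative helper names; placeholder values) builds the same kind of request programmatically.

```python
def build_alarm_request(alarm_name, endpoint_name, variant_name,
                        metric_name, threshold, evaluation_periods=3, period=300):
    """Assemble put_metric_alarm parameters mirroring the console fields."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,          # e.g. "ModelLatency"
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": period,                    # seconds per evaluated data point
        "EvaluationPeriods": evaluation_periods,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": float(threshold),
    }

def create_alarm(request, sns_topic_arn=None):
    """Create the alarm; optionally notify an SNS topic when it changes state.

    Requires boto3 and AWS credentials.
    """
    import boto3
    if sns_topic_arn:
        request = {**request, "AlarmActions": [sns_topic_arn]}
    boto3.client("cloudwatch").put_metric_alarm(**request)
```

Alarms created this way appear in the **Alarms** tab just like alarms created through the console.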

# Hosting options
<a name="realtime-endpoints-options"></a>

The following topics describe the available SageMaker AI real-time hosting options, along with how to set up, invoke, and delete each hosting option.

**Topics**
+ [Single-model endpoints](realtime-single-model.md)
+ [Multi-model endpoints](multi-model-endpoints.md)
+ [Multi-container endpoints](multi-container-endpoints.md)
+ [Inference pipelines in Amazon SageMaker AI](inference-pipelines.md)
+ [Delete Endpoints and Resources](realtime-endpoints-delete-resources.md)

# Single-model endpoints
<a name="realtime-single-model"></a>

You can create, update, and delete real-time inference endpoints that host a single model with Amazon SageMaker Studio, the AWS SDK for Python (Boto3), the SageMaker Python SDK, or the AWS CLI. For procedures and code examples, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).

# Multi-model endpoints
<a name="multi-model-endpoints"></a>

Multi-model endpoints provide a scalable and cost-effective solution to deploying large numbers of models. They use the same fleet of resources and a shared serving container to host all of your models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because Amazon SageMaker AI manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

The following diagram shows how multi-model endpoints work compared to single-model endpoints.

![\[Diagram that shows how multi-model versus how single-model endpoints host models.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/multi-model-endpoints-diagram.png)


Multi-model endpoints are ideal for hosting a large number of models that use the same ML framework on a shared serving container. If you have a mix of frequently and infrequently accessed models, a multi-model endpoint can efficiently serve this traffic with fewer resources and higher cost savings. Your application should be tolerant of occasional cold start-related latency penalties that occur when invoking infrequently used models.

Multi-model endpoints support hosting both CPU and GPU backed models. By using GPU backed models, you can lower your model deployment costs through increased usage of the endpoint and its underlying accelerated compute instances.

Multi-model endpoints also enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency. When this is the case, multi-model endpoints can effectively use instances across all models. If you have models that have significantly higher transactions per second (TPS) or latency requirements, we recommend hosting them on dedicated endpoints.

You can use multi-model endpoints with the following features:
+ [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) and VPCs
+ [Auto scaling](multi-model-endpoints-autoscaling.md)
+ [Serial inference pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) (but only one multi-model enabled container can be included in an inference pipeline)
+ A/B testing

You can use the AWS SDK for Python (Boto) or the SageMaker AI console to create a multi-model endpoint. For CPU backed multi-model endpoints, you can create your endpoint with custom-built containers by integrating the [Multi Model Server](https://github.com/awslabs/multi-model-server) library.

**Topics**
+ [How multi-model endpoints work](#how-multi-model-endpoints-work)
+ [Sample notebooks for multi-model endpoints](#multi-model-endpoint-sample-notebooks)
+ [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md)
+ [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md)
+ [Create a Multi-Model Endpoint](create-multi-model-endpoint.md)
+ [Invoke a Multi-Model Endpoint](invoke-multi-model-endpoint.md)
+ [Add or Remove Models](add-models-to-endpoint.md)
+ [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md)
+ [Multi-Model Endpoint Security](multi-model-endpoint-security.md)
+ [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md)
+ [Set SageMaker AI multi-model endpoint model caching behavior](multi-model-caching.md)
+ [Set Auto Scaling Policies for Multi-Model Endpoint Deployments](multi-model-endpoints-autoscaling.md)

## How multi-model endpoints work
<a name="how-multi-model-endpoints-work"></a>

 SageMaker AI manages the lifecycle of models hosted on multi-model endpoints in the container's memory. Instead of downloading all of the models from an Amazon S3 bucket to the container when you create the endpoint, SageMaker AI dynamically loads and caches them when you invoke them. When SageMaker AI receives an invocation request for a particular model, it does the following: 

1. Routes the request to an instance behind the endpoint.

1. Downloads the model from the S3 bucket to that instance's storage volume.

1. Loads the model to the container's memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) on that accelerated compute instance. If the model is already loaded in the container's memory, invocation is faster because SageMaker AI doesn't need to download and load it.

SageMaker AI continues to route requests for a model to the instance where the model is already loaded. However, if the model receives many invocation requests, and there are additional instances for the multi-model endpoint, SageMaker AI routes some requests to another instance to accommodate the traffic. If the model isn't already loaded on the second instance, the model is downloaded to that instance's storage volume and loaded into the container's memory.

When an instance's memory utilization is high and SageMaker AI needs to load another model into memory, it unloads unused models from that instance's container to ensure that there is enough memory to load the model. Models that are unloaded remain on the instance's storage volume and can be loaded into the container's memory later without being downloaded again from the S3 bucket. If the instance's storage volume reaches its capacity, SageMaker AI deletes any unused models from the storage volume.
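As a rough mental model of this eviction behavior, the following toy sketch (an illustration only, not SageMaker AI's actual implementation) treats the container's memory as a least-recently-used (LRU) cache of loaded models.

```python
from collections import OrderedDict

class ModelCache:
    """Toy LRU sketch of evicting unused models when memory is constrained."""

    def __init__(self, capacity):
        self.capacity = capacity          # max number of models held in memory
        self.loaded = OrderedDict()       # model name -> loaded model object

    def invoke(self, name, load_fn):
        if name in self.loaded:
            # Cache hit: mark the model as most recently used
            self.loaded.move_to_end(name)
            return self.loaded[name]
        if len(self.loaded) >= self.capacity:
            # Memory is full: unload the least recently used model
            self.loaded.popitem(last=False)
        # Cold start: "download" and load the model into memory
        self.loaded[name] = load_fn(name)
        return self.loaded[name]
```

With a capacity of two, invoking models `a`, `b`, `a`, then `c` evicts `b`, because `a` was used more recently, mirroring how infrequently used models absorb the cold-start cost.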

To delete a model, stop sending requests and delete it from the S3 bucket. SageMaker AI provides multi-model endpoint capability in a serving container. Adding models to, and deleting them from, a multi-model endpoint doesn't require updating the endpoint itself. To add a model, you upload it to the S3 bucket and invoke it. You don’t need code changes to use it.
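As a sketch of that workflow, the following example (placeholder bucket, prefix, and names; illustrative helper functions) uploads a new artifact under the endpoint's Amazon S3 prefix and then invokes it with the `TargetModel` parameter of `invoke_endpoint`.

```python
def target_model_request(endpoint_name, model_artifact, payload, content_type="text/csv"):
    """Parameters for invoking one model on a multi-model endpoint.

    TargetModel is the artifact's key relative to the endpoint's ModelDataUrl
    prefix, for example "model-42.tar.gz".
    """
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_artifact,
        "ContentType": content_type,
        "Body": payload,
    }

def add_and_invoke(bucket, prefix, endpoint_name, model_artifact, local_path, payload):
    """Upload the artifact, then invoke it. Requires boto3 and AWS credentials."""
    import boto3
    # Adding a model is just an S3 upload under the endpoint's prefix;
    # no endpoint update is required
    boto3.client("s3").upload_file(local_path, bucket, f"{prefix}/{model_artifact}")
    runtime = boto3.client("sagemaker-runtime")
    return runtime.invoke_endpoint(
        **target_model_request(endpoint_name, model_artifact, payload)
    )
```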

**Note**  
When you update a multi-model endpoint, initial invocation requests on the endpoint might experience higher latencies as Smart Routing in multi-model endpoints adapts to your traffic pattern. However, once it learns your traffic pattern, you can experience low latencies for the most frequently used models. Less frequently used models may incur some cold start latency because models are dynamically loaded to an instance.

## Sample notebooks for multi-model endpoints
<a name="multi-model-endpoint-sample-notebooks"></a>

To learn more about how to use multi-model endpoints, you can try the following sample notebooks:
+ Examples for multi-model endpoints using CPU backed instances:
  + [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) – This notebook shows how to deploy multiple XGBoost models to an endpoint.
  + [Multi-Model Endpoints BYOC Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.html) – This notebook shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI.
+ Example for multi-model endpoints using GPU backed instances:
  + [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb) – This notebook shows how to use an NVIDIA Triton Inference container to deploy ResNet-50 models to a multi-model endpoint.

For instructions on how to create and access Jupyter notebook instances that you can use to run the previous examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you've created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The multi-model endpoint notebooks are located in the **ADVANCED FUNCTIONALITY** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

For more information about use cases for multi-model endpoints, see the following blogs and resources:
+ Video: [Hosting thousands of models on SageMaker AI](https://www.youtube.com/watch?v=XqCNTWmHsLc&t=751s)
+ Video: [SageMaker AI ML for SaaS](https://www.youtube.com/watch?v=BytpYlJ3vsQ)
+ Blog: [How to scale machine learning inference for multi-tenant SaaS use cases](https://aws.amazon.com/blogs/machine-learning/how-to-scale-machine-learning-inference-for-multi-tenant-saas-use-cases/)
+ Case study: [Veeva Systems](https://aws.amazon.com/partners/success/advanced-clinical-veeva/)

# Supported algorithms, frameworks, and instances for multi-model endpoints
<a name="multi-model-support"></a>

For information about the algorithms, frameworks, and instance types that you can use with multi-model endpoints, see the following sections.

## Supported algorithms, frameworks, and instances for multi-model endpoints using CPU backed instances
<a name="multi-model-support-cpu"></a>

The inference containers for the following algorithms and frameworks support multi-model endpoints:
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)
+ [Linear Learner Algorithm](linear-learner.md)
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)
+ [Resources for using TensorFlow with Amazon SageMaker AI](tf.md)
+ [Resources for using Scikit-learn with Amazon SageMaker AI](sklearn.md)
+ [Resources for using Apache MXNet with Amazon SageMaker AI](mxnet.md)
+ [Resources for using PyTorch with Amazon SageMaker AI](pytorch.md)

To use any other framework or algorithm, use the SageMaker AI inference toolkit to build a container that supports multi-model endpoints. For information, see [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md).

Multi-model endpoints support all of the CPU instance types.

## Supported algorithms, frameworks, and instances for multi-model endpoints using GPU backed instances
<a name="multi-model-support-gpu"></a>

Hosting multiple GPU backed models on multi-model endpoints is supported through the [SageMaker AI Triton Inference server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). It supports all major inference frameworks, such as NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, scikit-learn, RandomForest, OpenVINO, custom C++, and more.

To use any other framework or algorithm, you can use the Triton backend for Python or C++ to write your model logic and serve any custom model. After the server is ready, you can deploy hundreds of deep learning models behind one endpoint.

Multi-model endpoints support the following GPU instance types:


| Instance family | Instance type | vCPUs | GiB of memory per vCPU | GPUs | GPU memory | 
| --- | --- | --- | --- | --- | --- | 
| p2 | ml.p2.xlarge | 4 | 15.25 | 1 | 12 | 
| p3 | ml.p3.2xlarge | 8 | 7.62 | 1 | 16 | 
| g5 | ml.g5.xlarge | 4 | 4 | 1 | 24 | 
| g5 | ml.g5.2xlarge | 8 | 4 | 1 | 24 | 
| g5 | ml.g5.4xlarge | 16 | 4 | 1 | 24 | 
| g5 | ml.g5.8xlarge | 32 | 4 | 1 | 24 | 
| g5 | ml.g5.16xlarge | 64 | 4 | 1 | 24 | 
| g4dn | ml.g4dn.xlarge | 4 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.2xlarge | 8 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.4xlarge | 16 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.8xlarge | 32 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.16xlarge | 64 | 4 | 1 | 16 | 

# Instance recommendations for multi-model endpoint deployments
<a name="multi-model-endpoint-instance"></a>

There are several items to consider when selecting a SageMaker AI ML instance type for a multi-model endpoint:
+ Provision sufficient [Amazon Elastic Block Store (Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) capacity for all of the models that need to be served.
+ Balance performance (minimize cold starts) and cost (don’t over-provision instance capacity). For information about the size of the storage volume that SageMaker AI attaches for each instance type for an endpoint and for a multi-model endpoint, see [Instance storage volumes](host-instance-storage.md).
+ For a container configured to run in `MultiModel` mode, the storage volume provisioned for its instances is larger than in the default `SingleModel` mode. This allows more models to be cached on the instance storage volume than in `SingleModel` mode.

When choosing a SageMaker AI ML instance type, consider the following:
+ Multi-model endpoints are currently supported for all CPU instance types and on single-GPU instance types.
+ Consider the traffic distribution (access patterns) to the models that you want to host behind the multi-model endpoint, along with the model size (how many models can be loaded in memory on the instance), and keep the following information in mind:
  + Think of the amount of memory on an instance as the cache space for models to be loaded, and think of the number of vCPUs as the concurrency limit to perform inference on the loaded models (assuming that invoking a model is bound to CPU).
  + For CPU backed instances, the number of vCPUs impacts your maximum concurrent invocations per instance (assuming that invoking a model is bound to CPU). A higher amount of vCPUs enables you to invoke more unique models concurrently.
  + For GPU backed instances, a higher amount of instance and GPU memory enables you to have more models loaded and ready to serve inference requests.
  + For both CPU and GPU backed instances, have some "slack" memory available so that unused models can be unloaded, and especially for multi-model endpoints with multiple instances. If an instance or an Availability Zone fails, the models on those instances will be rerouted to other instances behind the endpoint.
+ Determine your tolerance to loading/downloading times:
  + The d instance type families (for example, m5d, c5d, or r5d) and g5 instances come with an NVMe (non-volatile memory express) SSD, which offers high I/O performance and might reduce the time it takes to download models to the storage volume and for the container to load the model from the storage volume.
  + Because d and g5 instance types come with NVMe SSD storage, SageMaker AI does not attach an Amazon EBS storage volume to the ML compute instances that host the multi-model endpoint. Auto scaling works best when the models are similarly sized and homogeneous, that is, when they have similar inference latency and resource requirements.

You can also use the following guidance to help you optimize model loading on your multi-model endpoints:

**Choosing an instance type that can't hold all of the targeted models in memory**

In some cases, you might opt to reduce costs by choosing an instance type that can't hold all of the targeted models in memory at once. SageMaker AI dynamically unloads models when it runs out of memory to make room for a newly targeted model. For infrequently requested models, you sacrifice dynamic load latency. In cases with more stringent latency needs, you might opt for larger instance types or more instances. Investing time up front for performance testing and analysis helps you to have successful production deployments.

**Evaluating your model cache hits**

Amazon CloudWatch metrics can help you evaluate your models. For more information about metrics you can use with multi-model endpoints, see [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md).

You can use the `Average` statistic of the `ModelCacheHit` metric to monitor the ratio of requests where the model is already loaded. You can use the `SampleCount` statistic for the `ModelUnloadingTime` metric to monitor the number of unload requests sent to the container during a time period. If models are unloaded too frequently (an indicator of *thrashing*, where models are being unloaded and loaded again because there is insufficient cache space for the working set of models), consider using a larger instance type with more memory or increasing the number of instances behind the multi-model endpoint. For multi-model endpoints with multiple instances, be aware that a model might be loaded on more than one instance.
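For example, given the datapoints returned by CloudWatch `get_metric_statistics` with `Statistics=["Average"]`, you can summarize the cache hit ratio with a small helper (an illustrative function, not part of any SDK):

```python
def cache_hit_ratio(datapoints):
    """Average ModelCacheHit across CloudWatch datapoints (1.0 = every request hit).

    Each datapoint is expected to carry an "Average" value, as returned by
    get_metric_statistics with Statistics=["Average"].
    """
    values = [dp["Average"] for dp in datapoints]
    return sum(values) / len(values) if values else None
```

A ratio well below 1.0, combined with frequent `ModelUnloadingTime` samples, suggests thrashing and that a larger instance type or more instances may be warranted.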

# Create a Multi-Model Endpoint
<a name="create-multi-model-endpoint"></a>

You can use the SageMaker AI console or the AWS SDK for Python (Boto) to create a multi-model endpoint. To create either a CPU or GPU backed endpoint through the console, see the console procedure in the following sections. If you want to create a multi-model endpoint with the AWS SDK for Python (Boto), use either the CPU or GPU procedure in the following sections. The CPU and GPU workflows are similar but have several differences, such as the container requirements.

**Topics**
+ [Create a multi-model endpoint (console)](#create-multi-model-endpoint-console)
+ [Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-cpu)
+ [Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-gpu)

## Create a multi-model endpoint (console)
<a name="create-multi-model-endpoint-console"></a>

You can create both CPU and GPU backed multi-model endpoints through the console. Use the following procedure to create a multi-model endpoint through the SageMaker AI console.

**To create a multi-model endpoint (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Model**, and then from the **Inference** group, choose **Create model**. 

1. For **Model name**, enter a name.

1. For **IAM role**, choose or create an IAM role that has the `AmazonSageMakerFullAccess` IAM policy attached. 

1.  In the **Container definition** section, for **Provide model artifacts and inference image options**, choose **Use multiple models**.  
![\[The section of the Create model page where you can choose Use multiple models.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mme-create-model-ux-2.PNG)

1. For the **Inference container image**, enter the Amazon ECR path for your desired container image.

   For GPU models, you must use a container backed by the NVIDIA Triton Inference Server. For a list of container images that work with GPU backed endpoints, see the [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). For more information about the NVIDIA Triton Inference Server, see [Use Triton Inference Server with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html).

1. Choose **Create model**.

1. Deploy your multi-model endpoint as you would a single model endpoint. For instructions, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

## Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)
<a name="create-multi-model-endpoint-sdk-cpu"></a>

Use the following section to create a multi-model endpoint backed by CPU instances. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs just as you would to create a single-model endpoint, but with two changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass a `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model.

For a sample notebook that uses SageMaker AI to deploy multiple XGBoost models to an endpoint, see [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html). 

The following procedure outlines the key steps used in that sample to create a CPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Get a container with an image that supports deploying multi-model endpoints. For a list of built-in algorithms and framework containers that support multi-model endpoints, see [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). For this example, we use the [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md) built-in algorithm. We call the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html) utility function `image_uris.retrieve()` to get the address for the K-Nearest Neighbors built-in algorithm image.

   ```
   import sagemaker
   region = sagemaker_session.boto_region_name
   image = sagemaker.image_uris.retrieve("knn",region=region)
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional container(s) to include in the pipeline, and include it in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                 'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model-enabled endpoint in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the `Container` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint for the model. We recommend configuring your endpoints with at least two instances. This allows SageMaker AI to provide a highly available set of predictions across multiple Availability Zones for the models.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.m4.xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```

## Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)
<a name="create-multi-model-endpoint-sdk-gpu"></a>

Use the following section to create a GPU backed multi-model endpoint. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs similarly to creating single model endpoints, but there are several changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass the `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model. For GPU backed multi-model endpoints, you also must use a container with the NVIDIA Triton Inference Server that is optimized for running on GPU instances. For a list of container images that work with GPU backed endpoints, see the [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only).

For an example notebook that demonstrates how to create a multi-model endpoint backed by GPUs, see [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb).

The following procedure outlines the key steps to create a GPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Define the container image. To create a multi-model endpoint with GPU support for ResNet models, define the container to use the [NVIDIA Triton Server image](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container supports multi-model endpoints and is optimized for running on GPU instances. We call the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html) utility function `image_uris.retrieve()` to get the address for the image. For example:

   ```
   import sagemaker
   region = sagemaker_session.boto_region_name
   
   # Find the sagemaker-tritonserver image at 
   # https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/resnet50/triton_resnet50.ipynb
   # Find available tags at https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only
   
   image = "{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-tritonserver:<TAG>".format(
       account_id=account_id_map[region], region=region
   )
   
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel',
                 "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet"},
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional container(s) to include in the pipeline, and include it in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                 'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model-enabled endpoint in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the `Container` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint with GPU backed instances for the model. We recommend configuring your endpoints with more than one instance to allow for high availability and higher cache hits.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.g4dn.4xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```

# Invoke a Multi-Model Endpoint
<a name="invoke-multi-model-endpoint"></a>

To invoke a multi-model endpoint, use [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) from the SageMaker AI Runtime just as you would invoke a single model endpoint, with one change. Pass a new `TargetModel` parameter that specifies which of the models at the endpoint to target. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

The following procedures are the same for both CPU and GPU-backed multi-model endpoints.

------
#### [ AWS SDK for Python (Boto 3) ]

The following example prediction request uses the [AWS SDK for Python (Boto 3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) in the sample notebook.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName = "<ENDPOINT_NAME>",
                        ContentType  = "text/csv",
                        TargetModel  = "<MODEL_FILENAME>.tar.gz",
                        Body         = body)
```

------
#### [ AWS CLI ]

 The following example shows how to make a CSV request with two rows using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name "<ENDPOINT_NAME>" \
  --body "1.0,2.0,5.0"$'\n'"2.0,3.0,4.0" \
  --content-type "text/csv" \
  --target-model "<MODEL_NAME>.tar.gz" \
  output_file.txt
```

If the inference request succeeds, the command writes the response to `output_file.txt`. For more examples on how to make predictions with the AWS CLI, see [Making predictions with the AWS CLI](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#making-predictions-with-the-aws-cli) in the SageMaker Python SDK documentation.

------

The multi-model endpoint dynamically loads target models as needed. You can observe this when running the [MME Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) as it iterates through random invocations against multiple target models hosted behind a single endpoint. The first request against a given model takes longer because the model has to be downloaded from Amazon Simple Storage Service (Amazon S3) and loaded into memory. This is called a *cold start*, and it is expected on multi-model endpoints to optimize for better price performance for customers. Subsequent calls finish faster because there's no additional overhead after the model has loaded.

**Note**  
For GPU backed instances, an HTTP 507 response code from the GPU container indicates a lack of memory or other resources. In this case, SageMaker AI unloads unused models from the container in order to load more frequently used models.

## Retry Requests on ModelNotReadyException Errors
<a name="invoke-multi-model-config-retry"></a>

The first time you call `invoke_endpoint` for a model, the model is downloaded from Amazon Simple Storage Service and loaded into the inference container. This makes the first call take longer to return. Subsequent calls to the same model finish faster, because the model is already loaded.

SageMaker AI returns a response for a call to `invoke_endpoint` within 60 seconds. Some models are too large to download within 60 seconds. If the model does not finish loading before the 60 second timeout limit, the request to `invoke_endpoint` returns with the error code `ModelNotReadyException`, and the model continues to download and load into the inference container for up to 360 seconds. If you get a `ModelNotReadyException` error code for an `invoke_endpoint` request, retry the request. By default, the AWS SDKs for Python (Boto 3) (using [Legacy retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode)) and Java retry `invoke_endpoint` requests that result in `ModelNotReadyException` errors. You can configure the retry strategy to continue retrying the request for up to 360 seconds. If you expect your model to take longer than 60 seconds to download and load into the container, set the SDK socket timeout to 70 seconds. For more information about configuring the retry strategy for the AWS SDK for Python (Boto3), see [Configuring a retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#configuring-a-retry-mode). The following code shows an example that configures the retry strategy to retry calls to `invoke_endpoint` for up to 180 seconds.

```
import boto3
from botocore.config import Config

# This example retry strategy sets the retry attempts to 2. 
# With this setting, the request can attempt to download and/or load the model 
# for up to 180 seconds: 1 original request (60 seconds) + 2 retries (120 seconds)
config = Config(
    read_timeout=70,
    retries={
        'max_attempts': 2  # This value can be adjusted to 5 to go up to the 360s max timeout
    }
)
runtime_sagemaker_client = boto3.client('sagemaker-runtime', config=config)
```

# Add or Remove Models
<a name="add-models-to-endpoint"></a>

You can deploy additional models to a multi-model endpoint and invoke them through that endpoint immediately. When adding a new model, you don't need to update or bring down the endpoint, so you avoid the cost of creating and running a separate endpoint for each new model. The process for adding and removing models is the same for CPU and GPU-backed multi-model endpoints.

SageMaker AI unloads unused models from the container when the instance is reaching memory capacity and more models need to be downloaded into the container. SageMaker AI also deletes unused model artifacts from the instance storage volume when the volume is reaching capacity and new models need to be downloaded. The first invocation to a newly added model takes longer because the endpoint takes time to download the model from Amazon S3 into the container's memory on the instance that hosts the endpoint.

With the endpoint already running, copy a new set of model artifacts to the Amazon S3 location where you store your models.

```
# Add an AdditionalModel to the endpoint and exercise it
aws s3 cp AdditionalModel.tar.gz s3://amzn-s3-demo-bucket/path/to/artifacts/
```

**Important**  
To update a model, proceed as you would when adding a new model. Use a new and unique name. Don't overwrite model artifacts in Amazon S3 because the old version of the model might still be loaded in the containers or on the storage volume of the instances on the endpoint. Invocations to the new model could then invoke the old version of the model. 

Client applications can request predictions from the additional target model as soon as it is stored in S3.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName='<ENDPOINT_NAME>',
                        ContentType='text/csv',
                        TargetModel='AdditionalModel.tar.gz',
                        Body=body)
```

To delete a model from a multi-model endpoint, stop invoking the model from the clients and remove it from the S3 location where model artifacts are stored.
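When removing artifacts, the same prefix convention used at creation time applies: the object key is the `ModelDataUrl` prefix joined with the `TargetModel` relative path. The following is a minimal sketch of that resolution step; the `artifact_location` helper and the bucket name are illustrative, not part of the SageMaker AI API:

```python
from urllib.parse import urlparse

def artifact_location(model_data_url, target_model):
    """Resolve the S3 bucket and key for a TargetModel by joining the
    CreateModel ModelDataUrl prefix with the model's relative path."""
    parsed = urlparse(model_data_url)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    return bucket, f"{prefix}/{target_model}"

# Example cleanup using the resolved location (requires boto3 and AWS credentials):
# import boto3
# bucket, key = artifact_location("s3://amzn-s3-demo-bucket/path/to/artifacts/",
#                                 "AdditionalModel.tar.gz")
# boto3.client("s3").delete_object(Bucket=bucket, Key=key)
```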

# Build Your Own Container for SageMaker AI Multi-Model Endpoints
<a name="build-multi-model-build-container"></a>

Refer to the following sections for bringing your own container and dependencies to multi-model endpoints.

**Topics**
+ [Bring your own dependencies for multi-model endpoints on CPU backed instances](#build-multi-model-container-cpu)
+ [Bring your own dependencies for multi-model endpoints on GPU backed instances](#build-multi-model-container-gpu)
+ [Use the SageMaker AI Inference Toolkit](#multi-model-inference-toolkit)
+ [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md)

## Bring your own dependencies for multi-model endpoints on CPU backed instances
<a name="build-multi-model-container-cpu"></a>

If none of the pre-built container images serve your needs, you can build your own container for use with CPU backed multi-model endpoints.

Custom Amazon Elastic Container Registry (Amazon ECR) images deployed in Amazon SageMaker AI are expected to adhere to the basic contract described in [Custom Inference Code with Hosting Services](your-algorithms-inference-code.md), which governs how SageMaker AI interacts with a Docker container that runs your own inference code. For a container to be capable of loading and serving multiple models concurrently, there are additional APIs and behaviors that must be followed. This additional contract includes new APIs to load, list, get, and unload models, and a different API to invoke models. There are also different behaviors for error scenarios that the APIs need to abide by. To indicate that the container complies with the additional requirements, you can add the following label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
```

SageMaker AI also injects the following environment variable into the container:

```
SAGEMAKER_MULTI_MODEL=true
```
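A container entrypoint can read this variable to decide whether to start its model server in multi-model mode. The following is a minimal sketch; the `serving_mode` function name and the mode strings are illustrative, not part of any SageMaker AI library:

```python
import os

def serving_mode():
    """Return the serving mode based on the environment variable that
    SageMaker AI injects into multi-model containers."""
    if os.environ.get("SAGEMAKER_MULTI_MODEL") == "true":
        return "multi-model"
    return "single-model"

# An entrypoint might branch on this value, for example to pass a
# multi-model flag to the model server process it launches.
```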

If you are creating a multi-model endpoint for a serial inference pipeline, your Dockerfile must have the required labels for both multi-models and serial inference pipelines. For more information about serial inference pipelines, see [Run Real-time Predictions with an Inference Pipeline](inference-pipeline-real-time.md).

To help you implement these requirements for a custom container, two libraries are available:
+ [Multi Model Server](https://github.com/awslabs/multi-model-server) is an open source framework for serving machine learning models that can be installed in containers to provide the front end that fulfills the requirements for the new multi-model endpoint container APIs. It provides the HTTP front end and model management capabilities required by multi-model endpoints to host multiple models within a single container, load models into and unload models out of the container dynamically, and perform inference on a specified loaded model. It also provides a pluggable custom backend handler where you can implement your own algorithm.
+ [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) is a library that bootstraps Multi Model Server with a configuration and settings that make it compatible with SageMaker AI multi-model endpoints. It also allows you to tweak important performance parameters, such as the number of workers per model, depending on the needs of your scenario. 

## Bring your own dependencies for multi-model endpoints on GPU backed instances
<a name="build-multi-model-container-gpu"></a>

The bring your own container (BYOC) capability on multi-model endpoints with GPU backed instances is not currently supported by the Multi Model Server and SageMaker AI Inference Toolkit libraries.

For creating multi-model endpoints with GPU backed instances, you can use the SageMaker AI supported [NVIDIA Triton Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html) with the [NVIDIA Triton Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). To bring your own dependencies, you can build your own container that uses the SageMaker AI supported [NVIDIA Triton Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html) as the base image in your Dockerfile:

```
FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3
```

**Important**  
Containers with the Triton Inference Server are the only supported containers you can use for GPU backed multi-model endpoints.

## Use the SageMaker AI Inference Toolkit
<a name="multi-model-inference-toolkit"></a>

**Note**  
The SageMaker AI Inference Toolkit is only supported for CPU backed multi-model endpoints. It is not currently supported for GPU backed multi-model endpoints.

Pre-built containers that support multi-model endpoints are listed in [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). If you want to use any other framework or algorithm, you need to build a container. The easiest way to do this is to use the [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) to extend an existing pre-built container. The SageMaker AI inference toolkit is an implementation for the multi-model server (MMS) that creates endpoints that can be deployed in SageMaker AI. For a sample notebook that shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI, see the [Multi-Model Endpoint BYOC Sample Notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_bring_your_own).

**Note**  
The SageMaker AI inference toolkit supports only Python model handlers. If you want to implement your handler in any other language, you must build your own container that implements the additional multi-model endpoint APIs. For information, see [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md).

**To extend a container by using the SageMaker AI inference toolkit**

1. Create a model handler. MMS expects a model handler, which is a Python file that implements functions to pre-process the input, get predictions from the model, and process the output. For an example of a model handler, see [model_handler.py](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/model_handler.py) from the sample notebook.

1. Import the inference toolkit and use its `model_server.start_model_server` function to start MMS. The following example is from the `dockerd-entrypoint.py` file from the sample notebook. Notice that the call to `model_server.start_model_server` passes the model handler described in the previous step:

   ```
   import subprocess
   import sys
   import shlex
   import os
   from retrying import retry
   from subprocess import CalledProcessError
   from sagemaker_inference import model_server
   
   def _retry_if_error(exception):
       return isinstance(exception, (CalledProcessError, OSError))
   
   @retry(stop_max_delay=1000 * 50,
          retry_on_exception=_retry_if_error)
   def _start_mms():
       # by default the number of workers per model is 1, but we can configure it through the
       # environment variable below if desired.
       # os.environ['SAGEMAKER_MODEL_SERVER_WORKERS'] = '2'
       model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')
   
   def main():
       if sys.argv[1] == 'serve':
           _start_mms()
       else:
           subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
   
       # prevent docker exit
       subprocess.call(['tail', '-f', '/dev/null'])
       
   main()
   ```

1. In your `Dockerfile`, copy the model handler from the first step and specify the Python file from the previous step as the entrypoint in your `Dockerfile`. The following lines are from the [Dockerfile](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/Dockerfile) used in the sample notebook:

   ```
   # Copy the default custom service file to handle incoming data and inference requests
   COPY model_handler.py /home/model-server/model_handler.py
   
   # Define an entrypoint script for the docker image
   ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
   ```

1. Build and register your container. The following shell script from the sample notebook builds the container and uploads it to an Amazon Elastic Container Registry repository in your AWS account:

   ```
   %%sh
   
   # The name of our algorithm
   algorithm_name=demo-sagemaker-multimodel
   
   cd container
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
   # Get the login command from ECR and execute it directly
   $(aws ecr get-login --region ${region} --no-include-email)
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -q -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

You can now use this container to deploy multi-model endpoints in SageMaker AI.


# Custom Containers Contract for Multi-Model Endpoints
<a name="mms-container-apis"></a>

To handle multiple models, your container must support a set of APIs that enable Amazon SageMaker AI to communicate with the container for loading, listing, getting, and unloading models as required. The `model_name` is used in the new set of APIs as the key input parameter. The customer container is expected to keep track of the loaded models using `model_name` as the mapping key. Also, the `model_name` is an opaque identifier and is not necessarily the value of the `TargetModel` parameter passed into the `InvokeEndpoint` API. The original `TargetModel` value in the `InvokeEndpoint` request is passed to the container in the APIs as an `X-Amzn-SageMaker-Target-Model` header that can be used for logging purposes.

**Note**  
Multi-model endpoints for GPU backed instances are currently supported only with SageMaker AI's [NVIDIA Triton Inference Server container](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container already implements the contract defined below. Customers can directly use this container with their multi-model GPU endpoints, without any additional work.

You can configure the following APIs on your containers for CPU backed multi-model endpoints.

**Topics**
+ [Load Model API](#multi-model-api-load-model)
+ [List Model API](#multi-model-api-list-model)
+ [Get Model API](#multi-model-api-get-model)
+ [Unload Model API](#multi-model-api-unload-model)
+ [Invoke Model API](#multi-model-api-invoke-model)

## Load Model API
<a name="multi-model-api-load-model"></a>

Instructs the container to load a particular model present in the `url` field of the body into the memory of the customer container and to keep track of it with the assigned `model_name`. After a model is loaded, the container should be ready to serve inference requests using this `model_name`.

```
POST /models HTTP/1.1
Content-Type: application/json
Accept: application/json

{
     "model_name" : "{model_name}",
     "url" : "/opt/ml/models/{model_name}/model"
}
```

**Note**  
If `model_name` is already loaded, this API should return a 409 HTTP status code. Any time a model cannot be loaded due to a lack of memory or any other resource, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim memory.
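
A minimal sketch of the status-code logic a Load Model handler might implement, assuming an in-memory registry and a hypothetical capacity limit (the function and parameter names are illustrative, not part of the contract):

```python
# Hypothetical sketch of the Load Model handler's status-code logic.
def handle_load_model(registry, model_name, url, capacity):
    """Return the HTTP status code the container should send to SageMaker AI."""
    if model_name in registry:
        return 409            # model_name is already loaded
    if len(registry) >= capacity:
        return 507            # out of resources; SageMaker AI will unload unused models
    registry[model_name] = url  # in practice: download and deserialize the artifact here
    return 200

models = {}
print(handle_load_model(models, "model_a", "/opt/ml/models/model_a/model", capacity=2))  # 200
print(handle_load_model(models, "model_a", "/opt/ml/models/model_a/model", capacity=2))  # 409
```

After a 507 response, SageMaker AI unloads unused models and retries, so the handler does not need to perform eviction itself.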

## List Model API
<a name="multi-model-api-list-model"></a>

Returns the list of models loaded into the memory of the customer container.

```
GET /models HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ]
}
```

This API also supports pagination.

```
GET /models?next_page_token={next_page_token} HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ],
    "nextPageToken" : "{next_page_token}"
}
```

SageMaker AI can initially call the List Models API without providing a value for `next_page_token`. If a `nextPageToken` field is returned as part of the response, it will be provided as the value for `next_page_token` in a subsequent List Models call. If a `nextPageToken` is not returned, it means that there are no more models to return.
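
The token handling described above can be sketched as a loop. The `fake_list_models` function below is a stand-in for the HTTP `GET /models` call, with invented page contents:

```python
# Illustrative sketch of the pagination loop a List Models caller performs.
# fake_list_models stands in for the container's GET /models endpoint.
def fake_list_models(next_page_token=None, page_size=2):
    all_models = [{"modelName": f"model_{i}",
                   "modelUrl": f"/opt/ml/models/model_{i}/model"} for i in range(5)]
    start = int(next_page_token or 0)
    page = all_models[start:start + page_size]
    response = {"models": page}
    if start + page_size < len(all_models):
        response["nextPageToken"] = str(start + page_size)
    return response

def list_all_models():
    models, token = [], None
    while True:
        response = fake_list_models(next_page_token=token)
        models.extend(response["models"])
        token = response.get("nextPageToken")
        if token is None:   # no nextPageToken: no more models to return
            return models

print(len(list_all_models()))  # 5
```

The first call passes no token; each subsequent call echoes back the `nextPageToken` from the previous response until the field is absent.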

## Get Model API
<a name="multi-model-api-get-model"></a>

This is a simple read API on the `model_name` entity.

```
GET /models/{model_name} HTTP/1.1
Accept: application/json

{
     "modelName" : "{model_name}",
     "modelUrl" : "/opt/ml/models/{model_name}/model"
}
```

**Note**  
If `model_name` is not loaded, this API should return 404.

## Unload Model API
<a name="multi-model-api-unload-model"></a>

Instructs the customer container to unload a model from memory. SageMaker AI initiates eviction of a candidate model, as determined by the platform, when it begins the process of loading a new model. The resources provisioned to `model_name` should be reclaimed by the container when this API returns a response.

```
DELETE /models/{model_name}
```

**Note**  
If `model_name` is not loaded, this API should return 404.

## Invoke Model API
<a name="multi-model-api-invoke-model"></a>

Makes a prediction request from the particular `model_name` supplied. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

```
POST /models/{model_name}/invoke HTTP/1.1
Content-Type: ContentType
Accept: Accept
X-Amzn-SageMaker-Custom-Attributes: CustomAttributes
X-Amzn-SageMaker-Target-Model: [relativePath]/{artifactName}.tar.gz
```
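
The absolute-path construction described above can be sketched as follows; the helper name and example values are illustrative, not SageMaker AI's actual implementation:

```python
# Illustrative sketch: the absolute model location is the S3 prefix from the
# CreateModel call joined with the relative path carried in the
# X-Amzn-SageMaker-Target-Model header.
def absolute_model_path(create_model_prefix, target_model):
    return create_model_prefix.rstrip("/") + "/" + target_model.lstrip("/")

print(absolute_model_path("s3://amzn-s3-demo-bucket/models/",
                          "company_a/model.tar.gz"))
# s3://amzn-s3-demo-bucket/models/company_a/model.tar.gz
```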

**Note**  
If `model_name` is not loaded, this API should return 404.

Additionally, on GPU instances, if `InvokeEndpoint` fails due to a lack of memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim memory.

# Multi-Model Endpoint Security
<a name="multi-model-endpoint-security"></a>

Models and data in a multi-model endpoint are co-located on the instance storage volume and in container memory. All instances for Amazon SageMaker AI endpoints run on a single-tenant container that you own. Only your models can run on your multi-model endpoint. It's your responsibility to manage the mapping of requests to models and to provide access for users to the correct target models. SageMaker AI uses [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) to provide IAM identity-based policies that you use to specify allowed or denied actions and resources and the conditions under which actions are allowed or denied.

By default, an IAM principal with [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html) permissions on a multi-model endpoint can invoke any model at the address of the S3 prefix defined in the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) operation, provided that the IAM execution role defined in the operation has permissions to download the model. If you need to restrict `InvokeEndpoint` access to a limited set of models in S3, you can do one of the following:
+ Restrict `InvokeEndpoint` calls to specific models hosted at the endpoint by using the `sagemaker:TargetModel` IAM condition key. For example, the following policy allows `InvokeEndpoint` requests only when the value of the `TargetModel` field matches one of the specified patterns:

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "sagemaker:InvokeEndpoint"
              ],
              "Effect": "Allow",
              "Resource":
              "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
              "Condition": {
                  "StringLike": {
                      "sagemaker:TargetModel": ["company_a/*", "common/*"]
                  }
              }
          }
      ]
  }
  ```


  For information about SageMaker AI condition keys, see [Condition Keys for Amazon SageMaker AI](https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Identity and Access Management User Guide*.
+ Create multi-model endpoints with more restrictive S3 prefixes. 
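
To build intuition for how the `StringLike` condition evaluates `sagemaker:TargetModel`, the following sketch approximates the wildcard matching with Python's `fnmatch`. This is an illustration only, not IAM's actual evaluation engine:

```python
import fnmatch

# Rough approximation of the StringLike condition: allow the request only if
# the TargetModel value matches one of the policy's wildcard patterns.
ALLOWED_PATTERNS = ["company_a/*", "common/*"]

def is_target_model_allowed(target_model, patterns=ALLOWED_PATTERNS):
    return any(fnmatch.fnmatch(target_model, p) for p in patterns)

print(is_target_model_allowed("company_a/model1.tar.gz"))  # True
print(is_target_model_allowed("company_b/model1.tar.gz"))  # False
```

Note that, as in IAM, the `*` wildcard here matches across `/` separators, so `company_a/*` covers every model under that prefix.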

For more information about how SageMaker AI uses roles to manage access to endpoints and perform operations on your behalf, see [How to use SageMaker AI execution roles](sagemaker-roles.md). Your customers might also have certain data isolation requirements dictated by their own compliance requirements that can be satisfied using IAM identities.

# CloudWatch Metrics for Multi-Model Endpoint Deployments
<a name="multi-model-endpoint-cloudwatch-metrics"></a>

Amazon SageMaker AI provides metrics for endpoints so you can monitor the cache hit rate, the number of models loaded, and the model wait times for loading, downloading, and unloading at a multi-model endpoint. Some of the metrics are different for CPU and GPU backed multi-model endpoints, so the following sections describe the Amazon CloudWatch metrics that you can use for each type of multi-model endpoint.

For more information about the metrics, see **Multi-Model Endpoint Model Loading Metrics** and **Multi-Model Endpoint Model Instance Metrics** in [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). Per-model metrics aren't supported. 

## CloudWatch metrics for CPU backed multi-model endpoints
<a name="multi-model-endpoint-cloudwatch-metrics-cpu"></a>

You can monitor the following metrics on CPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded or loaded, or both, in order to perform inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 
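
As an illustration of how the `ModelCacheHit` statistics relate, each `InvokeEndpoint` request appears to contribute a sample of 1 (model already loaded) or 0 (model had to be loaded), so the Average statistic over a period is the cache hit ratio. A minimal sketch with made-up sample values:

```python
# Illustrative sketch: ModelCacheHit samples are 1 when the target model was
# already loaded and 0 otherwise, so Average over a period is the hit ratio.
# The sample values below are made up for illustration.
cache_hit_samples = [1, 1, 0, 1, 0, 1, 1, 1]  # one sample per InvokeEndpoint request

hit_ratio = sum(cache_hit_samples) / len(cache_hit_samples)  # the "Average" statistic
print(f"Cache hit ratio: {hit_ratio:.2f}")  # 0.75
```

A low average over time suggests models are being evicted and reloaded frequently, which shows up as higher `ModelLoadingWaitTime` values.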

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 

## CloudWatch metrics for GPU multi-model endpoint deployments
<a name="multi-model-endpoint-cloudwatch-metrics-gpu"></a>

You can monitor the following metrics on GPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded or loaded, or both, in order to perform inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUUtilization |  The percentage of GPU units that are used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUMemoryUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 

# Set SageMaker AI multi-model endpoint model caching behavior
<a name="multi-model-caching"></a>

By default, multi-model endpoints cache frequently used models in memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) and on disk to provide low latency inference. The cached models are unloaded and/or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model.

You can change the caching behavior of a multi-model endpoint and explicitly enable or disable model caching by setting the parameter `ModelCacheSetting` when you call [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model).

We recommend setting the value of the `ModelCacheSetting` parameter to `Disabled` for use cases that do not benefit from model caching. For example, when a large number of models need to be served from the endpoint but each model is invoked only once (or very infrequently). For such use cases, setting the value of the `ModelCacheSetting` parameter to `Disabled` allows higher transactions per second (TPS) for `invoke_endpoint` requests compared to the default caching mode. The higher TPS results because SageMaker AI does the following after each `invoke_endpoint` request:
+ Asynchronously unloads the model from memory and deletes it from disk immediately after it is invoked.
+ Provides higher concurrency for downloading and loading models in the inference container. For both CPU and GPU backed endpoints, the concurrency is a factor of the number of the vCPUs of the container instance.
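
As a sketch, the container definition passed to `create_model` carries the cache setting in its `MultiModelConfig` field. The image, S3 location, and names below are placeholders:

```python
# Sketch of a container definition that disables model caching on a
# multi-model endpoint. Image, bucket, and model names are placeholders.
primary_container = {
    'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/myimage:mytag',
    'Mode': 'MultiModel',
    'ModelDataUrl': 's3://amzn-s3-demo-bucket/models/',
    'MultiModelConfig': {'ModelCacheSetting': 'Disabled'},
}
print(primary_container['MultiModelConfig'])
```

You would pass this dictionary as the `PrimaryContainer` argument (or as an element of `Containers`) in a `create_model` call, along with `ModelName` and `ExecutionRoleArn`.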

For guidelines on choosing a SageMaker AI ML instance type for a multi-model endpoint, see [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md).

# Set Auto Scaling Policies for Multi-Model Endpoint Deployments
<a name="multi-model-endpoints-autoscaling"></a>

SageMaker AI multi-model endpoints fully support automatic scaling, which manages replicas of models to ensure models scale based on traffic patterns. We recommend that you configure your multi-model endpoint and the size of your instances based on [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md) and also set up instance based auto scaling for your endpoint. The invocation rates used to trigger an auto-scale event are based on the aggregate set of predictions across the full set of models served by the endpoint. For additional details on setting up endpoint auto scaling, see [Automatically Scale Amazon SageMaker AI Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).

You can set up auto scaling policies with predefined and custom metrics on both CPU and GPU backed multi-model endpoints.

**Note**  
SageMaker AI multi-model endpoint metrics are available at one-minute granularity.

## Define a scaling policy
<a name="multi-model-endpoints-autoscaling-define"></a>

To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

Scaling policy configuration is represented by a JSON block. You save your scaling policy configuration as a JSON block in a text file. You use that text file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.

The following options are available for defining a target-tracking scaling policy configuration.

### Use a predefined metric
<a name="multi-model-endpoints-autoscaling-predefined"></a>

To quickly define a target-tracking scaling policy for a variant, use the `SageMakerVariantInvocationsPerInstance` predefined metric. `SageMakerVariantInvocationsPerInstance` is the average number of times per minute that each instance for a variant is invoked. We strongly recommend using this metric.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a `PredefinedMetricSpecification` for the predefined metric and a `TargetValue` for the target value of that metric.

The following example is a typical policy configuration for target-tracking scaling for a variant. In this configuration, we use the `SageMakerVariantInvocationsPerInstance` predefined metric to adjust the number of variant instances so that each instance has an `InvocationsPerInstance` metric of `70`.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```

**Note**  
We recommend that you use `InvocationsPerInstance` while using multi-model endpoints. The `TargetValue` for this metric depends on your application’s latency requirements. We also recommend that you load test your endpoints to set up suitable scaling parameter values. To learn more about load testing and setting up autoscaling for your endpoints, see the blog [Configuring autoscaling inference endpoints in Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/).

### Use a custom metric
<a name="multi-model-endpoints-autoscaling-custom"></a>

If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.

Not all SageMaker AI metrics work for target tracking. The metric must be a valid utilization metric, and it must describe how busy an instance is. The value of the metric must increase or decrease in inverse proportion to the number of variant instances. That is, the value of the metric should decrease when the number of instances increases.

**Important**  
Before deploying automatic scaling in production, you must test automatic scaling with your custom metric.

#### Example custom metric for a CPU backed multi-model endpoint
<a name="multi-model-endpoints-autoscaling-custom-cpu"></a>

The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `CPUUtilization` adjusts the instance count on the endpoint based on an average CPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```

#### Example custom metric for a GPU backed multi-model endpoint
<a name="multi-model-endpoints-autoscaling-custom-gpu"></a>

The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `GPUUtilization` adjusts the instance count on the endpoint based on an average GPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "GPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```

## Add a cooldown period
<a name="multi-model-endpoints-autoscaling-cooldown"></a>

To add a cooldown period for scaling out your endpoint, specify a value, in seconds, for `ScaleOutCooldown`. Similarly, to add a cooldown period for scaling in your model, add a value, in seconds, for `ScaleInCooldown`. For more information about `ScaleInCooldown` and `ScaleOutCooldown`, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.

The following is an example target-tracking configuration for a scaling policy. In this configuration, the `SageMakerVariantInvocationsPerInstance` predefined metric is used to adjust scaling based on an average of `70` across all instances of that variant. The configuration provides a scale-in cooldown period of 10 minutes and a scale-out cooldown period of 5 minutes.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```

# Multi-container endpoints
<a name="multi-container-endpoints"></a>

SageMaker AI multi-container endpoints enable customers to deploy multiple containers that use different models or frameworks on a single SageMaker AI endpoint. The containers can be run in a sequence as an inference pipeline, or each container can be accessed individually by using direct invocation to improve endpoint utilization and optimize costs.

For information about invoking the containers in a multi-container endpoint in sequence, see [Inference pipelines in Amazon SageMaker AI](inference-pipelines.md).

For information about invoking a specific container in a multi-container endpoint, see [Invoke a multi-container endpoint with direct invocation](multi-container-direct.md).

**Topics**
+ [Create a multi-container endpoint (Boto 3)](multi-container-create.md)
+ [Update a multi-container endpoint](multi-container-update.md)
+ [Invoke a multi-container endpoint with direct invocation](multi-container-direct.md)
+ [Security with multi-container endpoints with direct invocation](multi-container-security.md)
+ [Metrics for multi-container endpoints with direct invocation](multi-container-metrics.md)
+ [Autoscale multi-container endpoints](multi-container-auto-scaling.md)
+ [Troubleshoot multi-container endpoints](multi-container-troubleshooting.md)

# Create a multi-container endpoint (Boto 3)
<a name="multi-container-create"></a>

Create a multi-container endpoint by calling the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html), [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html), and [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) APIs as you would to create any other endpoint. You can run these containers sequentially as an inference pipeline, or run each individual container by using direct invocation. Multi-container endpoints have the following requirements when you call `create_model`:
+ Use the `Containers` parameter instead of `PrimaryContainer`, and include more than one container in the `Containers` parameter.
+ The `ContainerHostname` parameter is required for each container in a multi-container endpoint with direct invocation.
+ Set the `Mode` parameter of the `InferenceExecutionConfig` field to `Direct` for direct invocation of each container, or `Serial` to use containers as an inference pipeline. The default mode is `Serial`. 

**Note**  
Currently, a multi-container endpoint supports up to 15 containers.

The following example creates a multi-container model for direct invocation.

1. Create container elements and `InferenceExecutionConfig` with direct invocation.

   ```
   container1 = {
                    'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/myimage1:mytag',
                    'ContainerHostname': 'firstContainer'
                }
   
   container2 = {
                    'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/myimage2:mytag',
                    'ContainerHostname': 'secondContainer'
                }
   inferenceExecutionConfig = {'Mode': 'Direct'}
   ```

1. Create the model with the container elements and set the `InferenceExecutionConfig` field.

   ```
   import boto3
   sm_client = boto3.Session().client('sagemaker')
   
   response = sm_client.create_model(
                  ModelName = 'my-direct-mode-model-name',
                  InferenceExecutionConfig = inferenceExecutionConfig,
                  ExecutionRoleArn = role,
                  Containers = [container1, container2]
              )
   ```

To create an endpoint, you would then call [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) as you would to create any other endpoint.

# Update a multi-container endpoint
<a name="multi-container-update"></a>

To update an Amazon SageMaker AI multi-container endpoint, complete the following steps.

1.  Call [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) to create a new model with a new value for the `Mode` parameter in the `InferenceExecutionConfig` field.

1.  Call [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) to create a new endpoint config with a different name by using the new model you created in the previous step.

1.  Call [update_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_endpoint) to update the endpoint with the new endpoint config you created in the previous step.

# Invoke a multi-container endpoint with direct invocation
<a name="multi-container-direct"></a>

SageMaker AI multi-container endpoints enable customers to deploy multiple containers that serve different models on a single SageMaker AI endpoint. You can host up to 15 different inference containers on a single endpoint. By using direct invocation, you can send a request to a specific inference container hosted on a multi-container endpoint.

To invoke a multi-container endpoint with direct invocation, call [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) as you would invoke any other endpoint, and specify which container you want to invoke by using the `TargetContainerHostname` parameter.

The following example directly invokes the `secondContainer` of a multi-container endpoint to get a prediction.

```
import boto3

runtime_sm_client = boto3.Session().client('sagemaker-runtime')

# CSV payload for the model hosted in the target container
body = '1.0,2.0,5.0'

response = runtime_sm_client.invoke_endpoint(
    EndpointName='my-endpoint',
    ContentType='text/csv',
    TargetContainerHostname='secondContainer',
    Body=body)

print(response['Body'].read())
```

For each direct invocation request to a multi-container endpoint, only the container with the `TargetContainerHostname` processes the invocation request. You will get validation errors if you do any of the following:
+ Specify a `TargetContainerHostname` that does not exist in the endpoint
+ Do not specify a value for `TargetContainerHostname` in a request to an endpoint configured for direct invocation
+ Specify a value for `TargetContainerHostname` in a request to an endpoint that is not configured for direct invocation

# Security with multi-container endpoints with direct invocation
<a name="multi-container-security"></a>

 For multi-container endpoints with direct invocation, there are multiple containers hosted in a single instance by sharing memory and a storage volume. It's your responsibility to use secure containers, maintain the correct mapping of requests to target containers, and provide users with the correct access to target containers. SageMaker AI uses IAM roles to provide IAM identity-based policies that you use to specify whether access to a resource is allowed or denied to that role, and under what conditions. For information about IAM roles, see [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) in the *AWS Identity and Access Management User Guide*. For information about identity-based policies, see [Identity-based policies and resource-based policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html).

By default, an IAM principal with `InvokeEndpoint` permissions on a multi-container endpoint with direct invocation can invoke any container inside the endpoint with the endpoint name that you specify when you call `invoke_endpoint`. If you need to restrict `invoke_endpoint` access to a limited set of containers inside a multi-container endpoint, use the `sagemaker:TargetContainerHostname` IAM condition key. The following policies show how to limit calls to specific containers within an endpoint.

The following policy allows `invoke_endpoint` requests only when the value of the `TargetContainerHostname` field matches one of the specified regular expressions.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetModel": [
                        "customIps*",
                        "common*"
                    ]
                }
            }
        }
    ]
}
```

------

The following policy denies `invoke_endpoint` requests when the value of the `TargetContainerHostname` field matches one of the specified regular expressions in the `Deny` statement.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetModel": [
                        "model_name*"
                    ]
                }
            }
        },
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Deny",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetModel": [
                        "special-model_name*"
                    ]
                }
            }
        }
    ]
}
```

------

 For information about SageMaker AI condition keys, see [Condition Keys for SageMaker AI](https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Identity and Access Management User Guide*.

# Metrics for multi-container endpoints with direct invocation
<a name="multi-container-metrics"></a>

In addition to the endpoint metrics that are listed in [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md), SageMaker AI also provides per-container metrics.

Per-container metrics for multi-container endpoints with direct invocation are located in CloudWatch and categorized into two namespaces: `AWS/SageMaker` and `aws/sagemaker/Endpoints`. The `AWS/SageMaker` namespace includes invocation-related metrics, and the `aws/sagemaker/Endpoints` namespace includes memory and CPU utilization metrics.

The following table lists the per-container metrics for multi-container endpoints with direct invocation. All of these metrics use the [`EndpointName`, `VariantName`, `ContainerName`] dimension, which filters metrics to a specific endpoint, variant, and container. The metric names are the same as those for inference pipelines, but reported at the per-container level.

 


| Metric Name | Description | Dimension | Namespace |
| --- | --- | --- | --- |
| Invocations | The number of `InvokeEndpoint` requests sent to a container inside an endpoint. To get the total number of requests sent to that container, use the Sum statistic. Units: None. Valid statistics: Sum, Sample Count | EndpointName, VariantName, ContainerName | AWS/SageMaker |
| Invocation4XXErrors | The number of `InvokeEndpoint` requests for which the model returned a 4xx HTTP response code on a specific container. For each 4xx response, SageMaker AI sends a 1. Units: None. Valid statistics: Average, Sum | EndpointName, VariantName, ContainerName | AWS/SageMaker |
| Invocation5XXErrors | The number of `InvokeEndpoint` requests for which the model returned a 5xx HTTP response code on a specific container. For each 5xx response, SageMaker AI sends a 1. Units: None. Valid statistics: Average, Sum | EndpointName, VariantName, ContainerName | AWS/SageMaker |
| ContainerLatency | The time it took for the target container to respond as viewed from SageMaker AI. ContainerLatency includes the time it took to send the request, to fetch the response from the model's container, and to complete inference in the container. Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | EndpointName, VariantName, ContainerName | AWS/SageMaker |
| OverheadLatency | The time added by SageMaker AI overhead to the time taken to respond to a client request. OverheadLatency is measured from the time that SageMaker AI receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on request and response payload sizes, request frequency, and authentication or authorization of the request, among other factors. Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | EndpointName, VariantName, ContainerName | AWS/SageMaker |
| CPUUtilization | The percentage of CPU units that are used by each container running on an instance. The value ranges from 0% to 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. For endpoints with direct invocation, the number of CPUUtilization metrics equals the number of containers in that endpoint. Units: Percent | EndpointName, VariantName, ContainerName | aws/sagemaker/Endpoints |
| MemoryUtilization | The percentage of memory that is used by each container running on an instance. This value ranges from 0% to 100%. As with CPUUtilization, in endpoints with direct invocation, the number of MemoryUtilization metrics equals the number of containers in that endpoint. Units: Percent | EndpointName, VariantName, ContainerName | aws/sagemaker/Endpoints |

All the metrics in the previous table are specific to multi-container endpoints with direct invocation. Besides these per-container metrics, there are also metrics at the variant level with dimension `[EndpointName, VariantName]` for all the metrics in the table except `ContainerLatency`.
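As a sketch (the endpoint, variant, and container names below are hypothetical), you can retrieve one of these per-container metrics from CloudWatch by passing all three dimensions. The helper only builds the `GetMetricStatistics` request; the commented lines show how it would be sent with Boto3.

```python
from datetime import datetime, timedelta, timezone

def container_metric_query(endpoint, variant, container,
                           metric='ContainerLatency', namespace='AWS/SageMaker'):
    """Build a CloudWatch GetMetricStatistics request for one container's metric
    over the last hour."""
    now = datetime.now(timezone.utc)
    return {
        'Namespace': namespace,
        'MetricName': metric,
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint},
            {'Name': 'VariantName', 'Value': variant},
            {'Name': 'ContainerName', 'Value': container},
        ],
        'StartTime': now - timedelta(hours=1),
        'EndTime': now,
        'Period': 60,                      # metrics are emitted per minute
        'Statistics': ['Average', 'Maximum'],
    }

# cw = boto3.client('cloudwatch')
# stats = cw.get_metric_statistics(**container_metric_query(
#     'my-endpoint', 'AllTraffic', 'secondContainer'))
```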

# Autoscale multi-container endpoints
<a name="multi-container-auto-scaling"></a>

If you configure automatic scaling for a multi-container endpoint using the `InvocationsPerInstance` metric, we recommend that the model in each container exhibit similar CPU utilization and latency on each inference request. If traffic to the multi-container endpoint shifts from a low CPU utilization model to a high CPU utilization model while the overall call volume remains the same, the endpoint does not scale out, and there might not be enough instances to handle all the requests to the high CPU utilization model. For information about automatically scaling endpoints, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md).
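When the per-request costs are comparable, a target-tracking policy on `InvocationsPerInstance` can be registered with Application Auto Scaling. The following sketch only assembles the request parameters (endpoint, variant, and policy names are placeholders); the commented calls show how they would be used.

```python
def invocations_scaling_requests(endpoint_name, variant_name,
                                 min_capacity=1, max_capacity=4,
                                 invocations_per_instance=100.0):
    """Build Application Auto Scaling requests for target tracking on
    the InvocationsPerInstance metric of one production variant."""
    resource_id = 'endpoint/{}/variant/{}'.format(endpoint_name, variant_name)
    register_target = {
        'ServiceNamespace': 'sagemaker',
        'ResourceId': resource_id,
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
        'MinCapacity': min_capacity,
        'MaxCapacity': max_capacity,
    }
    put_policy = {
        'PolicyName': 'invocations-target-tracking',   # hypothetical name
        'ServiceNamespace': 'sagemaker',
        'ResourceId': resource_id,
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
        'PolicyType': 'TargetTrackingScaling',
        'TargetTrackingScalingPolicyConfiguration': {
            'TargetValue': invocations_per_instance,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
            },
        },
    }
    return register_target, put_policy

# aas = boto3.client('application-autoscaling')
# target, policy = invocations_scaling_requests('my-endpoint', 'AllTraffic')
# aas.register_scalable_target(**target)
# aas.put_scaling_policy(**policy)
```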

# Troubleshoot multi-container endpoints
<a name="multi-container-troubleshooting"></a>

The following sections can help you troubleshoot errors with multi-container endpoints.

## Ping Health Check Errors
<a name="multi-container-ping-errors"></a>

 With multiple containers, endpoint memory and CPU are under higher pressure during endpoint creation. Specifically, the `MemoryUtilization` and `CPUUtilization` metrics are higher than for single-container endpoints, because utilization pressure is proportional to the number of containers. Because of this, we recommend that you choose instance types with enough memory and CPU to ensure that there is enough memory on the instance to have all the models loaded (the same guidance applies to deploying an inference pipeline). Otherwise, your endpoint creation might fail with an error such as `XXX did not pass the ping health check`.

## Missing accept-bind-to-port=true Docker label
<a name="multi-container-missing-accept"></a>

The containers in a multi-container endpoint listen on the port specified in the `SAGEMAKER_BIND_TO_PORT` environment variable instead of port 8080. When a container runs in a multi-container endpoint, SageMaker AI automatically provides this environment variable to the container. If this environment variable isn't present, containers default to using port 8080. To indicate that your container complies with this requirement, use the following command to add a label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```

Otherwise, you will see an error message such as `Your Ecr Image XXX does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).`

 If your container needs to listen on a second port, choose a port in the range specified by the `SAGEMAKER_SAFE_PORT_RANGE` environment variable. Specify the value as an inclusive range in the format *XXXX*-*YYYY*, where XXXX and YYYY are multi-digit integers. SageMaker AI provides this value automatically when you run the container in a multi-container endpoint. 
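For example, a container entrypoint might resolve its ports from these environment variables. This minimal helper assumes only the variable names and formats documented above; the function name is illustrative.

```python
import os

def serving_ports(env=None):
    """Resolve the primary listening port and the optional safe port range
    for a container running in a multi-container endpoint.

    Returns (primary_port, safe_range), where safe_range is an inclusive
    (low, high) tuple or None if SAGEMAKER_SAFE_PORT_RANGE is not set.
    """
    env = os.environ if env is None else env
    # SageMaker AI injects SAGEMAKER_BIND_TO_PORT; default to 8080 otherwise.
    primary = int(env.get('SAGEMAKER_BIND_TO_PORT', '8080'))
    safe_range = None
    raw = env.get('SAGEMAKER_SAFE_PORT_RANGE')
    if raw:
        # Format is "XXXX-YYYY", an inclusive range for any extra listener.
        low, high = (int(p) for p in raw.split('-'))
        safe_range = (low, high)
    return primary, safe_range
```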

# Inference pipelines in Amazon SageMaker AI
<a name="inference-pipelines"></a>

An *inference pipeline* is an Amazon SageMaker AI model that is composed of a linear sequence of two to fifteen containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained SageMaker AI built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.

You can add SageMaker AI Spark ML Serving and scikit-learn containers that reuse the data transformers developed for training models. The entire assembled inference pipeline can be considered as a SageMaker AI model that you can use to make either real-time predictions or to process batch transforms directly without any external preprocessing. 

Within an inference pipeline model, SageMaker AI handles invocations as a sequence of HTTP requests. The first container in the pipeline handles the initial request, then the intermediate response is sent as a request to the second container, and so on, for each container in the pipeline. SageMaker AI returns the final response to the client. 

When you deploy the pipeline model, SageMaker AI installs and runs all of the containers on each Amazon Elastic Compute Cloud (Amazon EC2) instance in the endpoint or transform job. Feature processing and inferences run with low latency because the containers are co-located on the same EC2 instances. You define the containers for a pipeline model using the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) operation or from the console. Instead of setting one `PrimaryContainer`, you use the `Containers` parameter to set the containers that make up the pipeline. You also specify the order in which the containers are executed.
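For instance, a minimal `CreateModel` request for a pipeline model sets the ordered `Containers` list rather than `PrimaryContainer`. The helper below only builds the request; the image URIs and role ARN in the commented usage are placeholders.

```python
def pipeline_model_request(model_name, role_arn, containers):
    """Build a CreateModel request for a pipeline model.

    containers: ordered list of container definitions, for example
    {'Image': '<ECR image URI>', 'ModelDataUrl': '<S3 model artifacts>'}.
    The containers are executed in the order given.
    """
    return {
        'ModelName': model_name,
        'ExecutionRoleArn': role_arn,
        'Containers': containers,   # ordered list; no PrimaryContainer here
    }

# sm = boto3.client('sagemaker')
# sm.create_model(**pipeline_model_request(
#     'my-pipeline-model',
#     'arn:aws:iam::111122223333:role/SageMakerRole',
#     [{'Image': preprocess_image_uri},
#      {'Image': xgboost_image_uri, 'ModelDataUrl': model_artifacts_s3_uri}]))
```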

A pipeline model is immutable, but you can update an inference pipeline by deploying a new one using the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) operation. This modularity supports greater flexibility during experimentation.

For information on how to create an inference pipeline with the SageMaker Model Registry, see [Model Registration Deployment with Model Registry](model-registry.md).

There are no additional costs for using this feature. You pay only for the instances running on an endpoint.

**Topics**
+ [Sample Notebooks for Inference Pipelines](#inference-pipeline-sample-notebooks)
+ [Feature Processing with Spark ML and Scikit-learn](inference-pipeline-mleap-scikit-learn-containers.md)
+ [Create a Pipeline Model](inference-pipeline-create-console.md)
+ [Run Real-time Predictions with an Inference Pipeline](inference-pipeline-real-time.md)
+ [Batch transforms with inference pipelines](inference-pipeline-batch.md)
+ [Inference Pipeline Logs and Metrics](inference-pipeline-logs-metrics.md)
+ [Troubleshoot Inference Pipelines](inference-pipeline-troubleshoot.md)

## Sample Notebooks for Inference Pipelines
<a name="inference-pipeline-sample-notebooks"></a>

For an example that shows how to create and deploy inference pipelines, see the [Inference Pipeline with Scikit-learn and Linear Learner](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-python-sdk/scikit_learn_inference_pipeline) sample notebook. For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). 

To see a list of all the SageMaker AI samples, after creating and opening a notebook instance, choose the **SageMaker AI Examples** tab. There are three inference pipeline notebooks: two are located in the `advanced_functionality` folder, and the third is in the `sagemaker-python-sdk` folder. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# Feature Processing with Spark ML and Scikit-learn
<a name="inference-pipeline-mleap-scikit-learn-containers"></a>

Before training a model with either Amazon SageMaker AI built-in algorithms or custom algorithms, you can use Spark and scikit-learn preprocessors to transform your data and engineer features. 

## Feature Processing with Spark ML
<a name="feature-processing-spark"></a>

You can run Spark ML jobs with [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html), a serverless ETL (extract, transform, load) service, from your SageMaker AI notebook. You can also connect to existing EMR clusters to run Spark ML jobs with [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). To do this, you need an AWS Identity and Access Management (IAM) role that grants permission for making calls from your SageMaker AI notebook to AWS Glue. 

**Note**  
To see which Python and Spark versions AWS Glue supports, refer to [AWS Glue Release Notes](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html).

After engineering features, you package and serialize Spark ML jobs with MLeap into MLeap containers that you can add to an inference pipeline. You don't need to use externally managed Spark clusters. With this approach, you can seamlessly scale from a sample of rows to terabytes of data. The same transformers work for both training and inference, so you don't need to duplicate preprocessing and feature engineering logic or develop a one-time solution to make the models persist. With inference pipelines, you don't need to maintain outside infrastructure, and you can make predictions directly from data inputs.

When you run a Spark ML job on AWS Glue, a Spark ML pipeline is serialized into [MLeap](https://github.com/combust/mleap) format. Then, you can use the job with the [SparkML Model Serving Container](https://github.com/aws/sagemaker-sparkml-serving-container) in a SageMaker AI Inference Pipeline. *MLeap* is a serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn, and TensorFlow for training pipelines and exporting them to a serialized pipeline called an MLeap Bundle. You can deserialize Bundles back into Spark for batch-mode scoring or into the MLeap runtime to power real-time API services. 

For an example that shows how to feature process with Spark ML, see the [Train an ML Model using Apache Spark in Amazon EMR and deploy in SageMaker AI](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-python-sdk/sparkml_serving_emr_mleap_abalone) sample notebook.

## Feature Processing with Scikit-Learn
<a name="feature-processing-with-scikit"></a>

You can run and package scikit-learn jobs into containers directly in Amazon SageMaker AI. For an example of Python code for building a scikit-learn featurizer model that trains on [Fisher's Iris flower data set](http://archive.ics.uci.edu/ml/datasets/Iris) and predicts the species of Iris based on morphological measurements, see [IRIS Training and Prediction with Sagemaker Scikit-learn](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_iris). 

# Create a Pipeline Model
<a name="inference-pipeline-create-console"></a>

To create a pipeline model that can be deployed to an endpoint or used for a batch transform job, use the Amazon SageMaker AI console or the `CreateModel` operation. 

**To create an inference pipeline (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Models**, and then choose **Create models** from the **Inference** group. 

1. On the **Create model** page, provide a model name, choose an IAM role, and, if you want to use a private VPC, specify VPC values.   
![\[The page for creating a model for an Inference Pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/create-pipeline-model.png)

1. To add information about the containers in the inference pipeline, choose **Add container**, then choose **Next**.

1. Complete the fields for each container in the order that you want to execute them, up to the maximum of fifteen. Complete the **Container input options** and **Location of inference code image** fields, and, optionally, the **Location of model artifacts**, **Container host name**, and **Environmental variables** fields.  
![\[Creating a pipeline model with containers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/create-pipeline-model-containers.png)

   The **MyInferencePipelineModel** page summarizes the settings for the containers that provide input for the model. If you provided the environment variables in a corresponding container definition, SageMaker AI shows them in the **Environment variables** field.  
![\[The summary of container settings for the pipeline model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-MyInferencePipelinesModel-recap.png)

# Run Real-time Predictions with an Inference Pipeline
<a name="inference-pipeline-real-time"></a>

You can use trained models in an inference pipeline to make real-time predictions directly without performing external preprocessing. When you configure the pipeline, you can choose to use the built-in feature transformers already available in Amazon SageMaker AI. Or, you can implement your own transformation logic using just a few lines of scikit-learn or Spark code. 

[MLeap](https://combust.github.io/mleap-docs/), a serialization format and execution engine for machine learning pipelines, supports Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to a serialized pipeline called an MLeap Bundle. You can deserialize Bundles back into Spark for batch-mode scoring or into the MLeap runtime to power real-time API services.

The containers in a pipeline listen on the port specified in the `SAGEMAKER_BIND_TO_PORT` environment variable (instead of 8080). When running in an inference pipeline, SageMaker AI automatically provides this environment variable to containers. If this environment variable isn't present, containers default to using port 8080. To indicate that your container complies with this requirement, use the following command to add a label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```

If your container needs to listen on a second port, choose a port in the range specified by the `SAGEMAKER_SAFE_PORT_RANGE` environment variable. Specify the value as an inclusive range in the format *XXXX*-*YYYY*, where `XXXX` and `YYYY` are multi-digit integers. SageMaker AI provides this value automatically when you run the container in a multi-container pipeline.

**Note**  
To use custom Docker images in a pipeline that includes [SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon Elastic Container Registry (Amazon ECR) policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). Your Amazon ECR repository must grant SageMaker AI permission to pull the image. For more information, see [Troubleshoot Amazon ECR Permissions for Inference Pipelines](inference-pipeline-troubleshoot.md#inference-pipeline-troubleshoot-permissions).

## Create and Deploy an Inference Pipeline Endpoint
<a name="inference-pipeline-real-time-sdk"></a>

The following code creates and deploys a real-time inference pipeline model with SparkML and XGBoost models in series using the SageMaker AI SDK.

```
from sagemaker.model import Model
from sagemaker.pipeline_model import PipelineModel
from sagemaker.sparkml.model import SparkMLModel

# S3 location of the serialized Spark ML (MLeap) model artifacts
sparkml_data = 's3://{}/{}/{}'.format(s3_model_bucket, s3_model_key_prefix, 'model.tar.gz')
sparkml_model = SparkMLModel(model_data=sparkml_data)

# Model artifacts and container image from a previously trained XGBoost estimator
xgb_model = Model(model_data=xgb_estimator.model_data, image=training_image)

model_name = 'serial-inference-' + timestamp_prefix
endpoint_name = 'serial-inference-ep-' + timestamp_prefix

# Containers are invoked in the order given in the models argument
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])
sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
```

## Request Real-Time Inference from an Inference Pipeline Endpoint
<a name="inference-pipeline-endpoint-request"></a>

The following example shows how to make real-time predictions by calling an inference endpoint and passing a request payload in JSON format:

```
import sagemaker
from sagemaker.predictor import json_serializer, json_deserializer, Predictor

payload = {
        "input": [
            {
                "name": "Pclass",
                "type": "float",
                "val": "1.0"
            },
            {
                "name": "Embarked",
                "type": "string",
                "val": "Q"
            },
            {
                "name": "Age",
                "type": "double",
                "val": "48.0"
            },
            {
                "name": "Fare",
                "type": "double",
                "val": "100.67"
            },
            {
                "name": "SibSp",
                "type": "double",
                "val": "1.0"
            },
            {
                "name": "Sex",
                "type": "string",
                "val": "male"
            }
        ],
        "output": {
            "name": "features",
            "type": "double",
            "struct": "vector"
        }
    }

# The payload above is JSON with an embedded schema, so the request
# content type must be application/json to match.
predictor = Predictor(endpoint=endpoint_name, sagemaker_session=sagemaker.Session(),
                      serializer=json_serializer, content_type='application/json',
                      accept='application/json')

print(predictor.predict(payload))
```

The response you get from `predictor.predict(payload)` is the model's inference result.

## Realtime inference pipeline example
<a name="inference-pipeline-example"></a>

You can run this [example notebook using the SKLearn predictor](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb) that shows how to deploy an endpoint, run an inference request, then deserialize the response. Find this notebook and more examples in the [Amazon SageMaker example GitHub repository](https://github.com/awslabs/amazon-sagemaker-examples).

# Batch transforms with inference pipelines
<a name="inference-pipeline-batch"></a>

To get inferences on an entire dataset, run a batch transform on a trained model. You can use the same inference pipeline model that you created and deployed to an endpoint for real-time processing in a batch transform job. In a batch transform job, SageMaker AI downloads the input data from Amazon S3 and sends it in one or more HTTP requests to the inference pipeline model. For an example that shows how to prepare data for a batch transform, see "Section 2 - Preprocess the raw housing data using Scikit Learn" of the [Amazon SageMaker Multi-Model Endpoints using Linear Learner sample notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_linear_learner_home_value). For information about Amazon SageMaker AI batch transforms, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md). 

**Note**  
To use custom Docker images in a pipeline that includes [Amazon SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon Elastic Container Registry (ECR) policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). Your Amazon ECR repository must grant SageMaker AI permission to pull the image. For more information, see [Troubleshoot Amazon ECR Permissions for Inference Pipelines](inference-pipeline-troubleshoot.md#inference-pipeline-troubleshoot-permissions).

The following example shows how to run a transform job using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). In this example, `model_name` is the inference pipeline that combines SparkML and XGBoost models (created in previous examples). The Amazon S3 location specified by `input_data_path` contains the input data, in CSV format, to be downloaded and sent to the Spark ML model. After the transform job has finished, the Amazon S3 location specified by `output_data_path` contains the output data returned by the XGBoost model in CSV format.

```
import sagemaker

CONTENT_TYPE_CSV = 'text/csv'

input_data_path = 's3://{}/{}/{}'.format(default_bucket, 'key', 'file_name')
output_data_path = 's3://{}/{}'.format(default_bucket, 'key')
transform_job = sagemaker.transformer.Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='SingleRecord',
    assemble_with='Line',
    output_path=output_data_path,
    base_transform_job_name='inference-pipelines-batch',
    sagemaker_session=sagemaker.Session(),
    accept=CONTENT_TYPE_CSV)
transform_job.transform(data=input_data_path,
                        content_type=CONTENT_TYPE_CSV,
                        split_type='Line')
```

# Inference Pipeline Logs and Metrics
<a name="inference-pipeline-logs-metrics"></a>

Monitoring is important for maintaining the reliability, availability, and performance of Amazon SageMaker AI resources. To monitor and troubleshoot inference pipeline performance, use Amazon CloudWatch logs and error messages. For information about the monitoring tools that SageMaker AI provides, see [Monitoring AWS resources in Amazon SageMaker AI](monitoring-overview.md).

## Use Metrics to Monitor Multi-container Models
<a name="inference-pipeline-metrics"></a>

To monitor the multi-container models in Inference Pipelines, use Amazon CloudWatch. CloudWatch collects raw data and processes it into readable, near real-time metrics. SageMaker AI training jobs and endpoints write CloudWatch metrics and logs in the `AWS/SageMaker` namespace. 

The following tables list the metrics and dimensions for the following:
+ Endpoint invocations
+ Training jobs, batch transform jobs, and endpoint instances

A *dimension* is a name/value pair that uniquely identifies a metric. You can assign up to 10 dimensions to a metric. For more information on monitoring with CloudWatch, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). 

**Endpoint Invocation Metrics**

The `AWS/SageMaker` namespace includes the following request metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are reported at 1-minute intervals.


| Metric | Description | 
| --- | --- | 
| Invocation4XXErrors |  The number of `InvokeEndpoint` requests that the model returned a `4xx` HTTP response code for. For each `4xx` response, SageMaker AI sends a `1`. Units: None Valid statistics: `Average`, `Sum`  | 
| Invocation5XXErrors |  The number of `InvokeEndpoint` requests that the model returned a `5xx` HTTP response code for. For each `5xx` response, SageMaker AI sends a `1`. Units: None Valid statistics: `Average`, `Sum`  | 
| Invocations |  The number of `InvokeEndpoint` requests sent to a model endpoint. To get the total number of requests sent to a model endpoint, use the `Sum` statistic. Units: None Valid statistics: `Sum`, `Sample Count`  | 
| InvocationsPerInstance |  The number of endpoint invocations sent to a model, normalized by `InstanceCount` in each `ProductionVariant`. SageMaker AI sends 1/`numberOfInstances` as the value for each request, where `numberOfInstances` is the number of active instances for the ProductionVariant at the endpoint at the time of the request. Units: None Valid statistics: `Sum`  | 
| ModelLatency |  The time the model or models took to respond. This includes the time it took to send the request, to fetch the response from the model container, and to complete the inference in the container. `ModelLatency` is the total time taken by all containers in an inference pipeline. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count`  | 
| OverheadLatency |  The time added to the time taken to respond to a client request by SageMaker AI for overhead. `OverheadLatency` is measured from the time that SageMaker AI receives the request until it returns a response to the client, minus the `ModelLatency`. Overhead latency can vary depending on request and response payload sizes, request frequency, and authentication or authorization of the request, among other factors. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count`  | 
| ContainerLatency |  The time it took for an Inference Pipelines container to respond as viewed from SageMaker AI. `ContainerLatency` includes the time it took to send the request, to fetch the response from the model's container, and to complete inference in the container. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count`  | 

**Dimensions for Endpoint Invocation Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName, ContainerName |  Filters endpoint invocation metrics for a `ProductionVariant` at the specified endpoint and for the specified variant.  | 
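
These dimensions can be combined with the CloudWatch `GetMetricStatistics` API to pull per-container latency for a pipeline. The following sketch only builds the request parameters; the endpoint, variant, and container names are placeholders, and the commented lines show how you might pass the parameters to a `boto3` CloudWatch client once credentials are configured.

```python
def container_latency_query(endpoint_name, variant_name, container_name,
                            start_time, end_time):
    """Build GetMetricStatistics parameters for the per-container
    ContainerLatency metric of an inference pipeline endpoint."""
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ContainerLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
            {"Name": "ContainerName", "Value": container_name},
        ],
        "StartTime": start_time,
        "EndTime": end_time,
        "Period": 60,  # metrics are reported at 1-minute intervals
        "Statistics": ["Average", "Max"],
        "Unit": "Microseconds",
    }

# With AWS credentials configured:
# import boto3, datetime
# cloudwatch = boto3.client("cloudwatch")
# end = datetime.datetime.utcnow()
# stats = cloudwatch.get_metric_statistics(
#     **container_latency_query("MyEndpoint", "MyVariant", "MyContainerName1",
#                               end - datetime.timedelta(hours=1), end))
```
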

For an inference pipeline endpoint, CloudWatch lists per-container latency metrics in your account as **Endpoint Container Metrics** and **Endpoint Variant Metrics** in the **SageMaker AI** namespace, as follows. The `ContainerLatency` metric appears only for inference pipelines.

![\[The CloudWatch dashboard for an inference pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-endpoint-metrics.png)


For each endpoint and each container, latency metrics display names for the container, endpoint, variant, and metric.

![\[The latency metrics for an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-endpoint-metrics-details.png)


**Training Job, Batch Transform Job, and Endpoint Instance Metrics**

The namespaces `/aws/sagemaker/TrainingJobs`, `/aws/sagemaker/TransformJobs`, and `/aws/sagemaker/Endpoints` include the following metrics for training jobs and endpoint instances.

Metrics are reported at 1-minute intervals.


| Metric | Description | 
| --- | --- | 
| CPUUtilization |  The percentage of CPU units that are used by the containers running on an instance. The value ranges from 0% to 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, `CPUUtilization` can range from 0% to 400%. For training jobs, `CPUUtilization` is the CPU utilization of the algorithm container running on the instance. For batch transform jobs, `CPUUtilization` is the CPU utilization of the transform container running on the instance. For multi-container models, `CPUUtilization` is the sum of CPU utilization by all containers running on the instance. For endpoint variants, `CPUUtilization` is the sum of CPU utilization by all of the containers running on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers running on an instance. This value ranges from 0% to 100%. For training jobs, `MemoryUtilization` is the memory used by the algorithm container running on the instance. For batch transform jobs, `MemoryUtilization` is the memory used by the transform container running on the instance. For multi-container models, `MemoryUtilization` is the sum of memory used by all containers running on the instance. For endpoint variants, `MemoryUtilization` is the sum of memory used by all of the containers running on the instance. Units: Percent  | 
| GPUUtilization |  The percentage of GPU units that are used by the containers running on an instance. `GPUUtilization` ranges from 0% to 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUUtilization` can range from 0% to 400%. For training jobs, `GPUUtilization` is the GPU used by the algorithm container running on the instance. For batch transform jobs, `GPUUtilization` is the GPU used by the transform container running on the instance. For multi-container models, `GPUUtilization` is the sum of GPU used by all containers running on the instance. For endpoint variants, `GPUUtilization` is the sum of GPU used by all of the containers running on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers running on an instance. `GPUMemoryUtilization` ranges from 0% to 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUMemoryUtilization` can range from 0% to 400%. For training jobs, `GPUMemoryUtilization` is the GPU memory used by the algorithm container running on the instance. For batch transform jobs, `GPUMemoryUtilization` is the GPU memory used by the transform container running on the instance. For multi-container models, `GPUMemoryUtilization` is the sum of GPU memory used by all containers running on the instance. For endpoint variants, `GPUMemoryUtilization` is the sum of the GPU memory used by all of the containers running on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers running on an instance. DiskUtilization ranges from 0% to 100%. This metric is not supported for batch transform jobs. For training jobs, `DiskUtilization` is the disk space used by the algorithm container running on the instance. For endpoint variants, `DiskUtilization` is the sum of the disk space used by all of the provided containers running on the instance. Units: Percent  | 

**Dimensions for Training Job, Batch Transform Job, and Endpoint Instance Metrics**


| Dimension | Description | 
| --- | --- | 
| Host |  For training jobs, `Host` has the format `[training-job-name]/algo-[instance-number-in-cluster]`. Use this dimension to filter instance metrics for the specified training job and instance. This dimension format is present only in the `/aws/sagemaker/TrainingJobs` namespace. For batch transform jobs, `Host` has the format `[transform-job-name]/[instance-id]`. Use this dimension to filter instance metrics for the specified batch transform job and instance. This dimension format is present only in the `/aws/sagemaker/TransformJobs` namespace. For endpoints, `Host` has the format `[endpoint-name]/[ production-variant-name ]/[instance-id]`. Use this dimension to filter instance metrics for the specified endpoint, variant, and instance. This dimension format is present only in the `/aws/sagemaker/Endpoints` namespace.  | 
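
Because the `Host` format differs per namespace, a small helper can make metric queries against these namespaces less error-prone. This is an illustrative sketch that encodes the three patterns described in the table above; the job and endpoint names passed in are placeholders.

```python
def host_dimension(namespace, **parts):
    """Format the Host dimension value for each SageMaker AI namespace,
    following the patterns in the table above."""
    if namespace == "/aws/sagemaker/TrainingJobs":
        return f"{parts['job_name']}/algo-{parts['instance_number']}"
    if namespace == "/aws/sagemaker/TransformJobs":
        return f"{parts['job_name']}/{parts['instance_id']}"
    if namespace == "/aws/sagemaker/Endpoints":
        return (f"{parts['endpoint_name']}/{parts['variant_name']}"
                f"/{parts['instance_id']}")
    raise ValueError(f"unknown namespace: {namespace}")
```
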

To help you debug your training jobs, endpoints, and notebook instance lifecycle configurations, SageMaker AI also sends anything an algorithm container, a model container, or a notebook instance lifecycle configuration sends to `stdout` or `stderr` to Amazon CloudWatch Logs. You can use this information for debugging and to analyze progress.

## Use Logs to Monitor an Inference Pipeline
<a name="inference-pipeline-logs"></a>

The following table lists the log groups and log streams that SageMaker AI sends to Amazon CloudWatch.

A *log stream* is a sequence of log events that share the same source. Each separate source of logs into CloudWatch makes up a separate log stream. A *log group* is a group of log streams that share the same retention, monitoring, and access control settings.

**Logs**

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-logs-metrics.html)

**Note**  
SageMaker AI creates the `/aws/sagemaker/NotebookInstances` log group when you create a notebook instance with a lifecycle configuration. For more information, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md).

For more information about SageMaker AI logging, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md). 

# Troubleshoot Inference Pipelines
<a name="inference-pipeline-troubleshoot"></a>

To troubleshoot inference pipeline issues, use CloudWatch logs and error messages. If you are using custom Docker images in a pipeline that includes Amazon SageMaker AI built-in algorithms, you might also encounter permissions problems. To grant the required permissions, create an Amazon Elastic Container Registry (Amazon ECR) policy.

**Topics**
+ [Troubleshoot Amazon ECR Permissions for Inference Pipelines](#inference-pipeline-troubleshoot-permissions)
+ [Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines](#inference-pipeline-troubleshoot-logs)
+ [Use Error Messages to Troubleshoot Inference Pipelines](#inference-pipeline-troubleshoot-errors)

## Troubleshoot Amazon ECR Permissions for Inference Pipelines
<a name="inference-pipeline-troubleshoot-permissions"></a>

When you use custom Docker images in a pipeline that includes [SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon ECR policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). The policy allows your Amazon ECR repository to grant permission for SageMaker AI to pull the image. The policy must grant the following permissions:

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowSageMakerToPull",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines
<a name="inference-pipeline-troubleshoot-logs"></a>

SageMaker AI publishes the container logs for endpoints that deploy an inference pipeline to Amazon CloudWatch at the following path for each container.

```
/aws/sagemaker/Endpoints/{EndpointName}/{Variant}/{InstanceId}/{ContainerHostname}
```

For example, logs for this endpoint are published to the following log groups and streams:

```
EndpointName: MyInferencePipelinesEndpoint
Variant: MyInferencePipelinesVariant
InstanceId: i-0179208609ff7e488
ContainerHostname: MyContainerName1 and MyContainerName2
```

```
logGroup: /aws/sagemaker/Endpoints/MyInferencePipelinesEndpoint
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName1
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName2
```
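
The mapping from endpoint details to log locations is mechanical, so you can compute the names up front, for example when wiring up log subscriptions or queries. A minimal sketch:

```python
def pipeline_log_streams(endpoint_name, variant_name, instance_id,
                         container_hostnames):
    """Return the CloudWatch log group and per-container log stream names
    for an inference pipeline endpoint."""
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    streams = [f"{variant_name}/{instance_id}/{host}"
               for host in container_hostnames]
    return log_group, streams

group, streams = pipeline_log_streams(
    "MyInferencePipelinesEndpoint",
    "MyInferencePipelinesVariant",
    "i-0179208609ff7e488",
    ["MyContainerName1", "MyContainerName2"],
)
# group   == "/aws/sagemaker/Endpoints/MyInferencePipelinesEndpoint"
# streams == ["MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName1",
#             "MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName2"]
```
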

A *log stream* is a sequence of log events that share the same source. Each separate source of logs into CloudWatch makes up a separate log stream. A *log group* is a group of log streams that share the same retention, monitoring, and access control settings.

**To see the log groups and streams**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation page, choose **Logs**.

1. In **Log Groups**, filter on **MyInferencePipelinesEndpoint**:   
![\[The CloudWatch log groups filtered for the inference pipeline endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-log-group-filter.png)

1. To see the log streams, on the CloudWatch **Log Groups** page, choose **MyInferencePipelinesEndpoint**, and then **Search Log Group**.  
![\[The CloudWatch log stream for the inference pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-log-streams-2.png)

For a list of the logs that SageMaker AI publishes, see [Inference Pipeline Logs and Metrics](inference-pipeline-logs-metrics.md).

## Use Error Messages to Troubleshoot Inference Pipelines
<a name="inference-pipeline-troubleshoot-errors"></a>

The inference pipeline error messages indicate which containers failed. 

If an error occurs while SageMaker AI is invoking an endpoint, the service returns a `ModelError` (error code 424), which indicates which container failed. If the request payload (the response from the previous container) exceeds the limit of 5 MB, SageMaker AI provides a detailed error message, such as:

```
Received response from MyContainerName1 with status code 200. However, the request payload from MyContainerName1 to MyContainerName2 is 6000000 bytes, which has exceeded the maximum limit of 5 MB.
```

If a container fails the ping health check while SageMaker AI is creating an endpoint, SageMaker AI returns a `ClientError` that indicates all of the containers that failed the most recent health check.
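
Because the 5 MB limit applies to payloads passed between containers, one mitigation is to check payload sizes client-side before invoking the endpoint. The sketch below assumes 5 MB means 5 × 1024 × 1024 bytes; the commented lines show how it might sit alongside an `invoke_endpoint` call (the endpoint name is a placeholder).

```python
MAX_PIPELINE_PAYLOAD_BYTES = 5 * 1024 * 1024  # the 5 MB limit described above

def check_payload_size(payload: bytes) -> None:
    """Raise before invocation if the payload already exceeds the limit."""
    if len(payload) > MAX_PIPELINE_PAYLOAD_BYTES:
        raise ValueError(
            f"payload is {len(payload)} bytes, which exceeds the "
            f"{MAX_PIPELINE_PAYLOAD_BYTES}-byte limit")

# With AWS credentials configured:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# check_payload_size(body)
# try:
#     runtime.invoke_endpoint(EndpointName="MyInferencePipelinesEndpoint",
#                             ContentType="text/csv", Body=body)
# except runtime.exceptions.ModelError as e:
#     print(e.response["Error"]["Message"])  # names the failing container
```
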

# Delete Endpoints and Resources
<a name="realtime-endpoints-delete-resources"></a>

Delete endpoints to stop incurring charges.

## Delete Endpoint
<a name="realtime-endpoints-delete-endpoint"></a>

Delete your endpoint programmatically using AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console.

SageMaker AI frees up all of the resources that were deployed when the endpoint was created. Deleting an endpoint will not delete the endpoint configuration or the SageMaker AI model. See [Delete Endpoint Configuration](#realtime-endpoints-delete-endpoint-config) and [Delete Model](#realtime-endpoints-delete-model) for information on how to delete your endpoint configuration and SageMaker AI model.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API to delete your endpoint. Specify the name of your endpoint for the `EndpointName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your endpoint
endpoint_name='<endpoint_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
```

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint.html) command to delete your endpoint. Specify the name of your endpoint for the `endpoint-name` flag.

```
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>
```

------
#### [ SageMaker AI Console ]

Delete your endpoint interactively with the SageMaker AI console.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). In the navigation menu, choose **Inference**.

1. Choose **Endpoints** from the dropdown menu. A list of the endpoints created in your AWS account appears, showing each endpoint's name, Amazon Resource Name (ARN), creation time, status, and the time stamp of when it was last updated.

1. Select the endpoint you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

## Delete Endpoint Configuration
<a name="realtime-endpoints-delete-endpoint-config"></a>

Delete your endpoint configuration programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console. Deleting an endpoint configuration does not delete endpoints created using this configuration. See [Delete Endpoint](#realtime-endpoints-delete-endpoint) for information on how to delete your endpoint.

Do not delete an endpoint configuration in use by an endpoint that is live or while the endpoint is being updated or created. You might lose visibility into the instance type the endpoint is using if you delete the endpoint configuration of an endpoint that is active or being created or updated.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpointConfig.html) API to delete your endpoint configuration. Specify the name of your endpoint configuration for the `EndpointConfigName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your endpoint configuration
endpoint_config_name='<endpoint_config_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
```

You can optionally use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API to find the name of the endpoint configuration associated with your endpoint. Provide the name of your endpoint for the `EndpointName` field. 

```
# Specify the name of your endpoint
endpoint_name='<endpoint_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Store DescribeEndpoint response into a variable that we can index in the next step.
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)

# Get the name of the endpoint configuration
endpoint_config_name = response['EndpointConfigName']

# Delete endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
```

For more information about other response elements returned by `DescribeEndpoint`, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) in the [SageMaker API Reference guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Service.html).

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint-config.html) command to delete your endpoint configuration. Specify the name of your endpoint configuration for the `endpoint-config-name` flag.

```
aws sagemaker delete-endpoint-config \
                        --endpoint-config-name <endpoint-config-name>
```

You can optionally use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html) command to return information about your deployed models (production variants), such as the name of your model and the name of the endpoint configuration associated with that deployed model. Provide the name of your endpoint configuration for the `endpoint-config-name` flag.

```
aws sagemaker describe-endpoint-config --endpoint-config-name <endpoint-config-name>
```

This command returns a JSON response. You can read it directly or pass it to a JSON parser to extract the model names (production variants) associated with that endpoint configuration.

------
#### [ SageMaker AI Console ]

Delete your endpoint configuration interactively with the SageMaker AI console.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). In the navigation menu, choose **Inference**.

1. Choose **Endpoint configurations** from the dropdown menu. A list of the endpoint configurations created in your AWS account appears, showing each configuration's name, Amazon Resource Name (ARN), and creation time.

1. Select the endpoint configuration you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

## Delete Model
<a name="realtime-endpoints-delete-model"></a>

Delete your SageMaker AI model programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console. Deleting a SageMaker AI model only deletes the model entry that was created in SageMaker AI. Deleting a model does not delete model artifacts, inference code, or the IAM role that you specified when creating the model.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html) API to delete your SageMaker AI model. Specify the name of your model for the `ModelName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your model
model_name='<model_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete model
sagemaker_client.delete_model(ModelName=model_name)
```

You can optionally use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) API to return information about your deployed models (production variants), such as the name of each model. Provide the name of your endpoint configuration for the `EndpointConfigName` field. 

```
# Specify the name of your endpoint configuration
endpoint_config_name='<endpoint_config_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Store DescribeEndpointConfig response into a variable that we can index in the next step.
response = sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Get the model name and delete the model
model_name = response['ProductionVariants'][0]['ModelName']
sagemaker_client.delete_model(ModelName=model_name)
```

For more information about other response elements returned by `DescribeEndpointConfig`, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) in the [SageMaker API Reference guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Service.html).

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-model.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-model.html) command to delete your SageMaker AI model. Specify the name of your model for the `model-name` flag.

```
aws sagemaker delete-model \
                        --model-name <model-name>
```

You can optionally use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html) command to return information about your deployed models (production variants), such as the name of your model and the name of the endpoint configuration associated with that deployed model. Provide the name of your endpoint configuration for the `endpoint-config-name` flag.

```
aws sagemaker describe-endpoint-config --endpoint-config-name <endpoint-config-name>
```

This command returns a JSON response. You can read it directly or pass it to a JSON parser to extract the name of the model associated with that endpoint configuration.

------
#### [ SageMaker AI Console ]

Delete your SageMaker AI model interactively with the SageMaker AI console.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). In the navigation menu, choose **Inference**.

1. Choose **Models** from the dropdown menu. A list of the models created in your AWS account appears, showing each model's name, Amazon Resource Name (ARN), and creation time.

1. Select the model you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

# Automatic scaling of Amazon SageMaker AI models
<a name="endpoint-auto-scaling"></a>

Amazon SageMaker AI supports automatic scaling (auto scaling) for your hosted models. *Auto scaling* dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

**Topics**
+ [Auto scaling policy overview](endpoint-auto-scaling-policy.md)
+ [Auto scaling prerequisites](endpoint-auto-scaling-prerequisites.md)
+ [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md)
+ [Register a model](endpoint-auto-scaling-add-policy.md)
+ [Define a scaling policy](endpoint-auto-scaling-add-code-define.md)
+ [Apply a scaling policy](endpoint-auto-scaling-add-code-apply.md)
+ [Instructions for editing a scaling policy](endpoint-auto-scaling-edit.md)
+ [Temporarily turn off scaling policies](endpoint-auto-scaling-suspend-scaling-activities.md)
+ [Delete a scaling policy](endpoint-auto-scaling-delete.md)
+ [Check the status of a scaling activity by describing scaling activities](endpoint-scaling-query-history.md)
+ [Scale an endpoint to zero instances](endpoint-auto-scaling-zero-instances.md)
+ [Load testing your auto scaling configuration](endpoint-scaling-loadtest.md)
+ [Use CloudFormation to create a scaling policy](endpoint-scaling-cloudformation.md)
+ [Update endpoints that use auto scaling](endpoint-scaling-update.md)
+ [Delete endpoints configured for auto scaling](endpoint-delete-with-scaling.md)

# Auto scaling policy overview
<a name="endpoint-auto-scaling-policy"></a>

To use auto scaling, you define a scaling policy that adds and removes instances for your production variant in response to actual workloads.

To automatically scale as workload changes occur, you have two options: target tracking and step scaling policies. 

In most cases, we recommend using target tracking scaling policies. With target tracking, you choose an Amazon CloudWatch metric and a target value. Auto scaling creates and manages the CloudWatch alarms for the scaling policy and calculates the scaling adjustment based on the metric and the target value. The policy adds and removes instances as required to keep the metric at, or close to, the specified target value. For example, a scaling policy that uses the predefined `InvocationsPerInstance` metric with a target value of 70 can keep `InvocationsPerInstance` at, or close to, 70. For more information, see [Target tracking scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) in the *Application Auto Scaling User Guide*.

You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. For example, you must use step scaling if you want to enable an endpoint to scale out from zero active instances. For an overview of step scaling policies and how they work, see [Step scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

To create a target tracking scaling policy, you specify the following:
+ **Metric** — The CloudWatch metric to track, such as average number of invocations per instance. 
+ **Target value** — The target value for the metric, such as 70 invocations per instance per minute.

You can create target tracking scaling policies with either predefined metrics or custom metrics. A predefined metric is defined in an enumeration so that you can specify it by name in code or use it in the SageMaker AI console. Alternatively, you can use either the AWS CLI or the Application Auto Scaling API to apply a target tracking scaling policy based on a predefined or custom metric.

Note that scaling activities are performed with cooldown periods between them to prevent rapid fluctuations in capacity. You can optionally configure the cooldown periods for your scaling policy. 
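
As a sketch of how these pieces map onto the Application Auto Scaling API, the following builds a target tracking configuration around the predefined `SageMakerVariantInvocationsPerInstance` metric. The policy, endpoint, and variant names in the comments are placeholders, and the live code only constructs the configuration dictionary.

```python
def target_tracking_config(target_invocations_per_instance=70.0):
    """TargetTrackingScalingPolicyConfiguration that keeps the predefined
    InvocationsPerInstance metric at, or close to, the target value."""
    return {
        "TargetValue": target_invocations_per_instance,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
    }

# With AWS credentials configured:
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.put_scaling_policy(
#     PolicyName="MyInvocationsPolicy",
#     ServiceNamespace="sagemaker",
#     ResourceId="endpoint/MyEndpoint/variant/MyVariant",
#     ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#     PolicyType="TargetTrackingScaling",
#     TargetTrackingScalingPolicyConfiguration=target_tracking_config())
```
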

For more information about the key concepts of auto scaling, see the following section.

## Schedule-based scaling
<a name="scheduled-scaling"></a>

You can also create scheduled actions to perform scaling activities at specific times. You can create scheduled actions that scale one time only or that scale on a recurring schedule. After a scheduled action runs, your scaling policy can continue to make decisions about whether to scale dynamically as workload changes occur. Scheduled scaling can be managed only from the AWS CLI or the Application Auto Scaling API. For more information, see [Scheduled scaling](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html) in the *Application Auto Scaling User Guide*.

## Minimum and maximum scaling limits
<a name="endpoint-auto-scaling-target-capacity"></a>

When configuring auto scaling, you must specify your scaling limits before creating a scaling policy. You set limits separately for the minimum and maximum values.

The minimum value must be at least 1, and equal to or less than the value specified for the maximum value.

The maximum value must be equal to or greater than the value specified for the minimum value. SageMaker AI auto scaling does not enforce a limit for this value.

To determine the scaling limits that you need for typical traffic, test your auto scaling configuration with the expected rate of traffic to your model.

If a variant’s traffic becomes zero, SageMaker AI automatically scales in to the minimum number of instances specified. In this case, SageMaker AI emits metrics with a value of zero.

There are three options for specifying the minimum and maximum capacity:

1. Use the console to update the **Minimum instance count** and **Maximum instance count** settings.

1. Use the AWS CLI and include the `--min-capacity` and `--max-capacity` options when running the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) command.

1. Call the [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API and specify the `MinCapacity` and `MaxCapacity` parameters.
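
The three options above all set the same two limits. As an illustrative sketch with the AWS SDK for Python (Boto3), the helper below builds the `RegisterScalableTarget` parameters and also encodes the rule that the minimum must be at least 1 and no greater than the maximum; the endpoint and variant names in the comments are placeholders.

```python
def scalable_target_request(endpoint_name, variant_name,
                            min_capacity, max_capacity):
    """Build RegisterScalableTarget parameters for an endpoint variant,
    validating the minimum/maximum limits described above."""
    if min_capacity < 1 or min_capacity > max_capacity:
        raise ValueError("minimum capacity must be >= 1 and <= maximum")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }

# With AWS credentials configured:
# import boto3
# boto3.client("application-autoscaling").register_scalable_target(
#     **scalable_target_request("MyEndpoint", "MyVariant", 1, 4))
```
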

**Tip**  
You can manually scale out by increasing the minimum value, or manually scale in by decreasing the maximum value.

## Cooldown period
<a name="endpoint-auto-scaling-target-cooldown"></a>

A *cooldown period* is used to protect against over-scaling when your model is scaling in (reducing capacity) or scaling out (increasing capacity). It does this by slowing down subsequent scaling activities until the period expires. Specifically, it blocks the deletion of instances for scale-in requests, and limits the creation of instances for scale-out requests. For more information, see [Define cooldown periods](https://docs.aws.amazon.com/autoscaling/application/userguide/target-tracking-scaling-policy-overview.html#target-tracking-cooldown) in the *Application Auto Scaling User Guide*. 

You configure the cooldown period in your scaling policy. 

If you don't specify a scale-in or a scale-out cooldown period, your scaling policy uses the default, which is 300 seconds for each.

If instances are being added or removed too quickly when you test your scaling configuration, consider increasing this value. You might see this behavior if the traffic to your model has a lot of spikes, or if you have multiple scaling policies defined for a variant.

If instances are not being added quickly enough to address increased traffic, consider decreasing this value.
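As a rough mental model, a cooldown period acts like a gate that ignores further scaling activity in the same direction until the period has elapsed. The following hypothetical sketch illustrates only that timing behavior, not the actual Application Auto Scaling implementation (which, for example, can still scale out during a cooldown if a larger increase is needed):

```python
class CooldownGate:
    """Illustrative gate: allow a scaling action only if the previous one
    in the same direction happened at least the cooldown seconds ago."""

    def __init__(self, scale_in_cooldown=300, scale_out_cooldown=300):
        self.cooldowns = {"in": scale_in_cooldown, "out": scale_out_cooldown}
        self.last_action = {"in": None, "out": None}

    def allow(self, direction, now):
        last = self.last_action[direction]
        if last is None or now - last >= self.cooldowns[direction]:
            self.last_action[direction] = now
            return True
        return False

gate = CooldownGate(scale_in_cooldown=600, scale_out_cooldown=300)
print(gate.allow("out", now=0))    # True: first scale-out proceeds
print(gate.allow("out", now=120))  # False: still inside the 300 s cooldown
print(gate.allow("out", now=310))  # True: cooldown has expired
```

Increasing the cooldown values widens the gate's blocking window, which is why longer cooldowns damp rapid add/remove cycles during spiky traffic.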

## Related resources
<a name="auto-scaling-related-resources"></a>

For more information about configuring auto scaling, see the following resources:
+ [application-autoscaling](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling) section of the *AWS CLI Command Reference*
+ [Application Auto Scaling API Reference](https://docs.aws.amazon.com/autoscaling/application/APIReference/)
+ [Application Auto Scaling User Guide](https://docs.aws.amazon.com/autoscaling/application/userguide/)

**Note**  
SageMaker AI recently introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker AI endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker AI hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see [SageMaker AI adds new inference capabilities to help reduce foundation model deployment costs and latency](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/) and [Reduce model deployment costs by 50% on average using the latest features of SageMaker AI](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/) on the AWS Blog.

# Auto scaling prerequisites
<a name="endpoint-auto-scaling-prerequisites"></a>

Before you can use auto scaling, you must have already created an Amazon SageMaker AI model endpoint. You can have multiple model versions for the same endpoint. Each model is referred to as a [production (model) variant](model-ab-testing.md). For more information about deploying a model endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

To activate auto scaling for a model, you can use the SageMaker AI console, the AWS Command Line Interface (AWS CLI), or an AWS SDK through the Application Auto Scaling API. 
+ If this is your first time configuring scaling for a model, we recommend you [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md). 
+ When you use the AWS CLI or the Application Auto Scaling API, the flow is to register the model as a scalable target, define the scaling policy, and then apply it. You must specify both the endpoint name and the variant name to activate auto scaling for a model. To find these names, open the SageMaker AI console, choose **Endpoints** under **Inference** in the navigation pane, and then choose your endpoint to see its variants.

Auto scaling is made possible by a combination of the Amazon SageMaker AI, Amazon CloudWatch, and Application Auto Scaling APIs. For information about the minimum required permissions, see [Application Auto Scaling identity-based policy examples](https://docs.aws.amazon.com/autoscaling/application/userguide/security_iam_id-based-policy-examples.html) in the *Application Auto Scaling User Guide*.

The `SagemakerFullAccessPolicy` IAM policy has all the IAM permissions required to perform auto scaling. For more information about SageMaker AI IAM permissions, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

If you manage your own permission policy, you must include the following permissions:

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:DescribeEndpointConfig",
        "sagemaker:UpdateEndpointWeightsAndCapacities"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "application-autoscaling:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
      "Condition": {
        "StringLike": { "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com" }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DeleteAlarms"
      ],
      "Resource": "*"
    }
  ]
}
```

------

## Service-linked role
<a name="endpoint-auto-scaling-slr"></a>

Auto scaling uses the `AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint` service-linked role. This service-linked role grants Application Auto Scaling permission to describe the alarms for your policies, to monitor current capacity levels, and to scale the target resource. This role is created for you automatically. For automatic role creation to succeed, you must have permission for the `iam:CreateServiceLinkedRole` action. For more information, see [Service-linked roles](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-service-linked-roles.html) in the *Application Auto Scaling User Guide*.

# Configure model auto scaling with the console
<a name="endpoint-auto-scaling-add-console"></a>

**To configure auto scaling for a model (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the navigation pane, choose **Inference**, and then choose **Endpoints**. 

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. On the **Configure variant automatic scaling** page, for **Variant automatic scaling**, do the following:

   1. For **Minimum instance count**, type the minimum number of instances that you want the scaling policy to maintain. At least 1 instance is required.

   1. For **Maximum instance count**, type the maximum number of instances that you want the scaling policy to maintain.

1. For **Built-in scaling policy**, do the following:

   1. For the **Target metric**, `SageMakerVariantInvocationsPerInstance` is automatically selected for the metric and cannot be changed.

   1. For the **Target value**, type the average number of invocations per instance per minute for the model. To determine this value, follow the guidelines in [Load testing](endpoint-scaling-loadtest.md).

   1. (Optional) For **Scale-in cool down (seconds)** and **Scale-out cool down (seconds)**, enter the amount of time, in seconds, for each cool down period.

   1. (Optional) Select **Disable scale in** if you don’t want auto scaling to terminate instances when traffic decreases.

1. Choose **Save**.

This procedure registers a model as a scalable target with Application Auto Scaling. When you register a model, Application Auto Scaling performs validation checks to ensure the following:
+ The model exists
+ The permissions are sufficient
+ You aren't registering a variant that runs on a burstable performance instance, such as T2
**Note**  
SageMaker AI doesn't support auto scaling for burstable instances such as T2, because they already allow for increased capacity under increased workloads. For information about burstable performance instances, see [Amazon EC2 instance types](https://aws.amazon.com/ec2/instance-types/).

# Register a model
<a name="endpoint-auto-scaling-add-policy"></a>

Before you add a scaling policy to your model, you first must register your model for auto scaling and define the scaling limits for the model.

The following procedures cover how to register a model (production variant) for auto scaling using the AWS Command Line Interface (AWS CLI) or Application Auto Scaling API.

**Topics**
+ [Register a model (AWS CLI)](#endpoint-auto-scaling-add-cli)
+ [Register a model (Application Auto Scaling API)](#endpoint-auto-scaling-add-api)

## Register a model (AWS CLI)
<a name="endpoint-auto-scaling-add-cli"></a>

To register your production variant, use the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) command with the following parameters:
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--resource-id`—The resource identifier for the model (specifically, the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--min-capacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `max-capacity`.
+ `--max-capacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `min-capacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to have one to eight instances.  

```
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 \
  --max-capacity 8
```

## Register a model (Application Auto Scaling API)
<a name="endpoint-auto-scaling-add-api"></a>

To register your model with Application Auto Scaling, use the [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) Application Auto Scaling API action with the following parameters:
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceID`—The resource identifier for the production variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `MinCapacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `MaxCapacity`.
+ `MaxCapacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `MinCapacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to use one to eight instances.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.RegisterScalableTarget
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8
}
```
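If you use the AWS SDK for Python (Boto3), the same registration maps to the `register_scalable_target` method of the `application-autoscaling` client. The sketch below only assembles the request parameters; sending it requires credentials, a configured region, and an existing endpoint, so the actual call is shown in a comment:

```python
def variant_resource_id(endpoint_name, variant_name):
    """Build the Application Auto Scaling resource ID for a production variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

# Request parameters mirroring the RegisterScalableTarget example above.
# With boto3 installed and credentials configured, you would send this with:
#   boto3.client("application-autoscaling").register_scalable_target(**request)
request = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": variant_resource_id("my-endpoint", "my-variant"),
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8,
}
print(request["ResourceId"])  # endpoint/my-endpoint/variant/my-variant
```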

# Define a scaling policy
<a name="endpoint-auto-scaling-add-code-define"></a>

Before you add a scaling policy to your model, save your policy configuration as a JSON block in a text file. You use that text file when invoking the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. You can optimize scaling by choosing an appropriate CloudWatch metric. However, before using a custom metric in production, you must test auto scaling with your custom metric.

**Topics**
+ [Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)](#endpoint-auto-scaling-add-code-predefined)
+ [Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)](#endpoint-auto-scaling-add-code-high-res)
+ [Define a custom metric (CloudWatch metric: CPUUtilization)](#endpoint-auto-scaling-add-code-custom)
+ [Define a custom metric (CloudWatch metric: ExplanationsPerInstance)](#endpoint-auto-scaling-online-explainability)
+ [Specify cooldown periods](#endpoint-auto-scaling-add-code-cooldown)

This section shows you example policy configurations for target tracking scaling policies.

## Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)
<a name="endpoint-auto-scaling-add-code-predefined"></a>

**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)
<a name="endpoint-auto-scaling-add-code-high-res"></a>

With the following high-resolution CloudWatch metrics, you can set scaling policies for the volume of concurrent requests that your models receive:

**ConcurrentRequestsPerModel**  
The number of concurrent requests being received by a model container.

**ConcurrentRequestsPerCopy**  
The number of concurrent requests being received by an inference component.

These metrics track the number of simultaneous requests that your model containers handle, including the requests that are queued inside the containers. For models that send their inference response as a stream of tokens, these metrics track each request until the model sends the last token for the request.

As high-resolution metrics, they emit data more frequently than standard CloudWatch metrics. Standard metrics, such as the `InvocationsPerInstance` metric, emit data once every minute. However, these high-resolution metrics emit data every 10 seconds. Therefore, as the concurrent traffic to your models increases, your policy reacts by scaling out much more quickly than it would for standard metrics. However, as the traffic to your models decreases, your policy scales in at the same speed as it would for standard metrics.
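The practical effect of the 10-second emission interval is that a policy sees roughly six times as many fresh datapoints per minute. This toy calculation only illustrates the emission cadence; how quickly a policy actually reacts also depends on CloudWatch alarm settings not shown here:

```python
def datapoints_in_window(window_seconds, emission_interval_seconds):
    """Number of metric datapoints emitted within an evaluation window."""
    return window_seconds // emission_interval_seconds

# In a 60-second window, a standard metric such as InvocationsPerInstance
# emits 1 datapoint, while a high-resolution metric emits 6.
print(datapoints_in_window(60, 60))  # 1
print(datapoints_in_window(60, 10))  # 6
```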

The following is an example target tracking policy configuration that adds instances if the number of concurrent requests per model exceeds 5. Save this configuration in a file named `config.json`.

```
{
    "TargetValue": 5.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
    }
}
```

If you use inference components to deploy multiple models to the same endpoint, you can create an equivalent policy. In that case, set `PredefinedMetricType` to `SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution`.

For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Define a custom metric (CloudWatch metric: CPUUtilization)
<a name="endpoint-auto-scaling-add-code-custom"></a>

To create a target tracking scaling policy with a custom metric, specify the metric's name, namespace, unit, statistic, and zero or more dimensions. A dimension consists of a dimension name and a dimension value. You can use any production variant metric that changes in proportion to capacity. 

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy scales the variant based on an average CPU utilization of 50 percent across all instances. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 50.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```
For more information, see [CustomizedMetricSpecification](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Define a custom metric (CloudWatch metric: ExplanationsPerInstance)
<a name="endpoint-auto-scaling-online-explainability"></a>

When the endpoint has online explainability activated, it emits an `ExplanationsPerInstance` metric that reports the average number of records explained per minute, per instance, for a variant. The resource utilization of explaining records can differ significantly from that of predicting records. We strongly recommend using this metric for target tracking scaling of endpoints with online explainability activated.

You can create multiple target tracking policies for a scalable target. Consider adding the `InvocationsPerInstance` policy from the [Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)](#endpoint-auto-scaling-add-code-predefined) section (in addition to the `ExplanationsPerInstance` policy). If most invocations don't return an explanation because of the threshold value set in the `EnableExplanations` parameter, then the endpoint can choose the `InvocationsPerInstance` policy. If there is a large number of explanations, the endpoint can use the `ExplanationsPerInstance` policy. 
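When multiple target tracking policies are attached to the same scalable target, Application Auto Scaling follows the policy that calls for the largest capacity, which keeps the variant sized for whichever workload dominates. The following sketch illustrates that selection rule; the per-policy estimates use a simplified proportional formula, and the metric values are illustrative:

```python
import math

def policy_estimate(current, metric_value, target):
    """Simplified per-policy estimate of the capacity needed to keep
    the metric at its target."""
    return math.ceil(current * metric_value / target)

def combined_desired(current, policies):
    """Scale to the largest capacity that any attached policy recommends."""
    return max(policy_estimate(current, m, t) for m, t in policies)

# 2 instances; invocations suggest 3 instances, explanations suggest 5,
# so the explanation-heavy workload drives the scaling decision.
policies = [
    (105.0, 70.0),  # (InvocationsPerInstance value, target 70)
    (45.0, 20.0),   # (ExplanationsPerInstance value, target 20)
]
print(combined_desired(2, policies))  # 5
```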

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy adjusts the number of variant instances so that each instance has an `ExplanationsPerInstance` metric of 20. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 20.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "ExplanationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Sum"
    }
}
```

For more information, see [CustomizedMetricSpecification](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Specify cooldown periods
<a name="endpoint-auto-scaling-add-code-cooldown"></a>

You can optionally define cooldown periods in your target tracking scaling policy by specifying the `ScaleOutCooldown` and `ScaleInCooldown` parameters. 

**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. The policy configuration provides a scale-in cooldown period of 10 minutes (600 seconds) and a scale-out cooldown period of 5 minutes (300 seconds). Save this configuration in a file named `config.json`.   

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*. 

# Apply a scaling policy
<a name="endpoint-auto-scaling-add-code-apply"></a>

After you register your model and define a scaling policy, apply the scaling policy to the registered model. This section shows how to apply a scaling policy using the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. 

**Topics**
+ [Apply a target tracking scaling policy (AWS CLI)](#endpoint-auto-scaling-add-code-apply-cli)
+ [Apply a scaling policy (Application Auto Scaling API)](#endpoint-auto-scaling-add-code-apply-api)

## Apply a target tracking scaling policy (AWS CLI)
<a name="endpoint-auto-scaling-add-code-apply-cli"></a>

To apply a scaling policy to your model, use the [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) AWS CLI command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--policy-type`—Set this value to `TargetTrackingScaling`.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--target-tracking-scaling-policy-configuration`—The target-tracking scaling policy configuration to use for the model.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. For the `--target-tracking-scaling-policy-configuration` option, specify the `config.json` file that you created previously.   

```
aws application-autoscaling put-scaling-policy \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --target-tracking-scaling-policy-configuration file://config.json
```

## Apply a scaling policy (Application Auto Scaling API)
<a name="endpoint-auto-scaling-add-code-apply-api"></a>

To apply a scaling policy to a variant with the Application Auto Scaling API, use the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceID`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `PolicyType`—Set this value to `TargetTrackingScaling`.
+ `TargetTrackingScalingPolicyConfiguration`—The target-tracking scaling policy configuration to use for the variant.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. The policy configuration keeps the average invocations per instance at 70.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification":
        {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }
}
```
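In the AWS SDK for Python (Boto3), the equivalent call is `put_scaling_policy` on the `application-autoscaling` client. The sketch below only assembles the request and assumes the scalable target was registered first; the actual call is shown in a comment because it requires credentials and existing resources:

```python
# With boto3 installed and credentials configured, you would send this with:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy_request)
policy_request = {
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
}
print(policy_request["PolicyType"])  # TargetTrackingScaling
```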

# Instructions for editing a scaling policy
<a name="endpoint-auto-scaling-edit"></a>

After creating a scaling policy, you can edit any of its settings except the name.

 To edit a target tracking scaling policy with the AWS Management Console, use the same procedure that you used to [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md).

You can use the AWS CLI or the Application Auto Scaling API to edit a scaling policy in the same way that you create a new scaling policy. For more information, see [Apply a scaling policy](endpoint-auto-scaling-add-code-apply.md).

# Temporarily turn off scaling policies
<a name="endpoint-auto-scaling-suspend-scaling-activities"></a>

After you configure auto scaling, you have the following options if you need to investigate an issue without interference from scaling policies (dynamic scaling):
+ Temporarily suspend and then resume scaling activities by calling the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) CLI command or [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API action, specifying a Boolean value for both `DynamicScalingInSuspended` and `DynamicScalingOutSuspended`.   
**Example**  

  The following example shows how to suspend scaling policies for a variant named `my-variant`, running on the `my-endpoint` endpoint.

  ```
  aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --suspended-state '{"DynamicScalingInSuspended":true,"DynamicScalingOutSuspended":true}'
  ```
+ Prevent specific target tracking scaling policies from scaling in your variant by disabling the policy's scale-in portion. This method prevents the scaling policy from deleting instances, while still allowing it to create them as needed.

  Temporarily disable and then enable scale-in activities by editing the policy using the [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) CLI command or the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API action, specifying a Boolean value for `DisableScaleIn`.  
**Example**  

  The following is an example of a target tracking configuration for a scaling policy that will scale out but not scale in. 

  ```
  {
      "TargetValue": 70.0,
      "PredefinedMetricSpecification":
      {
          "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
      },
      "DisableScaleIn": true
  }
  ```

# Delete a scaling policy
<a name="endpoint-auto-scaling-delete"></a>

If you no longer need a scaling policy, you can delete it at any time.

**Topics**
+ [Delete all scaling policies and deregister the model (console)](#endpoint-auto-scaling-delete-console)
+ [Delete a scaling policy (AWS CLI or Application Auto Scaling API)](#endpoint-auto-scaling-delete-code)

## Delete all scaling policies and deregister the model (console)
<a name="endpoint-auto-scaling-delete-console"></a>

**To delete all scaling policies and deregister the variant as a scalable target**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the navigation pane, choose **Endpoints**.

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. Choose **Deregister auto scaling**.

## Delete a scaling policy (AWS CLI or Application Auto Scaling API)
<a name="endpoint-auto-scaling-delete-code"></a>

You can use the AWS CLI or the Application Auto Scaling API to delete a scaling policy from a variant.

### Delete a scaling policy (AWS CLI)
<a name="endpoint-auto-scaling-delete-code-cli"></a>

To delete a scaling policy from a variant, use the [delete-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/delete-scaling-policy.html) command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
aws application-autoscaling delete-scaling-policy \
  --policy-name my-scaling-policy \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount
```

### Delete a scaling policy (Application Auto Scaling API)
<a name="endpoint-auto-scaling-delete-code-api"></a>

To delete a scaling policy from your variant, use the [DeleteScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeleteScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceID`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeleteScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount"
}
```
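If you script these operations with the AWS SDK for Python (Boto3), the same action is available as `delete_scaling_policy` on the `application-autoscaling` client. The following is a minimal sketch that builds the request parameters; the endpoint, variant, and policy names are placeholders, and the helper function is our own.

```python
# Sketch: parameters for DeleteScalingPolicy, built for a Boto3 call.
# Endpoint, variant, and policy names below are placeholders.

def build_delete_scaling_policy_params(endpoint_name, variant_name, policy_name):
    """Build the request parameters for the DeleteScalingPolicy action."""
    return {
        "PolicyName": policy_name,
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }

params = build_delete_scaling_policy_params(
    "my-endpoint", "my-variant", "my-scaling-policy"
)
# To run against your account:
# boto3.client("application-autoscaling").delete_scaling_policy(**params)
print(params["ResourceId"])  # endpoint/my-endpoint/variant/my-variant
```

The parameter names match the JSON request shape shown above, so the same dict works for the CLI `--cli-input-json` option as well.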

# Check the status of a scaling activity by describing scaling activities
<a name="endpoint-scaling-query-history"></a>

You can check the status of a scaling activity for your auto scaled endpoint by describing scaling activities. Application Auto Scaling provides descriptive information about the scaling activities in the specified namespace from the previous six weeks. For more information, see [Scaling activities for Application Auto Scaling](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scaling-activities.html) in the *Application Auto Scaling User Guide*.

To check the status of a scaling activity, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command. You can't check the status of a scaling activity using the console.

**Topics**
+ [Describe scaling activities (AWS CLI)](#endpoint-how-to)
+ [Identify blocked scaling activities from instance quotas (AWS CLI)](#endpoint-identify-blocked-autoscaling)

## Describe scaling activities (AWS CLI)
<a name="endpoint-how-to"></a>

To describe scaling activities for all SageMaker AI resources that are registered with Application Auto Scaling, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command, specifying `sagemaker` for the `--service-namespace` option.

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker
```

To describe scaling activities for a specific resource, include the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant
```

The following example shows the output produced when you run this command.

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "string",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "string",
    "StatusMessage": "string"
}
```

## Identify blocked scaling activities from instance quotas (AWS CLI)
<a name="endpoint-identify-blocked-autoscaling"></a>

When you scale out (add more instances), you might reach your account-level instance quota. You can use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command to check whether you have reached your instance quota. When you exceed your quota, auto scaling is blocked. 

To check if you have reached your instance quota, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command and specify the resource ID for the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant
```

Within the returned output, check the [StatusCode](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusCode) and [StatusMessage](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusMessage) keys and their associated values. If scaling is blocked because you reached your instance quota, `StatusCode` returns `Failed`, and `StatusMessage` contains a message indicating that the account-level service quota was reached. The following is an example of what that message might look like: 

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "minimum capacity was set to 110",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "Failed",
    "StatusMessage": "Failed to set desired instance count to 110. Reason: The 
    account-level service limit 'ml.xx.xxxxxx for endpoint usage' is 1000 
    Instances, with current utilization of 997 Instances and a request delta 
    of 20 Instances. Please contact AWS support to request an increase for this 
    limit. (Service: AmazonSageMaker; Status Code: 400; 
    Error Code: ResourceLimitExceeded; Request ID: request-id)."
}
```
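In an automated check, you might filter the activity records returned by `describe_scaling_activities` (Boto3) for failures caused by the quota. The following minimal sketch filters a list of activity records for that condition; the sample record is adapted from the example output above, and the helper name is our own.

```python
# Sketch: find scaling activities that failed because an instance quota
# (ResourceLimitExceeded) blocked the scale-out.

def find_quota_blocked(activities):
    """Return activities that failed with a ResourceLimitExceeded error."""
    return [
        a for a in activities
        if a.get("StatusCode") == "Failed"
        and "ResourceLimitExceeded" in a.get("StatusMessage", "")
    ]

# Sample record adapted from the example output above.
sample_activities = [{
    "ActivityId": "activity-id",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "StatusCode": "Failed",
    "StatusMessage": (
        "Failed to set desired instance count to 110. Reason: The "
        "account-level service limit is reached. (Service: AmazonSageMaker; "
        "Status Code: 400; Error Code: ResourceLimitExceeded; "
        "Request ID: request-id)."
    ),
}]

blocked = find_quota_blocked(sample_activities)
print(len(blocked))  # 1
```

In a real script, the input list would come from paginating the `ScalingActivities` field of the `describe_scaling_activities` response.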

# Scale an endpoint to zero instances
<a name="endpoint-auto-scaling-zero-instances"></a>

When you set up auto scaling for an endpoint, you can allow the scale-in process to reduce the number of in-service instances to zero. By doing so, you save costs during periods when your endpoint isn't serving inference requests and therefore doesn't require any active instances. 

However, after scaling in to zero instances, your endpoint can't respond to any incoming inference requests until it provisions at least one instance. To automate the provisioning process, you create a step scaling policy with Application Auto Scaling. Then, you assign the policy to an Amazon CloudWatch alarm.

After you set up the step scaling policy and the alarm, your endpoint will automatically provision an instance soon after it receives an inference request that it can't respond to. Be aware that the provisioning process takes several minutes. During that time, any attempts to invoke the endpoint will produce an error.

The following procedures explain how to set up auto scaling for an endpoint so that it scales in to, and out from, zero instances. The procedures use commands with the AWS CLI.

**Before you begin**

Before your endpoint can scale in to, and out from, zero instances, it must meet the following requirements:
+ It is in service.
+ It hosts one or more inference components. An endpoint can scale to and from zero instances only if it hosts inference components.

  For information about hosting inference components on SageMaker AI endpoints, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
+ In the endpoint configuration, for the production variant `ManagedInstanceScaling` object, you've set the `MinInstanceCount` parameter to `0`.

  For reference information about this parameter, see [ProductionVariantManagedInstanceScaling](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariantManagedInstanceScaling.html).

**To enable an endpoint to scale in to zero instances (AWS CLI)**

For each inference component that the endpoint hosts, do the following:

1. Register the inference component as a scalable target. When you register it, set the minimum capacity to `0`, as shown by the following command:

   ```
   aws application-autoscaling register-scalable-target \
     --service-namespace sagemaker \
     --resource-id inference-component/inference-component-name \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --min-capacity 0 \
     --max-capacity n
   ```

   In this example, replace *inference-component-name* with the name of your inference component. Replace *n* with the maximum number of inference component copies to provision when scaling out.

   For more information about this command and each of its parameters, see [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) in the *AWS CLI Command Reference*.

1. Apply a target tracking policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type TargetTrackingScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --target-tracking-scaling-policy-configuration file://config.json
   ```

   In this example, replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a target tracking policy configuration, such as the following:

   ```
   {
     "PredefinedMetricSpecification": {
         "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
     },
     "TargetValue": 1,
     "ScaleInCooldown": 300,
     "ScaleOutCooldown": 300
   }
   ```

   For more example tracking policy configurations, see [Define a scaling policy](endpoint-auto-scaling-add-code-define.md).

   For more information about this command and each of its parameters, see [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) in the *AWS CLI Command Reference*.

**To enable an endpoint to scale out from zero instances (AWS CLI)**

For each inference component that the endpoint hosts, do the following:

1. Apply a step scaling policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type StepScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --step-scaling-policy-configuration file://config.json
   ```

   In this example, replace *my-scaling-policy* with a unique name for your policy. Replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a step scaling policy configuration, such as the following:

   ```
   {
       "AdjustmentType": "ChangeInCapacity",
       "MetricAggregationType": "Maximum",
       "Cooldown": 60,
       "StepAdjustments":
         [
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
         ]
   }
   ```

   When this step scaling policy is triggered, SageMaker AI provisions the necessary instances to support the inference component copies.

   After you create the step scaling policy, take note of its Amazon Resource Name (ARN). You need the ARN for the CloudWatch alarm in the next step.

   For more information about step scaling policies, see [Step scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

1. Create a CloudWatch alarm and assign the step scaling policy to it, as shown by the following example:

   ```
   aws cloudwatch put-metric-alarm \
   --alarm-actions step-scaling-policy-arn \
   --alarm-description "Alarm when SM IC endpoint invoked that has 0 instances." \
   --alarm-name ic-step-scaling-alarm \
   --comparison-operator GreaterThanThreshold  \
   --datapoints-to-alarm 1 \
   --dimensions "Name=InferenceComponentName,Value=inference-component-name" \
   --evaluation-periods 1 \
   --metric-name NoCapacityInvocationFailures \
   --namespace AWS/SageMaker \
   --period 60 \
   --statistic Sum \
   --threshold 1
   ```

   In this example, replace *step-scaling-policy-arn* with the ARN of your step scaling policy. Replace *ic-step-scaling-alarm* with a name of your choice. Replace *inference-component-name* with the name of your inference component. 

   This example sets the `--metric-name` parameter to `NoCapacityInvocationFailures`. SageMaker AI emits this metric when an endpoint receives an inference request, but the endpoint has no active instances to serve the request. When that event occurs, the alarm initiates the step scaling policy in the previous step.

   For more information about this command and each of its parameters, see [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) in the *AWS CLI Command Reference*.
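The same alarm can be expressed as Boto3 `put_metric_alarm` parameters. The following sketch builds that parameter dict; the policy ARN, alarm name, and inference component name are placeholders, and the helper function is our own.

```python
# Sketch: the CloudWatch alarm above, expressed as Boto3 put_metric_alarm
# parameters. The ARN, alarm name, and component name are placeholders.

def build_alarm_params(step_scaling_policy_arn, inference_component_name):
    """Build put_metric_alarm parameters for the NoCapacityInvocationFailures alarm."""
    return {
        "AlarmName": "ic-step-scaling-alarm",
        "AlarmDescription": "Alarm when SM IC endpoint invoked that has 0 instances.",
        "AlarmActions": [step_scaling_policy_arn],
        "MetricName": "NoCapacityInvocationFailures",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "InferenceComponentName", "Value": inference_component_name}
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = build_alarm_params("step-scaling-policy-arn", "inference-component-name")
# To create the alarm in your account:
# boto3.client("cloudwatch").put_metric_alarm(**params)
```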

# Load testing your auto scaling configuration
<a name="endpoint-scaling-loadtest"></a>

Perform load tests to choose a scaling configuration that works the way you want.

The following guidelines for load testing assume you are using a scaling policy that uses the predefined target metric `SageMakerVariantInvocationsPerInstance`.

**Topics**
+ [Determine the performance characteristics](#endpoint-scaling-loadtest-variant)
+ [Calculate the target load](#endpoint-scaling-loadtest-calc)

## Determine the performance characteristics
<a name="endpoint-scaling-loadtest-variant"></a>

Perform load testing to find the peak `InvocationsPerInstance` that your model's production variant can handle, and the latency of requests, as concurrency increases.

This value depends on the instance type chosen, payloads that clients of your model typically send, and the performance of any external dependencies your model has.

**To find the peak requests-per-second (RPS) your model's production variant can handle and latency of requests**

1. Set up an endpoint with your model using a single instance. For information about how to set up an endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

1. Use a load testing tool to generate an increasing number of parallel requests, and monitor the RPS and model latency in the output of the load testing tool. 
**Note**  
You can also monitor requests-per-minute instead of RPS. In that case, don't multiply by 60 in the equation for calculating `SageMakerVariantInvocationsPerInstance` shown below.

   The RPS at which model latency starts to increase, or the proportion of successful transactions starts to decrease, is the peak RPS that your model can handle.

## Calculate the target load
<a name="endpoint-scaling-loadtest-calc"></a>

After you find the performance characteristics of the variant, you can determine the maximum RPS that should be sent to an instance. The threshold used for scaling must be less than this maximum value. Use the following equation in combination with load testing to determine the correct value for the `SageMakerVariantInvocationsPerInstance` target metric in your scaling configuration.

```
SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60
```

Where `MAX_RPS` is the maximum RPS that you determined previously, and `SAFETY_FACTOR` is the safety factor that you chose to ensure that your clients don't exceed the maximum RPS. Multiply by 60 to convert from RPS to invocations-per-minute to match the per-minute CloudWatch metric that SageMaker AI uses to implement auto scaling (you don't need to do this if you measured requests-per-minute instead of requests-per-second).

**Note**  
SageMaker AI recommends that you start testing with a `SAFETY_FACTOR` of 0.5. Test your scaling configuration to ensure it operates in the way you expect with your model for both increasing and decreasing customer traffic on your endpoint.
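As a worked example of the equation above, suppose load testing found a peak of 20 RPS and you start with the recommended safety factor of 0.5; the target tracking value is then 600 invocations per minute per instance. A minimal sketch (the function name is our own):

```python
# Worked example of the SageMakerVariantInvocationsPerInstance equation above.

def invocations_per_instance(max_rps, safety_factor=0.5):
    """Convert a measured peak RPS into a per-minute target tracking value."""
    return max_rps * safety_factor * 60

# Peak of 20 RPS with the recommended starting safety factor of 0.5.
target = invocations_per_instance(max_rps=20, safety_factor=0.5)
print(target)  # 600.0
```

If you measured requests-per-minute instead of RPS, drop the factor of 60 from the function.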

# Use CloudFormation to create a scaling policy
<a name="endpoint-scaling-cloudformation"></a>

The following example shows how to configure model auto scaling on an endpoint using CloudFormation.

```
  Endpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointName: yourEndpointName
      EndpointConfigName: yourEndpointConfigName

  ScalingTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      ResourceId: endpoint/my-endpoint/variant/my-variant
      RoleARN: arn
      ScalableDimension: sagemaker:variant:DesiredInstanceCount
      ServiceNamespace: sagemaker

  ScalingPolicy:
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: my-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId:
        Ref: ScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0
        ScaleInCooldown: 600
        ScaleOutCooldown: 30
        PredefinedMetricSpecification:
          PredefinedMetricType: SageMakerVariantInvocationsPerInstance
```

For more information, see [Create Application Auto Scaling resources with AWS CloudFormation](https://docs.aws.amazon.com/autoscaling/application/userguide/creating-resources-with-cloudformation.html) in the *Application Auto Scaling User Guide*.

# Update endpoints that use auto scaling
<a name="endpoint-scaling-update"></a>

When you update an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If the update would change the instance type for any model that is a target for auto scaling, the update fails. 

In the AWS Management Console, you see a warning that you must deregister the model from auto scaling before you can update it. If you are trying to update the endpoint by calling the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, the call fails. Before you update the endpoint, delete any scaling policies configured for it and deregister the variant as a scalable target by calling the [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) Application Auto Scaling API action. After you update the endpoint, you can register the updated variant as a scalable target and attach a scaling policy.

There is one exception. If you change the model for a variant that is configured for auto scaling, Amazon SageMaker AI auto scaling allows the update. This is because changing the model doesn't typically affect performance enough to change scaling behavior. If you do update a model for a variant configured for auto scaling, ensure that the change to the model doesn't significantly affect performance and scaling behavior.

When you update SageMaker AI endpoints that have auto scaling applied, complete the following steps:

**To update an endpoint that has auto scaling applied**

1. Deregister the endpoint as a scalable target by calling [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html).

1. Because auto scaling is blocked while the update operation is in progress (or if you turned off auto scaling in the previous step), you might want to take the additional precaution of increasing the number of instances for your endpoint during the update. To do this, update the instance counts for the production variants hosted at the endpoint by calling [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html).

1. Call [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) repeatedly until the value of the `EndpointStatus` field of the response is `InService`.

1. Call [DescribeEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) to get the values of the current endpoint config.

1. Create a new endpoint config by calling [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). For the production variants where you want to keep the existing instance count or weight, use the same variant names from the [DescribeEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) response in the previous step. For all other values, use the values from that same response.

1. Update the endpoint by calling [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html). Specify the endpoint config that you created in the previous step for the `EndpointConfigName` field. To retain the variant properties, such as instance count or weight, set the `RetainAllVariantProperties` parameter to `True`. This specifies that production variants with the same name are updated with the most recent `DesiredInstanceCount` from the `DescribeEndpoint` response, regardless of the values of the `InitialInstanceCount` field in the new endpoint config.

1. (Optional) Re-activate auto scaling by calling [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) and [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html).

**Note**  
Steps 1 and 6 are required only if you are updating an endpoint with the following changes:  
+ Changing the instance type for a production variant that has auto scaling configured
+ Removing a production variant that has auto scaling configured
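The deregister-and-update portion of the flow above can be sketched with Boto3 as follows. This is a minimal sketch with our own function name: client construction, the interim capacity increase, and re-registering the scaling policy are omitted, and all resource names are placeholders.

```python
import time

# Sketch: deregister the variant's scaling target, wait for the endpoint to be
# InService, then update it with a new endpoint config while retaining the
# current variant instance counts and weights.

def update_endpoint_with_autoscaling(sagemaker_client, autoscaling_client,
                                     endpoint_name, variant_name,
                                     new_config_name):
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

    # Deregister the variant as a scalable target (also removes its policies).
    autoscaling_client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )

    # Wait until the endpoint is InService before updating it.
    while (sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
           ["EndpointStatus"] != "InService"):
        time.sleep(30)

    # Apply the new endpoint config, retaining current variant properties.
    sagemaker_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
        RetainAllVariantProperties=True,
    )
```

In practice, `sagemaker_client` and `autoscaling_client` would be `boto3.client("sagemaker")` and `boto3.client("application-autoscaling")`.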

# Delete endpoints configured for auto scaling
<a name="endpoint-delete-with-scaling"></a>

If you delete an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If any are and you have permission to deregister the model, Application Auto Scaling deregisters those models as scalable targets without notifying you. If you use a custom permission policy that doesn't provide permission for the [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) action, you must request access to this action before deleting the endpoint.

**Note**  
As an IAM user, you might not have sufficient permission to delete an endpoint if another user configured auto scaling for a variant on that endpoint.

# Instance storage volumes
<a name="host-instance-storage"></a>

When you create an endpoint, Amazon SageMaker AI attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to the Amazon EC2 instances that host the endpoint. The size of the storage volume is scalable, and storage options are divided into two categories: SSD-backed storage and HDD-backed storage. 

For more information about Amazon EBS storage and its features, see the following pages.
+ [Amazon EBS Features](https://aws.amazon.com/ebs/features/)
+ [Amazon EBS User Guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html)

For a full list of the host instance storage volumes, see the [Host Instance Storage Volumes Table](https://aws.amazon.com/releasenotes/host-instance-storage-volumes-table/).

**Note**  
Amazon SageMaker AI attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to Amazon EC2 instances only when you create [Asynchronous inference](async-inference.md) or [Real-time inference](realtime-endpoints.md) endpoint types. For more information on customizing Amazon EBS storage volume, see [SageMaker AI endpoint parameters for large model inference](large-model-inference-hosting.md).

# Validation of models in production
<a name="model-validation"></a>

 With SageMaker AI, you can test multiple models or model versions behind the same endpoint using variants. A variant consists of an ML instance and the serving components specified in a SageMaker AI model. You can have multiple variants behind an endpoint. Each variant can have a different instance type or a SageMaker AI model that can be autoscaled independently of the others. The models within the variants can be trained using different datasets, different algorithms, different ML frameworks, or any combination of all of these. All the variants behind an endpoint share the same inference code. SageMaker AI supports two types of variants, production variants and shadow variants. 

 If you have multiple production variants behind an endpoint, then you can allocate a portion of your inference requests to each variant. Each request is routed to only one of the production variants. The production variant to which the request was routed provides the response to the caller. You can compare how the production variants perform relative to each other. 

 You can also have a shadow variant corresponding to a production variant behind an endpoint. A portion of the inference requests that goes to the production variant is replicated to the shadow variant. The responses of the shadow variant are logged for comparison and not returned to the caller. This lets you test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant. 

**Topics**
+ [Testing models with production variants](model-ab-testing.md)
+ [Testing models with shadow variants](model-shadow-deployment.md)

# Testing models with production variants
<a name="model-ab-testing"></a>

 In production ML workflows, data scientists and engineers frequently try to improve performance using various methods, such as [Automatic model tuning with SageMaker AI](automatic-model-tuning.md), training on additional or more recent data, improving feature selection, and using updated instance types and serving containers. You can use production variants to compare your models, instances, and containers, and choose the best-performing candidate to respond to inference requests. 

 With SageMaker AI multi-variant endpoints you can distribute endpoint invocation requests across multiple production variants by providing the traffic distribution for each variant, or you can invoke a specific variant directly for each request. In this topic, we look at both methods for testing ML models. 

**Topics**
+ [Test models by specifying traffic distribution](#model-testing-traffic-distribution)
+ [Test models by invoking specific variants](#model-testing-target-variant)
+ [Model A/B test example](#model-ab-test-example)

## Test models by specifying traffic distribution
<a name="model-testing-traffic-distribution"></a>

 To test multiple models by distributing traffic between them, specify the percentage of the traffic that gets routed to each model by specifying the weight for each production variant in the endpoint configuration. For information, see [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). The following diagram shows how this works in more detail. 

![\[Example showing how distributing traffic between models using InvokeEndpoint works in SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-traffic-distribution.png)


## Test models by invoking specific variants
<a name="model-testing-target-variant"></a>

 To test multiple models by invoking specific models for each request, specify the specific version of the model you want to invoke by providing a value for the `TargetVariant` parameter when you call [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html). SageMaker AI ensures that the request is processed by the production variant you specify. If you have already provided traffic distribution and specify a value for the `TargetVariant` parameter, the targeted routing overrides the random traffic distribution. The following diagram shows how this works in more detail. 

![\[Example showing how invoking specific models for each request using InvokeEndpoint works in SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-target-variant.png)


## Model A/B test example
<a name="model-ab-test-example"></a>

 Performing A/B testing between a new model and an old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how each variant performs. If the newer version of the model delivers better performance than the previously existing version, replace the old version of the model with the new version in production. 

 The following example shows how to perform A/B model testing. For a sample notebook that implements this example, see [A/B Testing ML models in production](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_endpoints/a_b_testing/a_b_testing.html). 

### Step 1: Create and deploy models
<a name="model-ab-test-step1"></a>

 First, we define where our models are located in Amazon S3. These locations are used when we deploy our models in subsequent steps: 

```
model_url = f"s3://{path_to_model_1}"
model_url2 = f"s3://{path_to_model_2}"
```

 Next, we create the model objects with the image and model data. These model objects are used to deploy production variants on an endpoint. The models are developed by training ML models on different data sets, different algorithms or ML frameworks, and different hyperparameters: 

```
from datetime import datetime

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

model_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
model_name2 = f"DEMO-xgb-churn-pred2-{datetime.now():%Y-%m-%d-%H-%M-%S}"
image_uri = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
image_uri2 = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-2')

sm_session.create_model(
    name=model_name,
    role=role,
    container_defs={
        'Image': image_uri,
        'ModelDataUrl': model_url
    }
)

sm_session.create_model(
    name=model_name2,
    role=role,
    container_defs={
        'Image': image_uri2,
        'ModelDataUrl': model_url2
    }
)
```

 We now create two production variants, each with its own different model and resource requirements (instance type and counts). This enables you to also test models on different instance types. 

 We set an initial weight of 1 for both variants. This means that 50% of requests go to `Variant1`, and the remaining 50% of requests go to `Variant2`. The sum of weights across both variants is 2 and each variant has a weight assignment of 1. This means that each variant receives 1/2, or 50%, of the total traffic. 

```
from sagemaker.session import production_variant

variant1 = production_variant(
               model_name=model_name,
               instance_type="ml.m5.xlarge",
               initial_instance_count=1,
               variant_name='Variant1',
               initial_weight=1,
           )

variant2 = production_variant(
               model_name=model_name2,
               instance_type="ml.m5.xlarge",
               initial_instance_count=1,
               variant_name='Variant2',
               initial_weight=1,
           )
```

 Finally we’re ready to deploy these production variants on a SageMaker AI endpoint. 

```
endpoint_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
print(f"EndpointName={endpoint_name}")

sm_session.endpoint_from_production_variants(
    name=endpoint_name,
    production_variants=[variant1, variant2]
)
```

### Step 2: Invoke the deployed models
<a name="model-ab-test-step2"></a>

 Now we send requests to this endpoint to get inferences in real time. We use both traffic distribution and direct targeting. 

 First, we use traffic distribution that we configured in the previous step. Each inference response contains the name of the production variant that processes the request, so we can see that traffic to the two production variants is roughly equal. 

```
# get a subset of test data for a quick test
!tail -120 test_data/test-dataset-input-cols.csv > test_data/test_sample_tail_input_cols.csv
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")

with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload
        )
        time.sleep(0.5)

print("Done!")
```

 SageMaker AI emits metrics such as `Latency` and `Invocations` for each variant in Amazon CloudWatch. For a complete list of metrics that SageMaker AI emits, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). Let’s query CloudWatch to get the number of invocations per variant, to show how invocations are split across variants by default: 

![\[Example CloudWatch number of invocations per variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-variant-invocations.png)
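
A per-variant invocation count like the one in the chart can also be queried programmatically. The sketch below only builds the `GetMetricStatistics` parameters; the commented-out call requires AWS credentials and the endpoint created above:

```python
from datetime import datetime, timedelta, timezone

def invocations_query(endpoint_name, variant_name, minutes=60):
    """Build GetMetricStatistics parameters for a variant's invocation count."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,          # one datapoint per minute
        "Statistics": ["Sum"],
    }

# Live query (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# datapoints = cw.get_metric_statistics(**invocations_query(endpoint_name, "Variant1"))["Datapoints"]
```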


 Now let's invoke a specific version of the model by specifying `Variant1` as the `TargetVariant` in the call to `invoke_endpoint`. 

```
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")
with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload,
            TargetVariant="Variant1"
        ) 
        time.sleep(0.5)
```

 To confirm that all new invocations were processed by `Variant1`, we can query CloudWatch to get the number of invocations per variant. We see that for the most recent invocations (latest timestamp), all requests were processed by `Variant1`, as we had specified. There were no invocations made for `Variant2`. 

![\[Example CloudWatch number of invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-target1.png)


### Step 3: Evaluate model performance
<a name="model-ab-test-step3"></a>

 To see which model version performs better, let's evaluate the accuracy, precision, recall, F1 score, and receiver operating characteristic/area under the curve (ROC/AUC) for each variant. First, let's look at these metrics for `Variant1`: 

![\[Example receiver operating characteristic curve for Variant1.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-curve.png)


Now let's look at the metrics for `Variant2`:

![\[Example receiver operating characteristic curve for Variant2.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model2-curve.png)


 For most of our defined metrics, `Variant2` is performing better, so this is the one that we want to use in production. 

### Step 4: Increase traffic to the best model
<a name="model-ab-test-step4"></a>

 Now that we have determined that `Variant2` performs better than `Variant1`, we shift more traffic to it. We can continue to use `TargetVariant` to invoke a specific model variant, but a simpler approach is to update the weights assigned to each variant by calling [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html). This changes the traffic distribution to your production variants without requiring updates to your endpoint. Recall from the setup section that we set variant weights to split traffic 50/50. The CloudWatch metrics for the total invocations for each variant below show us the invocation patterns for each variant: 

![\[Example CloudWatch metrics for the total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-even-dist.png)


 Now we shift 75% of the traffic to `Variant2` by assigning new weights to each variant using `UpdateEndpointWeightsAndCapacities`. SageMaker AI now sends 75% of the inference requests to `Variant2` and the remaining 25% to `Variant1`. 

```
sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 25,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 75,
            "VariantName": variant2["VariantName"]
        }
    ]
)
```

 The CloudWatch metrics for total invocations for each variant show higher invocations for `Variant2` than for `Variant1`: 

![\[Example CloudWatch metrics for total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-75-25.png)


 We can continue to monitor our metrics, and when we're satisfied with a variant's performance, we can route 100% of the traffic to that variant. We use [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html) to update the traffic assignments for the variants. The weight for `Variant1` is set to 0 and the weight for `Variant2` is set to 1. SageMaker AI then sends 100% of all inference requests to `Variant2`. 

```
sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 0,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 1,
            "VariantName": variant2["VariantName"]
        }
    ]
)
```

 The CloudWatch metrics for the total invocations for each variant show that all inference requests are being processed by `Variant2` and there are no inference requests processed by `Variant1`. 

![\[Example CloudWatch metrics for the total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-best-model.png)


 You can now safely update your endpoint and delete `Variant1` from your endpoint. You can also continue testing new models in production by adding new variants to your endpoint and following steps 2 - 4. 
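
Removing a variant is done by applying a new endpoint configuration that omits it. A minimal sketch (the config and model names here are hypothetical; the live calls are commented out because they require AWS credentials and would modify the endpoint):

```python
# Hypothetical names; substitute the model and config names from your account.
model_name2 = "your-second-model-name"
new_config_name = "variant2-only-config"

def single_variant_config(endpoint_config_name, model_name, variant_name="Variant2"):
    """Build CreateEndpointConfig parameters that keep only the winning variant."""
    return {
        "EndpointConfigName": endpoint_config_name,
        "ProductionVariants": [
            {
                "VariantName": variant_name,
                "ModelName": model_name,
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1,
            }
        ],
    }

params = single_variant_config(new_config_name, model_name2)

# Live calls (require AWS credentials):
# sm.create_endpoint_config(**params)
# sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_config_name)
```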

# Testing models with shadow variants
<a name="model-shadow-deployment"></a>

 You can use SageMaker AI Model Shadow Deployments to create long-running shadow variants to validate any new candidate component of your model serving stack before promoting it to production. The following diagram shows how shadow variants work in more detail. 

![\[Details of a shadow variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-variant.png)


## Deploy shadow variants
<a name="model-shadow-deployment-deploy"></a>

 The following code example shows how you can programmatically deploy shadow variants. Replace the *user placeholder text* in the example with your own information. 

1.  Create two SageMaker AI models: one for your production variant, and one for your shadow variant. 

   ```
   import boto3
   from sagemaker import get_execution_role, Session
                   
   aws_region = "aws-region"
   
   boto_session = boto3.Session(region_name=aws_region)
   sagemaker_client = boto_session.client("sagemaker")
   
   role = get_execution_role()
   
   bucket = Session(boto_session).default_bucket()
   
   model_name1 = "name-of-your-first-model"
   model_name2 = "name-of-your-second-model"
   
   sagemaker_client.create_model(
       ModelName = model_name1,
       ExecutionRoleArn = role,
       Containers=[
           {
               "Image": "ecr-image-uri-for-first-model",
               "ModelDataUrl": "s3-location-of-trained-first-model" 
           }
       ]
   )
   
   sagemaker_client.create_model(
       ModelName = model_name2,
       ExecutionRoleArn = role,
       Containers=[
           {
               "Image": "ecr-image-uri-for-second-model",
               "ModelDataUrl": "s3-location-of-trained-second-model" 
           }
       ]
   )
   ```

1.  Create an endpoint configuration. Specify both your production and shadow variants in the configuration. 

   ```
   endpoint_config_name = "name-of-your-endpoint-config"
   
   create_endpoint_config_response = sagemaker_client.create_endpoint_config(
       EndpointConfigName=endpoint_config_name,
       ProductionVariants=[
           {
               "VariantName": "name-of-your-production-variant",
               "ModelName": model_name1,
               "InstanceType": "ml.m5.xlarge",
               "InitialInstanceCount": 1,
               "InitialVariantWeight": 1,
           }
       ],
       ShadowProductionVariants=[
           {
               "VariantName": "name-of-your-shadow-variant",
               "ModelName": model_name2,
               "InstanceType": "ml.m5.xlarge",
               "InitialInstanceCount": 1,
               "InitialVariantWeight": 1,
           }
       ]
   )
   ```

1. Create an endpoint.

   ```
   create_endpoint_response = sagemaker_client.create_endpoint(
       EndpointName="name-of-your-endpoint",
       EndpointConfigName=endpoint_config_name,
   )
   ```
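
After the endpoint is in service, requests you send are served by the production variant, and SageMaker AI forwards a copy of the sampled traffic to the shadow variant automatically. A minimal sketch of building such an invocation (the endpoint name is the placeholder above; the live call is commented out because it requires AWS credentials):

```python
def build_invocation(endpoint_name, payload):
    """Build InvokeEndpoint parameters; the shadow variant receives a copy automatically."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": payload,
    }

params = build_invocation("name-of-your-endpoint", "1,2,3,4")

# Live call (requires AWS credentials):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**params)
```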

# Online explainability with SageMaker Clarify
<a name="clarify-online-explainability"></a>

This guide shows how to configure online explainability with SageMaker Clarify. With SageMaker AI [real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) endpoints, you can analyze explainability in real time, continuously. The online explainability function fits into the **Deploy to production** part of the [Amazon SageMaker AI Machine Learning workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html).

## How Clarify Online Explainability Works
<a name="clarify-online-explainability-how-it-works"></a>

The following graphic depicts SageMaker AI architecture for hosting an endpoint that serves explainability requests. It depicts interactions between an endpoint, the model container, and the SageMaker Clarify explainer.

![\[SageMaker AI architecture showing hosting an endpoint that serves on-demand explainability requests.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/DeveloperGuideArchitecture.png)


Here's how Clarify online explainability works. The application sends a REST-style `InvokeEndpoint` request to the SageMaker AI Runtime Service. The service routes this request to a SageMaker AI endpoint to obtain predictions and explanations. Then, the service receives the response from the endpoint. Lastly, the service sends the response back to the application.

To increase the endpoint availability, SageMaker AI automatically attempts to distribute endpoint instances in multiple Availability Zones, according to the instance count in the endpoint configuration. On an endpoint instance, upon a new explainability request, the SageMaker Clarify explainer calls the model container for predictions. Then it computes and returns the feature attributions.

Here are the four steps to create an endpoint that uses SageMaker Clarify online explainability:

1. Check if your pre-trained SageMaker AI model is compatible with online explainability by following the [pre-check](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-online-explainability-precheck.html) steps.

1. [Create an endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) with the SageMaker Clarify explainer configuration using the `CreateEndpointConfig` API.

1. [Create an endpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) and provide the endpoint configuration to SageMaker AI using the `CreateEndpoint` API. The service launches the ML compute instance and deploys the model as specified in the configuration.

1. [Invoke the endpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html): After the endpoint is in service, call the SageMaker AI Runtime API `InvokeEndpoint` to send requests to the endpoint. The endpoint then returns explanations and predictions.

# Pre-check the model container
<a name="clarify-online-explainability-precheck"></a>

This section shows how to pre-check the model container inputs and outputs for compatibility before configuring an endpoint. The SageMaker Clarify explainer is **model agnostic**, but it has requirements for model container input and output.

**Note**  
You can increase efficiency by configuring your container to support batch requests, which support two or more records in a single request. For example, a single record is a single line of CSV data, or a single line of JSON Lines data. SageMaker Clarify will attempt to send a mini-batch of records to the model container first before falling back to single record requests.

## Model container input
<a name="clarify-online-explainability-input"></a>

------
#### [ CSV ]

The model container supports input in CSV format with MIME type `text/csv`. The following table shows example inputs that SageMaker Clarify supports.


| Model container input (string representation) | Comments | 
| --- | --- | 
|  '1,2,3,4'  |  Single record that uses four numerical features.  | 
|  '1,2,3,4\n5,6,7,8'  |  Two records, separated by the line break '\n'.  | 
|  '"This is a good product",5'  |  Single record that contains a text feature and a numerical feature.  | 
|  '"This is a good product",5\n"Bad shopping experience",1'  |  Two records.  | 

------
#### [ JSON Lines ]

SageMaker AI also supports input in [JSON Lines dense format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-jsonlines) with MIME type `application/jsonlines`, as shown in the following table.


| Model container input | Comments | 
| --- | --- | 
|  '{"data":{"features":[1,2,3,4]}}'  |  Single record; a list of features can be extracted by the JMESPath expression `data.features`.  | 
|  '{"data":{"features":[1,2,3,4]}}\n{"data":{"features":[5,6,7,8]}}'  |  Two records.  | 
|  '{"features":["This is a good product",5]}'  |  Single record; a list of features can be extracted by the JMESPath expression `features`.  | 
|  '{"features":["This is a good product",5]}\n{"features":["Bad shopping experience",1]}'  |  Two records.  | 

------

## Model container output
<a name="clarify-online-explainability-output"></a>

Your model container output should also be in either CSV or JSON Lines dense format. Additionally, the model container should include the probabilities of the input records, which SageMaker Clarify uses to compute feature attributions.

The following data examples are for model container outputs in **CSV format**.

------
#### [ Probability only ]

For regression and binary classification problems, the model container outputs a single probability value (score) for the predicted label, which can be extracted using column index 0. For multi-class problems, the model container outputs a list of probabilities (scores); if no index is provided, all values are extracted.


| Model container input | Model container output (string representation) | 
| --- | --- | 
|  Single record  |  '0.6'  | 
|  Two records (results in one line)  |  '0.6,0.3'  | 
|  Two records (results in two lines)  |  '0.6\n0.3'  | 
|  Single record of a multi-class model (three classes)  |  '0.1,0.6,0.3'  | 
|  Two records of a multi-class model (three classes)  |  '0.1,0.6,0.3\n0.2,0.5,0.3'  | 

------
#### [ Predicted label and probabilities ]

The model container outputs the predicted label followed by its probability in **CSV** format. The probabilities can be extracted using index `1`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '1,0.6'  | 
|  Two records  |  '1,0.6\n0,0.3'  | 

------
#### [ Predicted labels header and probabilities ]

A multi-class model container trained by Autopilot can be configured to output **the string representation** of the list of predicted labels and probabilities in **CSV** format. In the following example, the probabilities can be extracted using index `1`, and the label headers can be extracted using index `0`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '"[\'cat\',\'dog\',\'fish\']","[0.1,0.6,0.3]"'  | 
|  Two records  |  '"[\'cat\',\'dog\',\'fish\']","[0.1,0.6,0.3]"\n"[\'cat\',\'dog\',\'fish\']","[0.2,0.5,0.3]"'  | 

------

The following data examples are for model container outputs in **JSON Lines** format.

------
#### [ Probability only ]

In this example, the model container outputs the probability that can be extracted by the [JMESPath](https://jmespath.org/) expression `score` in **JSON Lines** format.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"score":0.6}'  | 
|  Two records  |  '{"score":0.6}\n{"score":0.3}'  | 

------
#### [ Predicted label and probabilities ]

In this example, the model container outputs the predicted label followed by its probability in **JSON Lines** format. The probability can be extracted by the `JMESPath` expression `probability`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"predicted_label":1,"probability":0.6}'  | 
|  Two records  |  '{"predicted_label":1,"probability":0.6}\n{"predicted_label":0,"probability":0.3}'  | 

------
#### [ Predicted labels header and probabilities ]

In this example, a multi-class model container outputs a list of label headers and probabilities in **JSON Lines** format. The probabilities can be extracted by the `JMESPath` expression `probabilities`, and the label headers can be extracted by the `JMESPath` expression `predicted_labels`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}'  | 
|  Two records  |  '{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}\n{"predicted_labels":["cat","dog","fish"],"probabilities":[0.2,0.5,0.3]}'  | 

------

## Model container validation
<a name="clarify-online-explainability-container-validation"></a>

We recommend that you deploy your model to a SageMaker AI real-time inference endpoint, and send requests to the endpoint. Manually examine the requests (model container inputs) and responses (model container outputs) to make sure that both are compliant with the requirements in the **Model Container Input** section and **Model Container Output** section. If your model container supports batch requests, you can start with a single record request, and then try two or more records.

The following commands show how to request a response using the AWS CLI. The AWS CLI is pre-installed in SageMaker Studio Classic and on SageMaker notebook instances. If you need to install the AWS CLI, follow this [installation guide](https://aws.amazon.com/cli/).

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type $CONTENT_TYPE \
  --accept $ACCEPT_TYPE \
  --body $REQUEST_DATA \
  $CLI_BINARY_FORMAT \
  /dev/stderr 1>/dev/null
```

The parameters are defined, as follows:
+ `$ENDPOINT_NAME`: The name of the endpoint.
+ `$CONTENT_TYPE`: The MIME type of the request (model container input).
+ `$ACCEPT_TYPE`: The MIME type of the response (model container output).
+ `$REQUEST_DATA`: The request payload string.
+ `$CLI_BINARY_FORMAT`: The format of the command line interface (CLI) parameter. For AWS CLI v1, this parameter should remain blank. For v2, this parameter should be set to `--cli-binary-format raw-in-base64-out`.

**Note**  
AWS CLI v2 passes binary parameters as base64-encoded strings by [default](https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration.html#cliv2-migration-binaryparam).

The following examples use AWS CLI v1:

------
#### [ Request and response in CSV format ]
+ The request consists of a single record and the response is its probability value.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-sagemaker-xgboost-model \
    --content-type text/csv \
    --accept text/csv \
    --body '1,2,3,4' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`
+ The request consists of two records, and the response includes their probabilities separated by a comma. The `$'...'` syntax used for `--body` tells the shell to interpret `\n` in the content as a line break.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-sagemaker-xgboost-model \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6,0.3`
+ The request consists of two records, the response includes their probabilities, and the model separates the probabilities with a line break.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`

  `0.3`
+ The request consists of a single record, and the response is probability values (multi-class model, three classes).

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body '1,2,3,4' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.1,0.6,0.3`
+ The request consists of two records, and the response includes their probability values (multi-class model, three classes).

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.1,0.6,0.3`

  `0.2,0.5,0.3`
+ The request consists of two records, and the response includes predicted label and probability.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-2 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `1,0.6`

  `0,0.3`
+ The request consists of two records and the response includes label headers and probabilities.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-3 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `"['cat','dog','fish']","[0.1,0.6,0.3]"`

  `"['cat','dog','fish']","[0.2,0.5,0.3]"`

------
#### [ Request and response in JSON Lines format ]
+ The request consists of a single record and the response is its probability value.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body '{"features":["This is a good product",5]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"score":0.6}`
+ The request contains two records, and the response includes predicted label and probability.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-2 \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"predicted_label":1,"probability":0.6}`

  `{"predicted_label":0,"probability":0.3}`
+ The request contains two records and the response includes label headers and probabilities.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-3 \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body $'{"data":{"features":[1,2,3,4]}}\n{"data":{"features":[5,6,7,8]}}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}`

  `{"predicted_labels":["cat","dog","fish"],"probabilities":[0.2,0.5,0.3]}`

------
#### [ Request and response in different formats ]
+ The request is in CSV format and the response is in JSON Lines format:

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-in-jsonlines-out \
    --content-type text/csv \
    --accept application/jsonlines \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"probability":0.6}`

  `{"probability":0.3}`
+ The request is in JSON Lines format and the response is in CSV format:

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-in-csv-out \
    --content-type application/jsonlines \
    --accept text/csv \
    --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`

  `0.3`

------

After the validations are complete, [delete](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-delete-resources.html) the testing endpoint.

# Configure and create an endpoint
<a name="clarify-online-explainability-create-endpoint"></a>

Create a new endpoint configuration to fit your model, and use this configuration to create the endpoint. You can use the model container validated in the [pre-check step ](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-online-explainability-precheck.html) to create an endpoint and enable the SageMaker Clarify online explainability feature.

Use the `sagemaker_client` object to create an endpoint configuration using the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. Set the member `ClarifyExplainerConfig` inside the `ExplainerConfig` parameter as follows:

```
sagemaker_client.create_endpoint_config(
    EndpointConfigName='name-of-your-endpoint-config',
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'EnableExplanations': '`true`',
            'InferenceConfig': {
                ...
            },
            'ShapConfig': {
                ...
            }
        },
    },
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'name-of-your-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }]
     ...
)
sagemaker_client.create_endpoint(
    EndpointName='name-of-your-endpoint',
    EndpointConfigName='name-of-your-endpoint-config'
)
```

The first call to the `sagemaker_client` object creates a new endpoint configuration with the explainability feature enabled. The second call uses the endpoint configuration to launch the endpoint.

**Note**  
You can also host multiple models in one container behind a [SageMaker AI real-time inference multi-model endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) and configure online explainability with SageMaker Clarify.

# The `EnableExplanations` expression
<a name="clarify-online-explainability-create-endpoint-enable"></a>

The `EnableExplanations` parameter is a [JMESPath](https://jmespath.org/) Boolean expression string. It is evaluated for **each record** in the explainability request. If the expression evaluates to **true**, the record is explained. If it evaluates to **false**, no explanations are generated.

SageMaker Clarify deserializes the model container output for each record into a JSON-compatible data structure, and then uses the `EnableExplanations` parameter to evaluate the data.

**Notes**  
There are two options for records depending on the format of the model container output.  
If the model container output is in CSV format, then a record is loaded as a JSON array.
If the model container output is in JSON Lines format, then a record is loaded as a JSON object.

The `EnableExplanations` parameter is a JMESPath expression that can be passed either during the `InvokeEndpoint` or `CreateEndpointConfig` operations. If the JMESPath expression that you supplied is not valid, the endpoint creation will fail. If the expression is valid, but the expression evaluation result is unexpected, then the endpoint will be created successfully, but an error will be generated when the endpoint is invoked. Test your `EnableExplanations` expression by using the `InvokeEndpoint` API, and then apply it to the endpoint configuration.

The following are some examples of valid `EnableExplanations` expressions. In the examples, a JMESPath expression encloses a literal in backtick characters; for example, `` `true` `` means the literal Boolean value true.


| Expression (string representation) | Model container output (string representation) | Evaluation result (Boolean) | Meaning | 
| --- | --- | --- | --- | 
|  '`true`'  |  (N/A)  |  True  |  Activate online explainability unconditionally.  | 
|  '`false`'  |  (N/A)  |  False  |  Deactivate online explainability unconditionally.  | 
|  '[1]>`0.5`'  |  '1,0.6'  |  True  |  For each record, the model container outputs its predicted label and probability. Explains a record if its probability (at index 1) is greater than 0.5.  | 
|  'probability>`0.5`'  |  '{"predicted_label":1,"probability":0.6}'  |  True  |  For each record, the model container outputs JSON data. Explains a record if its probability is greater than 0.5.  | 
|  '!contains(probabilities[:-1], max(probabilities))'  |  '{"probabilities": [0.4, 0.1, 0.4], "labels":["cat","dog","fish"]}'  |  False  |  For a multi-class model: explains a record if its predicted label (the class that has the max probability value) is the last class. Literally, the expression means that the max probability value is not in the list of probabilities excluding the last one.  | 
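
To build intuition for how these expressions evaluate, the snippet below mirrors the third and fifth table rows in plain Python. This is only an illustration of the semantics, not Clarify's actual JMESPath evaluator:

```python
def explain_if_probability_high(record, threshold=0.5):
    """Mirrors '[1]>`0.5`': a CSV output record is loaded as a JSON array,
    so the probability sits at index 1."""
    return record[1] > threshold

def explain_unless_last_class(record):
    """Mirrors '!contains(probabilities[:-1], max(probabilities))': true only
    when the max probability is not among all probabilities except the last."""
    probs = record["probabilities"]
    return max(probs) not in probs[:-1]

print(explain_if_probability_high([1, 0.6]))                          # True
print(explain_unless_last_class({"probabilities": [0.4, 0.1, 0.4]}))  # False
```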

# Synthetic dataset
<a name="clarify-online-explainability-create-endpoint-synthetic"></a>

SageMaker Clarify uses the Kernel SHAP algorithm. Given a record (also called a sample or an instance) and the SHAP configuration, the explainer first generates a synthetic dataset. SageMaker Clarify then queries the model container for the predictions of the dataset, and then computes and returns the feature attributions. The size of the synthetic dataset affects the runtime for the Clarify explainer. Larger synthetic datasets take more time to obtain model predictions than smaller ones.

 The synthetic dataset size is determined by the following formula:

```
Synthetic dataset size = SHAP baseline size * n_samples
```

The SHAP baseline size is the number of records in the SHAP baseline data. This information is taken from the `ShapBaselineConfig`.

The size of `n_samples` is set by the parameter `NumberOfSamples` in the explainer configuration and the number of features. If the number of features is `n_features`, then `n_samples` is the following: 

```
n_samples = MIN(NumberOfSamples, 2^n_features - 2)
```

If `NumberOfSamples` is not provided, `n_samples` defaults to the following:

```
n_samples = MIN(2*n_features + 2^11, 2^n_features - 2)
```

For example, a tabular record with 10 features has a SHAP baseline size of 1. If `NumberOfSamples` is not provided, the synthetic dataset contains 1022 records. If the record has 20 features, the synthetic dataset contains 2088 records.
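
The arithmetic behind these example numbers can be checked with a short helper that implements the formulas above (plain Python; `baseline_size` and `number_of_samples` correspond to the SHAP baseline size and `NumberOfSamples`):

```python
def synthetic_dataset_size(n_features, baseline_size=1, number_of_samples=None):
    """Compute the synthetic dataset size: SHAP baseline size * n_samples."""
    cap = 2 ** n_features - 2
    if number_of_samples is None:
        n_samples = min(2 * n_features + 2 ** 11, cap)
    else:
        n_samples = min(number_of_samples, cap)
    return baseline_size * n_samples

print(synthetic_dataset_size(10))  # 1022
print(synthetic_dataset_size(20))  # 2088
```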

For NLP problems, `n_features` is equal to the number of non-text features plus the number of text units.

**Note**  
The `InvokeEndpoint` API has a request timeout limit. If the synthetic dataset is too large, the explainer may not be able to complete the computation within this limit. If necessary, use the previous information to understand and reduce the SHAP baseline size and `NumberOfSamples`. If your model container is set up to handle batch requests, then you can also adjust the value of `MaxRecordCount`.

# Invoke the endpoint
<a name="clarify-online-explainability-invoke-endpoint"></a>

After the endpoint is running, use the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API in the SageMaker AI Runtime service to send requests to, or invoke, the endpoint. The SageMaker Clarify explainer then handles these requests as explainability requests.

**Note**  
To invoke an endpoint, choose one of the following options:  
For instructions to use Boto3 or the AWS CLI to invoke an endpoint, see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).
To use the SageMaker SDK for Python to invoke an endpoint, see the [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) API.

## Request
<a name="clarify-online-explainability-request"></a>

The `InvokeEndpoint` API has an optional parameter `EnableExplanations`, which is mapped to the HTTP header `X-Amzn-SageMaker-Enable-Explanations`. If this parameter is provided, it overrides the `EnableExplanations` parameter of the `ClarifyExplainerConfig`.

**Note**  
The `ContentType` and `Accept` parameters of the `InvokeEndpoint` API are required. Supported formats include MIME type `text/csv` and `application/jsonlines`.

Use the `sagemaker_runtime_client` to send a request to the endpoint, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName='name-of-your-endpoint',
    EnableExplanations='true',
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',  # single record (of four numerical features)
)
```

For multi-model endpoints, pass an additional `TargetModel` parameter in the previous example request to specify which model to target at the endpoint. The multi-model endpoint dynamically loads target models as needed. For more information about multi-model endpoints, see [Multi-model endpoints](multi-model-endpoints.md). See the [SageMaker Clarify Online Explainability on Multi-Model Endpoint Sample Notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/tabular_multi_model_endpoint/multi_model_xgboost_with_online_explainability.ipynb) for an example of how to set up and invoke multiple target models from a single endpoint.

## Response
<a name="clarify-online-explainability-response"></a>

If the endpoint is created with `ExplainerConfig`, then a new response schema is used. This new schema is different from, and not compatible with, the schema of an endpoint that lacks the `ExplainerConfig` parameter.

The MIME type of the response is `application/json`, and the response payload can be decoded from UTF-8 bytes to a JSON object. The members of this JSON object are as follows:
+ `version`: The version of the response schema in string format. For example, `1.0`.
+ `predictions`: The predictions made for the request, with the following members:
  + `content_type`: The MIME type of the predictions, referring to the `ContentType` of the model container response.
  + `data`: The predictions data string delivered as the payload of the model container response for the request.
+ `label_headers`: The label headers from the `LabelHeaders` parameter. This is provided in either the explainer configuration or the model container output.
+ `explanations`: The explanations for the records in the request payload. If no records are explained, then this member returns the empty object `{}`.
  + `kernel_shap`: A key that refers to an array of Kernel SHAP explanations for each record in the request. If a record is not explained, the corresponding explanation is `null`.

The `kernel_shap` element has the following members:
+ `feature_header`: The header name of the features provided by the `FeatureHeaders` parameter in the explainer configuration `ExplainerConfig`.
+ `feature_type`: The feature type inferred by the explainer or provided in the `FeatureTypes` parameter in the `ExplainerConfig`. This element is only available for NLP explainability problems.
+ `attributions`: An array of attribution objects. Text features can have multiple attribution objects, each for a unit. The attribution object has the following members:
  + `attribution`: A list of probability values, given for each class.
  + `description`: The description of the text units, available only for NLP explainability problems.
    + `partial_text`: The portion of the text explained by the explainer.
    + `start_idx`: A zero-based index to identify the array location of the beginning of the partial text fragment.

# Code examples: SDK for Python
<a name="clarify-online-explainability-examples"></a>

This section provides sample code to create and invoke an endpoint that uses SageMaker Clarify online explainability. These code examples use the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/).

## Tabular data
<a name="clarigy-online-explainability-examples-tabular"></a>

The following example uses tabular data and a SageMaker AI model called `model_name`. In this example, the model container accepts data in CSV format, and each record has four numerical features. In this minimal configuration, **for demonstration purposes only**, the SHAP baseline data is set to zero. Refer to [SHAP Baselines for Explainability](clarify-feature-attribute-shap-baselines.md) to learn how to choose more appropriate values for `ShapBaseline`.

Configure the endpoint, as follows:

```
endpoint_config_name = 'tabular_explainer_endpoint_config'
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }],
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'ShapConfig': {
                'ShapBaselineConfig': {
                    'ShapBaseline': '0,0,0,0',
                },
            },
        },
    },
)
```

Use the endpoint configuration to create an endpoint, as follows:

```
endpoint_name = 'tabular_explainer_endpoint'
response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```

Use the `DescribeEndpoint` API to inspect the progress of creating an endpoint, as follows:

```
response = sagemaker_client.describe_endpoint(
    EndpointName=endpoint_name,
)
response['EndpointStatus']
```

After the endpoint status is `InService`, invoke the endpoint with a test record, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',
)
```

**Note**  
In the previous code example, for multi-model endpoints, pass an additional `TargetModel` parameter in the request to specify which model to target at the endpoint.

Assume that the response has a status code of 200 (no error), and load the response body, as follows:

```
import codecs
import json
json.load(codecs.getreader('utf-8')(response['Body']))
```

The default action for the endpoint is to explain the record. The following shows example output in the returned JSON object.

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv; charset=utf-8",
        "data": "0.0006380207487381"
    },
    "explanations": {
        "kernel_shap": [
            [
                {
                    "attributions": [
                        {
                            "attribution": [-0.00433456]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [-0.005369821]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [0.007917749]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [-0.00261214]
                        }
                    ]
                }
            ]
        ]
    }
}
```
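To work with a response like the one above, you can flatten the nested `kernel_shap` structure into one attribution value per feature. The following is a sketch based on the schema shown; the helper name is ours, and it assumes a single-class model, so each `attribution` list has one element:

```python
def feature_attributions(response_json):
    """Extract one SHAP value per feature for each explained record."""
    results = []
    for record in response_json["explanations"].get("kernel_shap", []):
        if record is None:  # this record was not explained
            results.append(None)
            continue
        results.append([f["attributions"][0]["attribution"][0] for f in record])
    return results

# The example response body from above, as a Python dict:
response_json = {
    "version": "1.0",
    "predictions": {"content_type": "text/csv; charset=utf-8",
                    "data": "0.0006380207487381"},
    "explanations": {"kernel_shap": [[
        {"attributions": [{"attribution": [-0.00433456]}]},
        {"attributions": [{"attribution": [-0.005369821]}]},
        {"attributions": [{"attribution": [0.007917749]}]},
        {"attributions": [{"attribution": [-0.00261214]}]},
    ]]},
}
print(feature_attributions(response_json))
# [[-0.00433456, -0.005369821, 0.007917749, -0.00261214]]
```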

Use the `EnableExplanations` parameter to enable on-demand explanations, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',
    EnableExplanations='[0]>0.8',
)
```

**Note**  
In the previous code example, for multi-model endpoints, pass an additional `TargetModel` parameter in the request to specify which model to target at the endpoint.

In this example, the prediction value is less than the threshold value of `0.8`, so the record is not explained:

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv; charset=utf-8",
        "data": "0.6380207487381995"
    },
    "explanations": {}
}
```

Use visualization tools to help interpret the returned explanations. The following image shows how SHAP plots can be used to understand how each feature contributes to the prediction. The base value on the diagram, also called the expected value, is the mean prediction over the training dataset. Features that push the prediction higher are shown in red, and features that push it lower are shown in blue. See [SHAP additive force layout](https://shap.readthedocs.io/en/latest/generated/shap.plots.force.html) for additional information.

![\[Example SHAP plot, that can be used to understand how each feature contributes to the prediction.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/force-plot.png)


See the [full example notebook for tabular data](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/tabular/tabular_online_explainability_with_sagemaker_clarify.ipynb). 

## Text data
<a name="clarigy-online-explainability-examples-text"></a>

This section provides a code example to create and invoke an online explainability endpoint for text data. The code example uses SDK for Python.

The following example uses text data and a SageMaker AI model called `model_name`. In this example, the model container accepts data in CSV format, and each record is a single string.

```
endpoint_config_name = 'text_explainer_endpoint_config'
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }],
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'InferenceConfig': {
                'FeatureTypes': ['text'],
                'MaxRecordCount': 100,
            },
            'ShapConfig': {
                'ShapBaselineConfig': {
                    'ShapBaseline': '"<MASK>"',
                },
                'TextConfig': {
                    'Granularity': 'token',
                    'Language': 'en',
                },
                'NumberOfSamples': 100,
            },
        },
    },
)
```
+ `ShapBaseline`: A special token reserved for natural language processing (NLP).
+ `FeatureTypes`: Identifies the feature as text. If this parameter is not provided, the explainer attempts to infer the feature type.
+ `TextConfig`: Specifies the unit of granularity and language for the analysis of text features. In this example, the language is English, and the granularity `token` means a word in English text.
+ `NumberOfSamples`: A limit that sets the upper bound on the size of the synthetic dataset.
+ `MaxRecordCount`: The maximum number of records in a request that the model container can handle. This parameter is set to stabilize performance.

Use the endpoint configuration to create the endpoint, as follows:

```
endpoint_name = 'text_explainer_endpoint'
response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```

After the status of the endpoint becomes `InService`, invoke the endpoint with a test record, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='"This is a good product"',
)
```

If the request completes successfully, the response body will return a valid JSON object that's similar to the following:

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv",
        "data": "0.9766594\n"
    },
    "explanations": {
        "kernel_shap": [
            [
                {
                    "attributions": [
                        {
                            "attribution": [
                                -0.007270948666666712
                            ],
                            "description": {
                                "partial_text": "This",
                                "start_idx": 0
                            }
                        },
                        {
                            "attribution": [
                                -0.018199033666666628
                            ],
                            "description": {
                                "partial_text": "is",
                                "start_idx": 5
                            }
                        },
                        {
                            "attribution": [
                                0.01970993241666666
                            ],
                            "description": {
                                "partial_text": "a",
                                "start_idx": 8
                            }
                        },
                        {
                            "attribution": [
                                0.1253469515833334
                            ],
                            "description": {
                                "partial_text": "good",
                                "start_idx": 10
                            }
                        },
                        {
                            "attribution": [
                                0.03291143366666657
                            ],
                            "description": {
                                "partial_text": "product",
                                "start_idx": 15
                            }
                        }
                    ],
                    "feature_type": "text"
                }
            ]
        ]
    }
}
```
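The `start_idx` values are offsets into the original input string, so you can line each attribution up with its source token. The following sketch rebuilds the input from the attribution objects in the example response (the helper name is ours, not part of any SDK):

```python
def rebuild_text(attributions):
    """Reassemble the input string from partial_text/start_idx pairs."""
    chars = []
    for unit in attributions:
        desc = unit["description"]
        # Pad with spaces up to this token's offset, then place the token.
        while len(chars) < desc["start_idx"]:
            chars.append(" ")
        chars[desc["start_idx"]:desc["start_idx"] + len(desc["partial_text"])] = \
            desc["partial_text"]
    return "".join(chars)

# Attribution objects from the example response above (values truncated):
tokens = [
    {"attribution": [-0.0072], "description": {"partial_text": "This", "start_idx": 0}},
    {"attribution": [-0.0181], "description": {"partial_text": "is", "start_idx": 5}},
    {"attribution": [0.0197],  "description": {"partial_text": "a", "start_idx": 8}},
    {"attribution": [0.1253],  "description": {"partial_text": "good", "start_idx": 10}},
    {"attribution": [0.0329],  "description": {"partial_text": "product", "start_idx": 15}},
]
print(rebuild_text(tokens))  # This is a good product
```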

Use visualization tools to help interpret the returned text attributions. The following image shows how the captum visualization utility can be used to understand how each word contributes to the prediction. The higher the color saturation, the higher the importance given to the word. In this example, a highly saturated bright red color indicates a strong negative contribution. A highly saturated green color indicates a strong positive contribution. The color white indicates that the word has a neutral contribution. See the [captum](https://github.com/pytorch/captum) library for additional information on parsing and rendering the attributions.

![\[Captum visualization utility used to understand how each word contributes to the prediction.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/word-importance.png)


See the [full example notebook for text](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/natural_language_processing/nlp_online_explainability_with_sagemaker_clarify.ipynb) data. 

# Troubleshooting guide
<a name="clarify-online-explainability-troubleshooting"></a>

If you encounter errors using SageMaker Clarify online explainability, consult the topics in this section.

**`InvokeEndpoint` API fails with the error "ReadTimeoutError:Read timeout on endpoint..."** 

This error means that the request could not be completed within the 60-second time limit set by the [request timeout](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).

To reduce the request latency, try the following:
+ Tune the model's performance during inference. For example, SageMaker AI [Neo](https://aws.amazon.com/sagemaker/neo/) can optimize models for inference.
+ Allow the model container to handle batch requests.
+ Use a larger `MaxRecordCount` to reduce the number of calls from the explainer to the model container. This will reduce network latency and overhead.
+ Use an instance type that has more resources allocated to it. Alternately, assign more instances to the endpoint to help balance the load.
+ Reduce the number of records inside a single `InvokeEndpoint` request.
+ Reduce the number of records in the baseline data.
+ Use a smaller `NumberOfSamples` value to reduce the size of the synthetic dataset. For more information about how the number of samples affects your synthetic dataset, see [Synthetic dataset](clarify-online-explainability-create-endpoint-synthetic.md).

# Fine-tune models with adapter inference components
<a name="realtime-endpoints-adapt"></a>

With Amazon SageMaker AI, you can host pre-trained foundation models without needing to create your own models from scratch. However, to tailor a general-purpose foundation model for the unique needs of your business, you must create a fine-tuned version of it. One cost-effective fine-tuning technique is Low-Rank Adaptation (LoRA). The principle behind LoRA is that only a small part of a large foundation model needs updating to adapt it to new tasks or domains. A LoRA adapter augments the inference from a base foundation model with just a few extra adapter layers.

If you host your base foundation model by using a SageMaker AI inference component, you can fine-tune that base model with LoRA adapters by creating *adapter inference components*. When you create an adapter inference component, you specify the following:
+ The *base inference component* that is to contain the adapter inference component. The base inference component contains the foundation model that you want to adapt. The adapter inference component uses the compute resources that you assigned to the base inference component.
+ The location where you've stored the LoRA adapter in Amazon S3.

After you create the adapter inference component, you can invoke it directly. When you do, SageMaker AI combines the adapter with the base model to augment the generated response.

**Before you begin**

Before you can create an adapter inference component, you must meet the following requirements: 
+ You have a base inference component that contains the foundation model to adapt. You've deployed this inference component to a SageMaker AI endpoint. 

  For more information about deploying inference components to endpoints, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
+ You have a LoRA adapter model, and you've stored the model artifacts as a `tar.gz` file in Amazon S3. You specify the S3 URI of the artifacts when you create the adapter inference component.
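For example, you might package and upload the adapter artifacts like this. This is a sketch: the artifact file names and the S3 bucket are hypothetical placeholders, so substitute the files your adapter actually produces:

```shell
# Package hypothetical adapter artifacts into a tar.gz file. The archive
# contents are rooted at the top level (-C), as SageMaker AI expects.
mkdir -p lora_adapter
printf 'example' > lora_adapter/adapter_model.bin
printf '{}' > lora_adapter/adapter_config.json
tar -czf adapter.tar.gz -C lora_adapter .
tar -tzf adapter.tar.gz

# Upload to Amazon S3 (uncomment and set your own bucket and prefix):
# aws s3 cp adapter.tar.gz s3://amzn-s3-demo-bucket/lora/adapter.tar.gz
```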

The following examples use the SDK for Python (Boto3) to create and invoke an adapter inference component.

**Example `create_inference_component` call to create an adapter inference component**  
The following example creates an adapter inference component and assigns it to a base inference component:  

```
sm_client.create_inference_component(
    InferenceComponentName = adapter_ic_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": adapter_s3_uri
        },
    },
)
```
When you use this example in your own code, replace the placeholder values as follows:  
+ *adapter\_ic\_name* – A unique name for your adapter inference component.
+ *endpoint\_name* – The name of the endpoint that hosts the base inference component.
+ *base\_inference\_component\_name* – The name of the base inference component that contains the foundation model to adapt.
+ *adapter\_s3\_uri* – The S3 URI that locates the `tar.gz` file with your LoRA adapter artifacts.
You create an adapter inference component with code that is similar to the code for a normal inference component. One difference is that, for the `Specification` parameter, you omit the `ComputeResourceRequirements` key. When you invoke an adapter inference component, it is loaded by the base inference component. The adapter inference component uses the compute resources of the base inference component.  
For more information about creating and deploying inference components with the SDK for Python (Boto3), see [Deploy models with the Python SDKs](realtime-endpoints-deploy-models.md#deploy-models-python).

After you create an adapter inference component, you invoke it by specifying its name in an `invoke_endpoint` request.

**Example `invoke_endpoint` call to invoke an adapter inference component**  
The following example invokes an adapter inference component:  

```
response = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = adapter_ic_name,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response["Body"].read().decode("utf8"))["generated_text"]
```
When you use this example in your own code, replace the placeholder values as follows:  
+ *endpoint\_name* – The name of the endpoint that hosts the base and adapter inference components.
+ *adapter\_ic\_name* – The name of the adapter inference component.
+ *prompt* – The prompt for the inference request.
For more information about invoking inference components with the SDK for Python (Boto3), see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).

# Deploy models with Amazon SageMaker Serverless Inference
<a name="serverless-endpoints"></a>

Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to deploy and scale ML models without configuring or managing any of the underlying infrastructure. On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs. For more information about pricing for on-demand Serverless Inference, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

Optionally, you can also use Provisioned Concurrency with Serverless Inference. Serverless Inference with Provisioned Concurrency is a cost-effective option when you have predictable bursts in your traffic. Provisioned Concurrency allows you to deploy models on serverless endpoints with predictable performance and high scalability by keeping your endpoints warm. SageMaker AI ensures that, for the amount of Provisioned Concurrency that you allocate, the compute resources are initialized and ready to respond within milliseconds. For Serverless Inference with Provisioned Concurrency, you pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. You also pay for Provisioned Concurrency usage, based on the memory configured, duration provisioned, and the amount of concurrency enabled. For more information about pricing for Serverless Inference with Provisioned Concurrency, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can integrate Serverless Inference with your MLOps Pipelines to streamline your ML workflow, and you can use a serverless endpoint to host a model registered with [Model Registry](model-registry.md).

Serverless Inference is generally available in 21 AWS Regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Osaka), Asia Pacific (Singapore), Asia Pacific (Sydney), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), Europe (Milan), Middle East (Bahrain), South America (São Paulo). For more information about Amazon SageMaker AI regional availability, see the [AWS Regional Services List](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).

## How it works
<a name="serverless-endpoints-how-it-works"></a>

The following diagram shows the workflow of on-demand Serverless Inference and the benefits of using a serverless endpoint.

![\[Diagram showing the Serverless Inference workflow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-how-it-works.png)


When you create an on-demand serverless endpoint, SageMaker AI provisions and manages the compute resources for you. Then, you can make inference requests to the endpoint and receive model predictions in response. SageMaker AI scales the compute resources up and down as needed to handle your request traffic, and you only pay for what you use.

For Provisioned Concurrency, Serverless Inference also integrates with Application Auto Scaling, so that you can manage Provisioned Concurrency based on a target metric or on a schedule. For more information, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

The following sections provide additional details about Serverless Inference and how it works.

**Topics**
+ [Container support](#serverless-endpoints-how-it-works-containers)
+ [Memory size](#serverless-endpoints-how-it-works-memory)
+ [Concurrent invocations](#serverless-endpoints-how-it-works-concurrency)
+ [Minimizing cold starts](#serverless-endpoints-how-it-works-cold-starts)
+ [Feature exclusions](#serverless-endpoints-how-it-works-exclusions)

### Container support
<a name="serverless-endpoints-how-it-works-containers"></a>

For your endpoint container, you can choose either a SageMaker AI-provided container or bring your own. SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a list of available SageMaker images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). If you are bringing your own container, you must modify it to work with SageMaker AI. For more information about bringing your own container, see [Adapt your own inference container for Amazon SageMaker AI](adapt-inference-container.md).

The maximum size of the container image you can use is 10 GB. For serverless endpoints, we recommend creating only one worker in the container and only loading one copy of the model. Note that this is unlike real-time endpoints, where some SageMaker AI containers may create a worker for each vCPU to process inference requests and load the model in each worker.

If you already have a container for a real-time endpoint, you can use the same container for your serverless endpoint, though some capabilities are excluded. To learn more about the container capabilities that are not supported in Serverless Inference, see [Feature exclusions](#serverless-endpoints-how-it-works-exclusions). If you choose to use the same container, SageMaker AI escrows (retains) a copy of your container image until you delete all endpoints that use the image. SageMaker AI encrypts the copied image at rest with a SageMaker AI-owned AWS KMS key.

### Memory size
<a name="serverless-endpoints-how-it-works-memory"></a>

Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. Choose your endpoint’s memory size according to your model size. Generally, the memory size should be at least as large as your model size. You may need to benchmark in order to choose the right memory selection for your model based on your latency SLAs. For a step by step guide to benchmark, see [ Introducing the Amazon SageMaker Serverless Inference Benchmarking Toolkit](https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-sagemaker-serverless-inference-benchmarking-toolkit/). The memory size increments have different pricing; see the [Amazon SageMaker AI pricing page](https://aws.amazon.com/sagemaker/pricing/) for more information.
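As a quick sanity check, the valid combinations described on this page can be captured in a small helper. The function and its error messages are ours; the limits come from this page and may change, so verify them against the current service quotas:

```python
# Documented limits for serverless endpoints (assumed from this page).
VALID_MEMORY_MB = {1024, 2048, 3072, 4096, 5120, 6144}
MAX_ENDPOINT_CONCURRENCY = 200

def serverless_config(memory_size_in_mb, max_concurrency):
    """Build a ServerlessConfig dict after validating against documented limits."""
    if memory_size_in_mb not in VALID_MEMORY_MB:
        raise ValueError(f"memory must be one of {sorted(VALID_MEMORY_MB)} MB")
    if not 1 <= max_concurrency <= MAX_ENDPOINT_CONCURRENCY:
        raise ValueError(f"max_concurrency must be 1-{MAX_ENDPOINT_CONCURRENCY}")
    return {"MemorySizeInMB": memory_size_in_mb, "MaxConcurrency": max_concurrency}

print(serverless_config(2048, 20))
# {'MemorySizeInMB': 2048, 'MaxConcurrency': 20}
```

The returned dict has the shape of the `ServerlessConfig` member that `create_endpoint_config` accepts in a production variant.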

Regardless of the memory size you choose, your serverless endpoint has 5 GB of ephemeral disk storage available. For help with container permissions issues when working with storage, see [Troubleshooting](serverless-endpoints-troubleshooting.md).

### Concurrent invocations
<a name="serverless-endpoints-how-it-works-concurrency"></a>

On-demand Serverless Inference manages predefined scaling policies and quotas for the capacity of your endpoint. Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently.

The total concurrency that you can share between all serverless endpoints in your account depends on your region:
+ For the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions, the total concurrency you can share between all serverless endpoints per Region in your account is 1000.
+ For the US West (N. California), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Canada (Central), Europe (London), Europe (Milan), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo) Regions, the total concurrency per Region in your account is 500.

You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoints you can host in a Region is 50. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all of the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled.

**Note**  
Provisioned Concurrency that you assign to a serverless endpoint should always be less than or equal to the maximum concurrency that you assigned to that endpoint.

To learn how to set the maximum concurrency for your endpoint, see [Create an endpoint configuration](serverless-endpoints-create-config.md). For more information about quotas and limits, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) in the *AWS General Reference*. To request a service limit increase, contact [AWS Support](https://console.aws.amazon.com/support). For instructions on how to request a service limit increase, see [Supported Regions and Quotas](regions-quotas.md).
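The per-endpoint limits above can be checked locally before you create or update an endpoint configuration. The following is a minimal sketch with a hypothetical function name, not a SageMaker AI API:

```
# Hypothetical client-side check based on the limits above:
# MaxConcurrency for a single endpoint is 1-200, and Provisioned
# Concurrency must not exceed MaxConcurrency.
def validate_concurrency(max_concurrency, provisioned_concurrency=None):
    if not 1 <= max_concurrency <= 200:
        raise ValueError("MaxConcurrency must be between 1 and 200")
    if provisioned_concurrency is not None:
        if not 1 <= provisioned_concurrency <= max_concurrency:
            raise ValueError(
                "ProvisionedConcurrency must be between 1 and MaxConcurrency"
            )

validate_concurrency(20, 10)  # passes silently
```

Note that the account-level Region totals (500 or 1000 shared invocations) are enforced by the service across all of your endpoints and cannot be verified from a single configuration.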

### Minimizing cold starts
<a name="serverless-endpoints-how-it-works-cold-starts"></a>

If your on-demand Serverless Inference endpoint does not receive traffic for a while and then your endpoint suddenly receives new requests, it can take some time for your endpoint to spin up the compute resources to process the requests. This is called a *cold start*. Since serverless endpoints provision compute resources on demand, your endpoint may experience cold starts. A cold start can also occur if your concurrent requests exceed the current concurrent request usage. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.

To monitor how long your cold start time is, you can use the Amazon CloudWatch metric `OverheadLatency` to monitor your serverless endpoint. This metric tracks the time it takes to launch new compute resources for your endpoint. To learn more about using CloudWatch metrics with serverless endpoints, see [Alarms and logs for tracking metrics from serverless endpoints](serverless-endpoints-monitoring.md).
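As a sketch, the following shows request parameters you might pass to CloudWatch's `GetMetricStatistics` to review `OverheadLatency` over the last hour. The endpoint and variant names are placeholders; adjust the period and statistics to your needs:

```
from datetime import datetime, timedelta, timezone

# Parameters for CloudWatch GetMetricStatistics. SageMaker AI endpoint
# invocation metrics live in the AWS/SageMaker namespace, keyed by
# EndpointName and VariantName.
end = datetime.now(timezone.utc)
params = {
    "Namespace": "AWS/SageMaker",
    "MetricName": "OverheadLatency",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "<your-endpoint-name>"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 300,  # 5-minute buckets
    "Statistics": ["Average", "Maximum"],
}
# To run the query:
# boto3.client("cloudwatch").get_metric_statistics(**params)
```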

You can minimize cold starts by using Provisioned Concurrency. SageMaker AI keeps the endpoint warm and ready to respond in milliseconds for the amount of Provisioned Concurrency that you allocated.

### Feature exclusions
<a name="serverless-endpoints-how-it-works-exclusions"></a>

Some of the features currently available for SageMaker AI Real-time Inference are not supported for Serverless Inference, including GPUs, AWS Marketplace model packages, private Docker registries, Multi-Model Endpoints, VPC configuration, network isolation, data capture, multiple production variants, Model Monitor, and inference pipelines.

You cannot convert your instance-based, real-time endpoint to a serverless endpoint. If you try to update your real-time endpoint to serverless, you receive a `ValidationError` message. You can convert a serverless endpoint to real-time, but once you make the update, you cannot roll it back to serverless.

## Getting started
<a name="serverless-endpoints-get-started"></a>

You can create, update, describe, and delete a serverless endpoint using the SageMaker AI console, the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), and the AWS CLI. You can invoke your endpoint using the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), and the AWS CLI. For serverless endpoints with Provisioned Concurrency, you can use Application Auto Scaling to auto scale Provisioned Concurrency based on a target metric or a schedule. For more information about how to set up and use a serverless endpoint, read the guide [Serverless endpoint operations](serverless-endpoints-create-invoke-update-delete.md). For more information on auto scaling serverless endpoints with Provisioned Concurrency, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

**Note**  
 Application Auto Scaling for Serverless Inference with Provisioned Concurrency is currently not supported on AWS CloudFormation. 

### Example notebooks and blogs
<a name="serverless-endpoints-get-started-nbs"></a>

For Jupyter notebook examples that show end-to-end serverless endpoint workflows, see the [Serverless Inference example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/serverless-inference).

# Serverless endpoint operations
<a name="serverless-endpoints-create-invoke-update-delete"></a>

Unlike other SageMaker AI real-time endpoints, Serverless Inference manages compute resources for you, reducing complexity so you can focus on your ML model instead of on managing infrastructure. The following guide highlights the key capabilities of serverless endpoints: how to create, invoke, update, describe, or delete an endpoint. You can use the SageMaker AI console, the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), or the AWS CLI to manage your serverless endpoints.

**Topics**
+ [Complete the prerequisites](serverless-endpoints-prerequisites.md)
+ [Serverless endpoint creation](serverless-endpoints-create.md)
+ [Invoke a serverless endpoint](serverless-endpoints-invoke.md)
+ [Update a serverless endpoint](serverless-endpoints-update.md)
+ [Describe a serverless endpoint](serverless-endpoints-describe.md)
+ [Delete a serverless endpoint](serverless-endpoints-delete.md)

# Complete the prerequisites
<a name="serverless-endpoints-prerequisites"></a>

The following topic describes the prerequisites that you must complete before creating a serverless endpoint. These prerequisites include properly storing your model artifacts, configuring an AWS IAM role with the correct permissions, and selecting a container image.

**To complete the prerequisites**

1. **Set up an AWS account.** You first need an AWS account and an AWS Identity and Access Management administrator user. For instructions on how to set up an AWS account, see [How do I create and activate a new AWS account?](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). For instructions on how to secure your account with an IAM administrator user, see [Creating your first IAM admin user and user group](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html) in the *IAM User Guide*.

1. **Create an Amazon S3 bucket.** You use an Amazon S3 bucket to store your model artifacts. To learn how to create a bucket, see [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) in the *Amazon S3 User Guide*.

1. **Upload your model artifacts to your S3 bucket.** For instructions on how to upload your model to your bucket, see [Upload an object to your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) in the *Amazon S3 User Guide*.

1. **Create an IAM role for Amazon SageMaker AI.** Amazon SageMaker AI needs access to the S3 bucket that stores your model. Create an IAM role with a policy that gives SageMaker AI read access to your bucket. The following procedure shows how to create a role in the console, but you can also use the [CreateRole](https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateRole.html) API from the *IAM User Guide*. For information on giving your role more granular permissions based on your use case, see [How to use SageMaker AI execution roles](sagemaker-roles.md#sagemaker-roles-createmodel-perms).

   1. Sign in to the [IAM console](https://console.aws.amazon.com/iam/).

   1. In the navigation tab, choose **Roles**.

   1. Choose **Create Role**.

   1. For **Select type of trusted entity**, choose **AWS service** and then choose **SageMaker AI**.

   1. Choose **Next: Permissions** and then choose **Next: Tags**.

   1. (Optional) Add tags as key-value pairs if you want to have metadata for the role.

   1. Choose **Next: Review**.

   1. For **Role name**, enter a name for the new role that is unique within your AWS account. You cannot edit the role name after creating the role.

   1. (Optional) For **Role description**, enter a description for the new role.

   1. Choose **Create role**.

1. **Attach S3 bucket permissions to your SageMaker AI role.** After creating an IAM role, attach a policy that gives SageMaker AI permission to access the S3 bucket containing your model artifacts.

   1. In the IAM console navigation tab, choose **Roles**.

   1. From the list of roles, search for the role you created in the previous step by name.

   1. Choose your role, and then choose **Attach policies**.

   1. For **Attach permissions**, choose **Create policy**.

   1. In the **Create policy** view, select the **JSON** tab.

   1. Add the following policy statement into the JSON editor. Make sure to replace `<your-bucket-name>` with the name of the S3 bucket that stores your model artifacts. If you want to restrict the access to a specific folder or file in your bucket, you can also specify the Amazon S3 folder path, for example, `<your-bucket-name>/<model-folder>`.

------
#### [ JSON ]

****  

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": "s3:GetObject",
                  "Resource": "arn:aws:s3:::<your-bucket-name>/*"
              }
          ]
      }
      ```

------

   1. Choose **Next: Tags**.

   1. (Optional) Add tags in key-value pairs to the policy.

   1. Choose **Next: Review**.

   1. For **Name**, enter a name for the new policy.

   1. (Optional) Add a **Description** for the policy.

   1. Choose **Create policy**.

   1. After creating the policy, return to **Roles** in the [IAM console](https://console.aws.amazon.com/iam/) and select your SageMaker AI role.

   1. Choose **Attach policies**.

   1. For **Attach permissions**, search for the policy you created by name. Select it and choose **Attach policy**.

1. **Select a prebuilt Docker container image or bring your own.** The container you choose serves inference on your endpoint. SageMaker AI provides containers for built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

   If none of the existing SageMaker AI containers meet your needs, you may need to create your own Docker container. For information about how to create your Docker image and make it compatible with SageMaker AI, see [Containers with custom inference code](your-algorithms-inference-main.md). To use your container with a serverless endpoint, the container image must reside in an Amazon ECR repository within the same AWS account that creates the endpoint.

1. **(Optional) Register your model with Model Registry.** [SageMaker Model Registry](model-registry.md) helps you catalog and manage versions of your models for use in ML pipelines. For more information about registering a version of your model, see [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md). For an example of a Model Registry and Serverless Inference workflow, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/serverless-model-registry.ipynb).

1. **(Optional) Bring an AWS KMS key.** When setting up a serverless endpoint, you have the option to specify a KMS key that SageMaker AI uses to encrypt your Amazon ECR image. Note that the key policy for the KMS key must grant access to the IAM role you specify when setting up your endpoint. To learn more about KMS keys, see the [AWS Key Management Service Developer Guide](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

# Serverless endpoint creation
<a name="serverless-endpoints-create"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To create a serverless endpoint, you can use the Amazon SageMaker AI console, the APIs, or the AWS CLI. You can create a serverless endpoint using a similar process as a [real-time endpoint](realtime-endpoints.md).

**Topics**
+ [Create a model](serverless-endpoints-create-model.md)
+ [Create an endpoint configuration](serverless-endpoints-create-config.md)
+ [Create an endpoint](serverless-endpoints-create-endpoint.md)

# Create a model
<a name="serverless-endpoints-create-model"></a>

To create your model, you must provide the location of your model artifacts and container image. You can also use a model version from [SageMaker Model Registry](model-registry.md). The examples in the following sections show you how to create a model using the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API, Model Registry, and the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

## To create a model (using Model Registry)
<a name="serverless-endpoints-create-model-registry"></a>

[Model Registry](model-registry.md) is a feature of SageMaker AI that helps you catalog and manage versions of your model for use in ML pipelines. To use Model Registry with Serverless Inference, you must first register a model version in a Model Registry model group. To learn how to register a model in Model Registry, follow the procedures in [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md).

The following example requires you to have the ARN of a registered model version and uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. For Serverless Inference, Model Registry is currently only supported by the AWS SDK for Python (Boto3). For the example, specify the following values:
+ For `model_name`, enter a name for the model.
+ For `sagemaker_role`, you can use the default SageMaker AI-created role or a customized SageMaker AI IAM role from Step 4 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section.
+ For `ModelPackageName`, specify the ARN for your model version, which must be registered to a model group in Model Registry.

```
#Setup
import boto3
import sagemaker
region = boto3.Session().region_name
client = boto3.client("sagemaker", region_name=region)

#Role to give SageMaker AI permission to access AWS services.
sagemaker_role = sagemaker.get_execution_role()

#Specify a name for the model
model_name = "<name-for-model>"

#Specify a Model Registry model version
container_list = [
    {
        "ModelPackageName": "<model-version-arn>"
    }
]

#Create the model
response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    Containers = container_list
)
```

## To create a model (using API)
<a name="serverless-endpoints-create-model-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. Specify the following values:
+ For `sagemaker_role`, you can use the default SageMaker AI-created role or a customized SageMaker AI IAM role from Step 4 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section.
+ For `model_url`, specify the Amazon S3 URI to your model.
+ For `container`, retrieve the container you want to use by its Amazon ECR path. This example uses a SageMaker AI-provided XGBoost container. If you have not selected a SageMaker AI container or brought your own, see Step 6 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section for more information.
+ For `model_name`, enter a name for the model.

```
#Setup
import boto3
import sagemaker
region = boto3.Session().region_name
client = boto3.client("sagemaker", region_name=region)

#Role to give SageMaker AI permission to access AWS services.
sagemaker_role = sagemaker.get_execution_role()

#Get model from S3
model_url = "s3://amzn-s3-demo-bucket/models/model.tar.gz"

#Get container image (prebuilt example)
from sagemaker import image_uris
container = image_uris.retrieve("xgboost", region, "0.90-1")

#Create model
model_name = "<name-for-model>"

response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    Containers = [{
        "Image": container,
        "Mode": "SingleModel",
        "ModelDataUrl": model_url,
    }]
)
```

## To create a model (using the console)
<a name="serverless-endpoints-create-model-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Models**.

1. Choose **Create model**.

1. For **Model name**, enter a name for the model that is unique to your account and AWS Region.

1. For **IAM role**, either select an IAM role you have already created (see [Complete the prerequisites](serverless-endpoints-prerequisites.md)) or allow SageMaker AI to create one for you.

1. In **Container definition 1**, for **Container input options**, select **Provide model artifacts and input location**.

1. For **Provide model artifacts and inference image options**, select **Use a single model**.

1. For **Location of inference code image**, enter an Amazon ECR path to a container. The image must either be a SageMaker AI-provided first-party image (for example, TensorFlow or XGBoost) or an image that resides in an Amazon ECR repository within the same account in which you are creating the endpoint. If you do not have a container, go back to Step 6 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section for more information.

1. For **Location of model artifacts**, enter the Amazon S3 URI to your ML model. For example, `s3://amzn-s3-demo-bucket/models/model.tar.gz`.

1. (Optional) For **Tags**, add key-value pairs to create metadata for your model.

1. Choose **Create model**.

# Create an endpoint configuration
<a name="serverless-endpoints-create-config"></a>

After you create a model, create an endpoint configuration. You can then deploy your model using the specifications in your endpoint configuration. In the configuration, you specify whether you want a real-time or serverless endpoint. To create a serverless endpoint configuration, you can use the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API, or the AWS CLI. The API and console approaches are outlined in the following sections.

## To create an endpoint configuration (using API)
<a name="serverless-endpoints-create-config-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. Specify the following values:
+ For `EndpointConfigName`, choose a name for the endpoint configuration. The name should be unique within your account in a Region.
+ (Optional) For `KmsKeyId`, use the key ID, key ARN, alias name, or alias ARN for an AWS KMS key that you want to use. SageMaker AI uses this key to encrypt your Amazon ECR image.
+ For `ModelName`, use the name of the model you want to deploy. It should be the same model that you used in the [Create a model](serverless-endpoints-create-model.md) step.
+ For `ServerlessConfig`:
  + Set `MemorySizeInMB` to `2048`. For this example, we set the memory size to 2048 MB, but you can choose any of the following values for your memory size: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. 
  + Set `MaxConcurrency` to `20`. For this example, we set the maximum concurrency to 20. The maximum number of concurrent invocations you can set for a serverless endpoint is 200, and the minimum value you can choose is 1.
  + (Optional) To use Provisioned Concurrency, set `ProvisionedConcurrency` to `10`. For this example, we set the Provisioned Concurrency to 10. The `ProvisionedConcurrency` number for a serverless endpoint must be less than or equal to the `MaxConcurrency` number. Leave this field empty if you want an on-demand Serverless Inference endpoint. You can dynamically scale Provisioned Concurrency; for more information, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

```
response = client.create_endpoint_config(
   EndpointConfigName="<your-endpoint-configuration>",
   KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/143ef68f-76fd-45e3-abba-ed28fc8d3d5e",
   ProductionVariants=[
        {
            "ModelName": "<your-model-name>",
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 20,
                "ProvisionedConcurrency": 10,
            }
        } 
    ]
)
```

## To create an endpoint configuration (using the console)
<a name="serverless-endpoints-create-config-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoint configurations**.

1. Choose **Create endpoint configuration**.

1. For **Endpoint configuration name**, enter a name that is unique within your account in a Region.

1. For **Type of endpoint**, select **Serverless**.  
![\[Screenshot of the endpoint type option in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-endpoint-config.png)

1. For **Production variants**, choose **Add model**.

1. Under **Add model**, select the model you want to use from the list of models and then choose **Save**.

1. After adding your model, under **Actions**, choose **Edit**.

1. For **Memory size**, choose the memory size you want in GB.  
![\[Screenshot of the memory size option in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-endpoint-config-2.png)

1. For **Max Concurrency**, enter your desired maximum concurrent invocations for the endpoint. The maximum value you can enter is 200 and the minimum is 1.

1. (Optional) To use Provisioned Concurrency, enter the desired number of concurrent invocations in the **Provisioned Concurrency setting** field. The number of provisioned concurrent invocations must be less than or equal to the number of maximum concurrent invocations.

1. Choose **Save**.

1. (Optional) For **Tags**, enter key-value pairs if you want to create metadata for your endpoint configuration.

1. Choose **Create endpoint configuration**.

# Create an endpoint
<a name="serverless-endpoints-create-endpoint"></a>

To create a serverless endpoint, you can use the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API, or the AWS CLI. The API and console approaches are outlined in the following sections. Once you create your endpoint, it can take a few minutes for the endpoint to become available.

## To create an endpoint (using API)
<a name="serverless-endpoints-create-endpoint-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API. Specify the following values:
+ For `EndpointName`, enter a name for the endpoint that is unique within a Region in your account.
+ For `EndpointConfigName`, use the name of the endpoint configuration that you created in the previous section.

```
response = client.create_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-endpoint-config>"
)
```
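Because endpoint creation can take a few minutes, you may want to poll until the endpoint reports `InService`. The following is a sketch with a hypothetical helper that accepts any zero-argument callable returning a `DescribeEndpoint` response, which keeps the polling logic easy to test; in practice, you can also use the built-in Boto3 waiter, `client.get_waiter("endpoint_in_service").wait(EndpointName="<your-endpoint-name>")`.

```
import time

def wait_for_in_service(describe_fn, poll_seconds=15, max_polls=80):
    """Poll until the endpoint reports InService, or fail fast.

    describe_fn is any zero-argument callable that returns a
    DescribeEndpoint response dict, for example:
        lambda: client.describe_endpoint(EndpointName="<your-endpoint-name>")
    """
    for _ in range(max_polls):
        status = describe_fn()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not become InService in time")
```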

## To create an endpoint (using the console)
<a name="serverless-endpoints-create-endpoint-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. Choose **Create endpoint**.

1. For **Endpoint name**, enter a name that is unique within a Region in your account.

1. For **Attach endpoint configuration**, select **Use an existing endpoint configuration**.

1. For **Endpoint configuration**, select the name of the endpoint configuration you created in the previous section and then choose **Select endpoint configuration**.

1. (Optional) For **Tags**, enter key-value pairs if you want to create metadata for your endpoint.

1. Choose **Create endpoint**.  
![\[Screenshot of the create and configure endpoint page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-create.png)

# Invoke a serverless endpoint
<a name="serverless-endpoints-invoke"></a>

In order to perform inference using a serverless endpoint, you must send an HTTP request to the endpoint. You can use the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API or the AWS CLI, which make a `POST` request to invoke your endpoint. The maximum request and response payload size for serverless invocations is 4 MB. For serverless endpoints:
+ The model must download and the server must respond successfully to `/ping` within 3 minutes.
+ The timeout for the container to respond to inference requests to `/invocations` is 1 minute.

## To invoke an endpoint
<a name="serverless-endpoints-invoke-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API. Note that unlike the other API calls in this guide, for `InvokeEndpoint`, you must use the SageMaker Runtime client. Specify the following values:
+ For `endpoint_name`, use the name of the in-service serverless endpoint you want to invoke.
+ For `content_type`, specify the MIME type of your input data in the request body (for example, `application/json`).
+ For `payload`, use your request payload for inference. Your payload should be in bytes or a file-like object.

```
runtime = boto3.client("sagemaker-runtime")

endpoint_name = "<your-endpoint-name>"
content_type = "<request-mime-type>"
payload = <your-request-body>

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload
)
```
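For example, if your container accepts JSON, you might serialize the request body and parse the response as follows. The `instances` payload shape is an assumption for illustration; use whatever format your container expects:

```
import json

# Serialize a JSON request body to bytes; the "instances" key is a
# placeholder shape, not a requirement of InvokeEndpoint.
payload = json.dumps({"instances": [[1.0, 2.0, 3.0]]}).encode("utf-8")

# response = runtime.invoke_endpoint(
#     EndpointName="<your-endpoint-name>",
#     ContentType="application/json",
#     Body=payload,
# )
# result = json.loads(response["Body"].read())
```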

# Update a serverless endpoint
<a name="serverless-endpoints-update"></a>

Before updating your endpoint, create a new endpoint configuration or use an existing endpoint configuration. The endpoint configuration is where you specify the changes for your update. Then, you can update your endpoint with the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, or the AWS CLI. The process for updating a serverless endpoint is the same as the process for updating a [real-time endpoint](realtime-endpoints.md). Note that when updating your endpoint, you can experience cold starts when making requests to the endpoint because SageMaker AI must re-initialize your container and model.

You may want to update an on-demand serverless endpoint to a serverless endpoint with provisioned concurrency or adjust the Provisioned Concurrency value for an existing serverless endpoint with provisioned concurrency. For both cases, you will have to create a new serverless endpoint configuration with the desired value for Provisioned Concurrency, and apply `UpdateEndpoint` to the existing serverless endpoint. For more information on creating a new serverless endpoint configuration with Provisioned Concurrency, see [Create an endpoint configuration](serverless-endpoints-create-config.md).

If you want to remove Provisioned Concurrency from a serverless endpoint, you will have to create a new endpoint configuration without specifying any value for Provisioned Concurrency, and then apply `UpdateEndpoint` to the endpoint.
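As a sketch, the new configuration's `ServerlessConfig` simply omits the `ProvisionedConcurrency` field. All names below are placeholders:

```
# Request parameters for a new endpoint configuration that removes
# Provisioned Concurrency: ServerlessConfig has no
# ProvisionedConcurrency key, so the endpoint becomes on-demand.
config_request = {
    "EndpointConfigName": "<your-on-demand-config>",
    "ProductionVariants": [
        {
            "ModelName": "<your-model-name>",
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 20,
            },
        }
    ],
}
# client.create_endpoint_config(**config_request)
# client.update_endpoint(
#     EndpointName="<your-endpoint-name>",
#     EndpointConfigName="<your-on-demand-config>",
# )
```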

**Note**  
Updating a real-time inference endpoint to either an on-demand serverless endpoint or a serverless endpoint with Provisioned Concurrency is currently not supported.

## Update the endpoint
<a name="serverless-endpoints-update-endpoint"></a>

After creating a new serverless endpoint configuration, you can use the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) or the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/) to update an existing serverless endpoint. Examples of how to update your endpoint using the AWS SDK for Python (Boto3) and the SageMaker AI console are outlined in the following sections.

### To update the endpoint (using Boto3)
<a name="serverless-endpoints-update-endpoint-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [update_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_endpoint.html) method. Specify at least the following parameters when calling the method:
+ For `EndpointName`, use the name of the endpoint you’re updating.
+ For `EndpointConfigName`, use the name of the endpoint configuration that you want to use for the update.

```
response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<new-endpoint-config>",
)
```

### To update the endpoint (using the console)
<a name="serverless-endpoints-update-endpoint-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, select the endpoint you want to update.

1. Choose **Change** in the **Endpoint configuration settings** section.

1. For **Change the Endpoint configuration**, choose **Use an existing endpoint configuration**.

1. From the list of endpoint configurations, select the one you want to use for your update.

1. Choose **Select endpoint configuration**.

1. Choose **Update endpoint**.

# Describe a serverless endpoint
<a name="serverless-endpoints-describe"></a>

You might want to retrieve information about your endpoint, including details such as the endpoint’s ARN, current status, deployment configuration, and failure reasons. You can find information about your endpoint using the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API, or the AWS CLI.

## To describe an endpoint (using API)
<a name="serverless-endpoints-describe-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#id309) to call the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. For `EndpointName`, use the name of the endpoint you want to check.

```
response = client.describe_endpoint(
    EndpointName="<your-endpoint-name>",
)
```
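
When scripting against the API, you may want to wait until the endpoint finishes creating or updating before sending traffic. The following is a minimal polling sketch using Boto3; the endpoint name is a placeholder, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
import time

def wait_until_stable(endpoint_name, client=None, poll_seconds=15):
    """Poll DescribeEndpoint until the endpoint leaves the Creating or
    Updating state, then return the final EndpointStatus."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("sagemaker")
    while True:
        status = client.describe_endpoint(
            EndpointName=endpoint_name
        )["EndpointStatus"]
        if status not in ("Creating", "Updating"):
            return status  # for example, "InService" or "Failed"
        time.sleep(poll_seconds)
```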

## To describe an endpoint (using the console)
<a name="serverless-endpoints-describe-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, choose the endpoint you want to check.

The endpoint page contains the information about your endpoint.

# Delete a serverless endpoint
<a name="serverless-endpoints-delete"></a>

You can delete your serverless endpoint using the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API, or the AWS CLI. The following examples show you how to delete your endpoint through the API and the SageMaker AI console.

## To delete an endpoint (using API)
<a name="serverless-endpoints-delete-api"></a>

The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API. For `EndpointName`, use the name of the serverless endpoint you want to delete.

```
response = client.delete_endpoint(
    EndpointName="<your-endpoint-name>",
)
```

## To delete an endpoint (using the console)
<a name="serverless-endpoints-delete-console"></a>

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, select the endpoint you want to delete.

1. Choose the **Actions** drop-down list, and then choose **Delete**.

1. When prompted again, choose **Delete**.

Your endpoint should now begin the deletion process.

# Alarms and logs for tracking metrics from serverless endpoints
<a name="serverless-endpoints-monitoring"></a>

To monitor your serverless endpoint, you can use Amazon CloudWatch alarms. CloudWatch is a service that collects metrics in real time from your AWS applications and resources. An alarm watches metrics as they are collected and gives you the ability to pre-specify a threshold and the actions to take if that threshold is breached. For example, your CloudWatch alarm can send you a notification if your endpoint breaches an error threshold. By setting up CloudWatch alarms, you gain visibility into the performance and functionality of your endpoint. For more information about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.
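
As an illustration, a Boto3 sketch of such an alarm might look like the following. The endpoint name, variant name, and SNS topic ARN are placeholders, the threshold values are illustrative, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def create_5xx_alarm(endpoint_name, variant_name, sns_topic_arn, client=None):
    """Alarm when the endpoint reports any 5XX invocation errors
    within a five-minute window."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("cloudwatch")
    return client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-5xx-errors",
        Namespace="AWS/SageMaker",
        MetricName="Invocation5XXErrors",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        Statistic="Sum",
        Period=300,                 # five-minute window
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )
```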

## Monitoring with CloudWatch
<a name="serverless-endpoints-monitoring-metrics"></a>

The metrics in this section are the complete set published for serverless endpoints; SageMaker AI does not publish any other metrics for them. For more information about these metrics, see [Monitor Amazon SageMaker AI with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

### Common endpoint metrics
<a name="serverless-endpoints-monitoring-metrics-common"></a>

These CloudWatch metrics are the same as the metrics published for real-time endpoints.

The `OverheadLatency` metric tracks all additional latency added by SageMaker AI, including the cold start time for launching new compute resources for your serverless endpoint. `OverheadLatency` is generally significantly lower for serverless endpoints with Provisioned Concurrency than for on-demand serverless endpoints.

Serverless endpoints also publish the `Invocation4XXErrors`, `Invocation5XXErrors`, `Invocations`, `ModelLatency`, `ModelSetupTime`, and `MemoryUtilization` metrics. To learn more about these metrics, see [SageMaker AI endpoint invocation metrics](monitoring-cloudwatch.md#cloudwatch-metrics-endpoint-invocation).
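
As a sketch of retrieving one of these metrics, the following Boto3 function fetches recent `ModelSetupTime` datapoints, which reflect cold-start latency. The names are placeholders, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
from datetime import datetime, timedelta, timezone

def recent_model_setup_time(endpoint_name, variant_name, client=None, hours=1):
    """Fetch average ModelSetupTime datapoints for the last hour."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    return client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelSetupTime",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )["Datapoints"]
```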

### Common serverless endpoint metrics
<a name="serverless-endpoints-monitoring-metrics-serverless"></a>

These CloudWatch metrics are published for both on-demand serverless endpoints and serverless endpoints with Provisioned Concurrency.


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| ServerlessConcurrentExecutionsUtilization | The number of concurrent executions divided by the maximum concurrency. | Units: None. Valid statistics: Average, Max, Min | 

### Serverless endpoint with Provisioned Concurrency metrics
<a name="serverless-endpoints-monitoring-metrics-serverless-pc"></a>

These CloudWatch metrics are published for serverless endpoints with Provisioned Concurrency.


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| ServerlessProvisionedConcurrencyExecutions | The number of concurrent executions handled by the endpoint. | Units: Count. Valid statistics: Average, Max, Min | 
| ServerlessProvisionedConcurrencyUtilization | The number of concurrent executions divided by the allocated Provisioned Concurrency. | Units: None. Valid statistics: Average, Max, Min | 
| ServerlessProvisionedConcurrencyInvocations | The number of InvokeEndpoint requests handled by Provisioned Concurrency. | Units: Count. Valid statistics: Average, Max, Min | 
| ServerlessProvisionedConcurrencySpilloverInvocations | The number of InvokeEndpoint requests not handled by Provisioned Concurrency and instead handled by on-demand Serverless Inference. | Units: Count. Valid statistics: Average, Max, Min | 

## Logs
<a name="serverless-endpoints-monitoring-logs"></a>

If you want to monitor the logs from your endpoint for debugging or progress analysis, you can use Amazon CloudWatch Logs. The SageMaker AI-provided log group that you can use for serverless endpoints is `/aws/sagemaker/Endpoints/[EndpointName]`. For more information about using CloudWatch Logs in SageMaker AI, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md). To learn more about CloudWatch Logs, see [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) in the *Amazon CloudWatch Logs User Guide*.
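
As a sketch, reading the most recent events from that log group with Boto3 might look like the following. The endpoint name is a placeholder, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def latest_endpoint_log_messages(endpoint_name, client=None, limit=20):
    """Return the newest log messages from the endpoint's log group."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("logs")
    group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    # Find the stream with the most recent activity.
    streams = client.describe_log_streams(
        logGroupName=group, orderBy="LastEventTime", descending=True, limit=1
    )["logStreams"]
    if not streams:
        return []
    events = client.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
    )["events"]
    return [event["message"] for event in events]
```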

# Automatically scale Provisioned Concurrency for a serverless endpoint
<a name="serverless-endpoints-autoscale"></a>

 Amazon SageMaker AI automatically scales in or out on-demand serverless endpoints. For serverless endpoints with Provisioned Concurrency you can use Application Auto Scaling to scale up or down the Provisioned Concurrency based on your traffic profile, thus optimizing costs. 

The following are the steps required to autoscale Provisioned Concurrency on serverless endpoints: 
+ [Register a model](#serverless-endpoints-autoscale-register)
+ [Define a scaling policy](#serverless-endpoints-autoscale-define)
+ [Apply a scaling policy](#serverless-endpoints-autoscale-apply)

 Before you can use autoscaling, you must have already deployed a model to a serverless endpoint with Provisioned Concurrency. Deployed models are referred to as [production variants](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). See [Create an endpoint configuration](serverless-endpoints-create-config.md) and [Create an endpoint](serverless-endpoints-create-endpoint.md) for more information about deploying a model to a serverless endpoint with Provisioned Concurrency. To specify the metrics and target values for a scaling policy, you must configure a scaling policy. For more information on how to define a scaling policy, see [Define a scaling policy](#serverless-endpoints-autoscale-define). After registering your model and defining a scaling policy, apply the scaling policy to the registered model. For information on how to apply the scaling policy, see [Apply a scaling policy](#serverless-endpoints-autoscale-apply). 

 For details on other prerequisites and components used with autoscaling, see the [Auto scaling prerequisites](endpoint-auto-scaling-prerequisites.md) section in the [SageMaker AI autoscaling documentation](endpoint-auto-scaling.md). 

## Register a model
<a name="serverless-endpoints-autoscale-register"></a>

To add autoscaling to a serverless endpoint with Provisioned Concurrency, you first must register your model (production variant) using the AWS CLI or the Application Auto Scaling API. 

### Register a model (AWS CLI)
<a name="serverless-endpoints-autoscale-register-cli"></a>

 To register your model, use the `register-scalable-target` AWS CLI command with the following parameters: 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--resource-id` – The resource identifier for the model (specifically the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--min-capacity` – The minimum Provisioned Concurrency for the model. Set `--min-capacity` to at least 1. It must be less than or equal to the value specified for `--max-capacity`. 
+  `--max-capacity` – The maximum Provisioned Concurrency that Application Auto Scaling can set. Set `--max-capacity` to at least 1. It must be greater than or equal to the value specified for `--min-capacity`. 

The following example shows how to register a model named `MyVariant` whose Provisioned Concurrency is dynamically scaled between 1 and 10: 

```
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --min-capacity 1 \
    --max-capacity 10
```

### Register a model (Application Auto Scaling API)
<a name="serverless-endpoints-autoscale-register-api"></a>

 To register your model, use the `RegisterScalableTarget` Application Auto Scaling API action with the following parameters: 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ResourceId` – The resource identifier for the model (specifically the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `MinCapacity` – The minimum Provisioned Concurrency for the model. Set `MinCapacity` to at least 1. It must be less than or equal to the value specified for `MaxCapacity`. 
+  `MaxCapacity` – The maximum Provisioned Concurrency that Application Auto Scaling can set. Set `MaxCapacity` to at least 1. It must be greater than or equal to the value specified for `MinCapacity`. 

The following example shows how to register a model named `MyVariant` whose Provisioned Concurrency is dynamically scaled between 1 and 10: 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.RegisterScalableTarget
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndPoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "MinCapacity": 1,
    "MaxCapacity": 10
}
```
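
If you are working in Python, the same registration can be sketched with Boto3; the resource ID and capacities mirror the example above, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def register_scalable_variant(resource_id, min_capacity=1, max_capacity=10,
                              client=None):
    """Register a production variant so Application Auto Scaling can
    adjust its Provisioned Concurrency between the given bounds."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("application-autoscaling")
    return client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,  # e.g. "endpoint/MyEndpoint/variant/MyVariant"
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
```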

## Define a scaling policy
<a name="serverless-endpoints-autoscale-define"></a>

 To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You can then use that text file when invoking the AWS CLI or the Application Auto Scaling API. To quickly define a target-tracking scaling policy for a serverless endpoint, use the `SageMakerVariantProvisionedConcurrencyUtilization` predefined metric. 

```
{
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": 
    {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
    "ScaleOutCooldown": 1,
    "ScaleInCooldown": 1
}
```

## Apply a scaling policy
<a name="serverless-endpoints-autoscale-apply"></a>

After registering your model, you can apply a scaling policy to your serverless endpoint with Provisioned Concurrency. See [Apply a target-tracking scaling policy](#serverless-endpoints-autoscale-apply-target) to apply a target-tracking scaling policy that you have defined. If the traffic flow to your serverless endpoint has a predictable routine, then instead of applying a target-tracking scaling policy, you might want to schedule scaling actions at specific times. For more information on scheduling scaling actions, see [Scheduled scaling](#serverless-endpoints-autoscale-apply-scheduled). 

### Apply a target-tracking scaling policy
<a name="serverless-endpoints-autoscale-apply-target"></a>

You can use the AWS Management Console, the AWS CLI, or the Application Auto Scaling API to apply a target-tracking scaling policy to your serverless endpoint with Provisioned Concurrency. 

#### Apply a target-tracking scaling policy (AWS CLI)
<a name="serverless-endpoints-autoscale-apply-target-cli"></a>

To apply a scaling policy to your model, use the `put-scaling-policy` AWS CLI command with the following parameters: 
+  `--policy-name` – The name of the scaling policy. 
+  `--policy-type` – Set this value to `TargetTrackingScaling`. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--target-tracking-scaling-policy-configuration` – The target-tracking scaling policy configuration to use for the model. 

 The following example shows how to apply a target-tracking scaling policy named `MyScalingPolicy` to a model named `MyVariant`. The policy configuration is saved in a file named `scaling-policy.json`. 

```
aws application-autoscaling put-scaling-policy \
    --policy-name MyScalingPolicy \
    --policy-type TargetTrackingScaling \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --target-tracking-scaling-policy-configuration file://[file-location]/scaling-policy.json
```

#### Apply a target-tracking scaling policy (Application Auto Scaling API)
<a name="serverless-endpoints-autoscale-apply-target-api"></a>

 To apply a scaling policy to your model, use the `PutScalingPolicy` Application Auto Scaling API action with the following parameters: 
+  `PolicyName` – The name of the scaling policy. 
+  `PolicyType` – Set this value to `TargetTrackingScaling`. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `TargetTrackingScalingPolicyConfiguration` – The target-tracking scaling policy configuration to use for the model. 

 The following example shows how to apply a target-tracking scaling policy named `MyScalingPolicy` to a model named `MyVariant`. The policy configuration is saved in a file named `scaling-policy.json`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScalingPolicy
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "MyScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": 
    {
        "TargetValue": 0.5,
        "PredefinedMetricSpecification": 
        {
            "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
        }
    }
}
```
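
In Python, applying the same policy can be sketched with Boto3, with the target-tracking configuration passed inline rather than read from a file; names are placeholders, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def apply_target_tracking_policy(policy_name, resource_id,
                                 target_value=0.5, client=None):
    """Attach a target-tracking scaling policy that keeps Provisioned
    Concurrency utilization near target_value."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "SageMakerVariantProvisionedConcurrencyUtilization"
            },
        },
    )
```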

#### Apply a target-tracking scaling policy (AWS Management Console)
<a name="serverless-endpoints-autoscale-apply-target-console"></a>

 To apply a target-tracking scaling policy with the AWS Management Console: 

1.  Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the navigation panel, choose **Inference**. 

1.  Choose **Endpoints** to view a list of all of your endpoints. 

1.  Choose the endpoint to which you want to apply the scaling policy. A page with the settings of the endpoint will appear, with the models (production variants) listed under the **Endpoint runtime settings** section. 

1.  Select the production variant to which you want to apply the scaling policy, and choose **Configure auto scaling**. The **Configure variant automatic scaling** dialog box appears.   
![\[Screenshot of the configure variant automatic scaling dialog box in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-variant-autoscaling.png)

1.  Enter the minimum and maximum Provisioned Concurrency values in the **Minimum provisioned concurrency** and **Maximum provisioned concurrency** fields, respectively, in the **Variant automatic scaling** section. Minimum Provisioned Concurrency must be less than or equal to maximum Provisioned Concurrency. 

1.  Enter the target value in the **Target value** field for the target metric, `SageMakerVariantProvisionedConcurrencyUtilization`. 

1.  (Optional) Enter scale-in and scale-out cooldown values (in seconds) in the **Scale in cool down** and **Scale out cool down** fields, respectively. 

1.  (Optional) Select **Disable scale in** if you don’t want automatic scaling to reduce Provisioned Concurrency when traffic decreases. 

1.  Select **Save**. 

### Scheduled scaling
<a name="serverless-endpoints-autoscale-apply-scheduled"></a>

If the traffic to your serverless endpoint with Provisioned Concurrency follows a routine pattern, you might want to schedule scaling actions at specific times to scale in or scale out Provisioned Concurrency. You can use the AWS CLI or the Application Auto Scaling API to schedule scaling actions. 

#### Scheduled scaling (AWS CLI)
<a name="serverless-endpoints-autoscale-apply-scheduled-cli"></a>

 To schedule a scaling action for your model, use the `put-scheduled-action` AWS CLI command with the following parameters: 
+  `--scheduled-action-name` – The name of the scaling action. 
+  `--schedule` – A cron expression that specifies the start and end times of the scaling action with a recurring schedule. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--scalable-target-action` – The target of the scaling action. 

The following example shows how to add a scaling action named `MyScalingAction` to a model named `MyVariant` on a recurring schedule. On the specified schedule (every day at 12:15 PM UTC), if the current Provisioned Concurrency is below the value specified for `MinCapacity`, Application Auto Scaling scales out the Provisioned Concurrency to the value specified by `MinCapacity`. 

```
aws application-autoscaling put-scheduled-action \
    --scheduled-action-name 'MyScalingAction' \
    --schedule 'cron(15 12 * * ? *)' \
    --service-namespace sagemaker \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --scalable-target-action 'MinCapacity=10'
```

#### Scheduled scaling (Application Auto Scaling API)
<a name="serverless-endpoints-autoscale-apply-scheduled-api"></a>

 To schedule a scaling action for your model, use the `PutScheduledAction` Application Auto Scaling API action with the following parameters: 
+  `ScheduledActionName` – The name of the scaling action. 
+  `Schedule` – A cron expression that specifies the start and end times of the scaling action with a recurring schedule. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `ScalableTargetAction` – The target of the scaling action. 

The following example shows how to add a scaling action named `MyScalingAction` to a model named `MyVariant` on a recurring schedule. On the specified schedule (every day at 12:15 PM UTC), if the current Provisioned Concurrency is below the value specified for `MinCapacity`, Application Auto Scaling scales out the Provisioned Concurrency to the value specified by `MinCapacity`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScheduledAction
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ScheduledActionName": "MyScalingAction",
    "Schedule": "cron(15 12 * * ? *)",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "ScalableTargetAction": {
        "MinCapacity": 10
    }
}
```
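
The equivalent call in Python can be sketched with Boto3; the action name, resource ID, and schedule mirror the example above, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def schedule_min_provisioned_concurrency(action_name, resource_id,
                                         min_capacity=10,
                                         schedule="cron(15 12 * * ? *)",
                                         client=None):
    """Schedule a recurring action that raises Provisioned Concurrency
    to at least min_capacity on the given cron schedule."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("application-autoscaling")
    return client.put_scheduled_action(
        ScheduledActionName=action_name,
        Schedule=schedule,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        ScalableTargetAction={"MinCapacity": min_capacity},
    )
```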

# Clean up
<a name="serverless-endpoints-autoscale-cleanup"></a>

 After you have finished using autoscaling for your serverless endpoint with Provisioned Concurrency, you should clean up the resources you created. This involves deleting the scaling policy and deregistering the model from Application Auto Scaling. Cleaning up ensures that you don't incur unnecessary costs for resources you're no longer using. 

## Delete a scaling policy
<a name="serverless-endpoints-autoscale-delete"></a>

 You can delete a scaling policy with the AWS Management Console, the AWS CLI, or the Application Auto Scaling API. For more information on deleting a scaling policy with the AWS Management Console, see [Delete a scaling policy](endpoint-auto-scaling-delete.md) in the [SageMaker AI autoscaling documentation](endpoint-auto-scaling.md). 

### Delete a scaling policy (AWS CLI)
<a name="serverless-endpoints-autoscale-delete-cli"></a>

 To delete a scaling policy from your model, use the `delete-scaling-policy` AWS CLI command with the following parameters: 
+  `--policy-name` – The name of the scaling policy. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

The following example deletes a scaling policy named `MyScalingPolicy` from a model named `MyVariant`. 

```
aws application-autoscaling delete-scaling-policy \
    --policy-name MyScalingPolicy \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant
```

### Delete a scaling policy (Application Auto Scaling API)
<a name="serverless-endpoints-autoscale-delete-api"></a>

 To delete a scaling policy from your model, use the `DeleteScalingPolicy` Application Auto Scaling API action with the following parameters: 
+  `PolicyName` – The name of the scaling policy. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example uses the Application Auto Scaling API to delete a scaling policy named `MyScalingPolicy` from a model named `MyVariant`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeleteScalingPolicy
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "MyScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency"
}
```

## Deregister a model
<a name="serverless-endpoints-autoscale-deregister"></a>

 You can deregister a model with the AWS Management Console, the AWS CLI, or the Application Auto Scaling API. 

### Deregister a model (AWS CLI)
<a name="serverless-endpoints-deregister-model-cli"></a>

 To deregister a model from Application Auto Scaling, use the `deregister-scalable-target` AWS CLI command with the following parameters: 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example deregisters a model named `MyVariant` from Application Auto Scaling. 

```
aws application-autoscaling deregister-scalable-target \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant
```

### Deregister a model (Application Auto Scaling API)
<a name="serverless-endpoints-autoscale-deregister-api"></a>

 To deregister a model from Application Auto Scaling use the `DeregisterScalableTarget` Application Auto Scaling API action with the following parameters: 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example uses the Application Auto Scaling API to deregister a model named `MyVariant` from Application Auto Scaling. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeregisterScalableTarget
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency"
}
```
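
Taken together, the cleanup steps in this chapter can be sketched as one Boto3 function; names are placeholders, and the `client` parameter exists only so the function can be exercised without live AWS calls:

```
def cleanup_autoscaling(policy_name, resource_id, client=None):
    """Delete the scaling policy, then deregister the scalable target."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        client = boto3.client("application-autoscaling")
    dimension = "sagemaker:variant:DesiredProvisionedConcurrency"
    client.delete_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
    )
    client.deregister_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
    )
```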

### Deregister a model (AWS Management Console)
<a name="serverless-endpoints-autoscale-deregister-console"></a>

 To deregister a model (production variant) with the AWS Management Console: 

1.  Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the navigation panel, choose **Inference**. 

1.  Choose **Endpoints** to view a list of your endpoints. 

1.  Choose the serverless endpoint hosting the production variant. A page with the settings of the endpoint will appear, with the production variants listed under **Endpoint runtime settings** section. 

1.  Select the production variant that you want to deregister, and choose **Configure auto scaling**. The **Configure variant automatic scaling** dialog box appears. 

1.  Choose **Deregister auto scaling**. 

# Troubleshooting
<a name="serverless-endpoints-troubleshooting"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

If you are having trouble with Serverless Inference, refer to the following troubleshooting tips.

## Container issues
<a name="serverless-endpoints-troubleshooting-containers"></a>

If the container you use for a serverless endpoint is the same one you used on an instance-based endpoint, your container may not have permission to write files. You can identify this issue by the following symptoms:
+ Your serverless endpoint fails to create or update due to a ping health check failure.
+ The Amazon CloudWatch logs for the endpoint show that the container is failing to write to some file or directory due to a permissions error.

To fix this issue, you can try to add read, write, and execute permissions for `other` on the file or directory and then rebuild the container. You can perform the following steps to complete this process:

1. In the Dockerfile you used to build your container, add the following command: `RUN chmod o+rwX <file or directory name>`

1. Rebuild the container.

1. Upload the new container image to Amazon ECR.

1. Try to create or update the serverless endpoint again.

# Asynchronous inference
<a name="async-inference"></a>

Amazon SageMaker Asynchronous Inference is a capability in SageMaker AI that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

## How It Works
<a name="async-inference-how-it-works"></a>

Creating an asynchronous inference endpoint is similar to creating a real-time inference endpoint. You can use your existing SageMaker AI models and only need to additionally specify the `AsyncInferenceConfig` object when creating your endpoint configuration with the `CreateEndpointConfig` API. The following diagram shows the architecture and workflow of Asynchronous Inference.

![\[Architecture diagram of Asynchronous Inference showing how a user invokes an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/async-architecture.png)


To invoke the endpoint, you need to place the request payload in Amazon S3. You also need to provide a pointer to this payload as a part of the `InvokeEndpointAsync` request. Upon invocation, SageMaker AI queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker AI places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon SNS. For more information about how to set up asynchronous notifications, see [Check prediction results](async-inference-check-predictions.md).

**Note**  
The presence of an asynchronous inference configuration (`AsyncInferenceConfig`) object in the endpoint configuration implies that the endpoint can only receive asynchronous invocations.

## How Do I Get Started?
<a name="async-inference-how-to-get-started"></a>

If you are a first-time user of Amazon SageMaker Asynchronous Inference, we recommend that you do the following:
+ Read [Asynchronous endpoint operations](async-inference-create-invoke-update-delete.md) for information on how to create, invoke, update, and delete an asynchronous endpoint.
+ Explore the [Asynchronous Inference example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb) in the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) GitHub repository.

Note that if your endpoint uses any of the features listed in this [Exclusions](deployment-guardrails-exclusions.md) page, you cannot use Asynchronous Inference.

# Asynchronous endpoint operations
<a name="async-inference-create-invoke-update-delete"></a>

This guide demonstrates the prerequisites you must satisfy to create an asynchronous endpoint, along with how to create, invoke, and delete your asynchronous endpoints. You can create, update, delete, and invoke asynchronous endpoints with the AWS SDKs and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-asynchronous-inference).

**Topics**
+ [Complete the prerequisites](async-inference-create-endpoint-prerequisites.md)
+ [How to create an Asynchronous Inference Endpoint](async-inference-create-endpoint.md)
+ [Invoke an Asynchronous Endpoint](async-inference-invoke-endpoint.md)
+ [Update an Asynchronous Endpoint](async-inference-update-endpoint.md)
+ [Delete an Asynchronous Endpoint](async-inference-delete-endpoint.md)

# Complete the prerequisites
<a name="async-inference-create-endpoint-prerequisites"></a>

The following topic describes the prerequisites that you must complete before creating an asynchronous endpoint. These prerequisites include properly storing your model artifacts, configuring an AWS Identity and Access Management (IAM) role with the correct permissions, and selecting a container image.

**To complete the prerequisites**

1. **Create an IAM role for Amazon SageMaker AI.**

   Asynchronous Inference needs access to your Amazon S3 bucket URI. To facilitate this, create an IAM role that can run SageMaker AI and has permission to access Amazon S3 and Amazon SNS. Using this role, SageMaker AI can run under your account and access your Amazon S3 bucket and Amazon SNS topics.

   You can create an IAM role by using the IAM console, AWS SDK for Python (Boto3), or AWS CLI. The following is an example of how to create an IAM role and attach the necessary policies with the IAM console.

   1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. In the navigation pane of the IAM console, choose **Roles**, and then choose **Create role**.

   1. For **Select type of trusted entity**, choose **AWS service**.

   1. Choose the service that you want to allow to assume this role. In this case, choose **SageMaker AI**. Then choose **Next: Permissions**.
      + This automatically creates an IAM policy that grants access to related services such as Amazon S3, Amazon ECR, and CloudWatch Logs.

   1. Choose **Next: Tags**.

   1. (Optional) Add metadata to the role by attaching tags as key–value pairs. For more information about using tags in IAM, see [Tagging IAM resources](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_tags.html).

   1. Choose **Next: Review**.

   1. Type in a **Role name**. Role names must be unique within your AWS account and are not distinguished by case. For example, you cannot create roles named both `PRODROLE` and `prodrole`. Because other AWS resources might reference the role, you cannot edit the name of the role after it has been created.

   1. (Optional) For **Role description**, type a description for the new role.

   1. Review the role and then choose **Create role**.

      Note the SageMaker AI role ARN. To find the role ARN using the console, do the following:

      1. Go to the IAM console: [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)

      1. Select **Roles**.

      1. Search for the role you just created by typing in the name of the role in the search field.

      1. Select the role.

      1. The role ARN is at the top of the **Summary** page.

1. **Add Amazon SageMaker AI, Amazon S3 and Amazon SNS Permissions to your IAM Role.**

   Once the role is created, grant SageMaker AI, Amazon S3, and optionally Amazon SNS permissions to your IAM role.

   Choose **Roles** in the IAM console. Search for the role you created by typing in your role name in the **Search** field.

   1. Choose your role.

   1. Next, choose **Attach Policies**.

   1. Amazon SageMaker Asynchronous Inference needs permission to perform the following actions: `"sagemaker:CreateModel"`, `"sagemaker:CreateEndpointConfig"`, `"sagemaker:CreateEndpoint"`, and `"sagemaker:InvokeEndpointAsync"`. 

      These actions are included in the `AmazonSageMakerFullAccess` policy. Add this policy to your IAM role. Search for `AmazonSageMakerFullAccess` in the **Search** field. Select `AmazonSageMakerFullAccess`.

   1. Choose **Attach policy**.

   1. Next, choose **Attach Policies** to add Amazon S3 permissions.

   1. Select **Create policy**.

   1. Select the `JSON` tab.

   1. Add the following policy statement:

------
#### [ JSON ]

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:AbortMultipartUpload",
                      "s3:ListBucket"
                  ],
                  "Effect": "Allow",
                  "Resource": "arn:aws:s3:::bucket_name/*"
              }
          ]
      }
      ```

------

   1. Choose **Next: Tags**.

   1. Type in a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the same steps you completed to add Amazon S3 permissions in order to add Amazon SNS permissions. For the policy statement, attach the following:

------
#### [ JSON ]

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "sns:Publish"
                  ],
                  "Effect": "Allow",
                  "Resource": "arn:aws:sns:us-east-1:111122223333:SNS_Topic"
              }
          ]
      }
      ```

------

1. **Upload your inference data (e.g., machine learning model, sample data) to Amazon S3.**

1. **Select a prebuilt Docker inference image or create your own Inference Docker Image.**

   SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker AI images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). If you choose to use a SageMaker AI provided container, you can increase the endpoint timeout and payload sizes from the default by setting the environment variables in the container. To learn how to set the different environment variables for each framework, see the Create a Model step of creating an asynchronous endpoint.

   If none of the existing SageMaker AI containers meet your needs and you don't have an existing container of your own, you may need to create a new Docker container. See [Containers with custom inference code](your-algorithms-inference-main.md) for information on how to create your Docker image.

1. **Create an Amazon SNS topic (optional)**

   Create an Amazon Simple Notification Service (Amazon SNS) topic that sends notifications about requests that have completed processing. Amazon SNS is a notification service for messaging-oriented applications, in which subscribers request and receive "push" notifications of time-critical messages through a choice of transport protocols, including HTTP, Amazon SQS, and email. You specify Amazon SNS topics in the `AsyncInferenceConfig` object when you create your endpoint configuration with the `CreateEndpointConfig` API. 

   Follow the steps to create and subscribe to an Amazon SNS topic.

   1. Using the Amazon SNS console, create a topic. For instructions, see [Creating an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/CreateTopic.html) in the *Amazon Simple Notification Service Developer Guide*.

   1. Subscribe to the topic. For instructions, see [Subscribing to an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html) in the *Amazon Simple Notification Service Developer Guide*.

   1. When you receive email requesting that you confirm your subscription to the topic, confirm the subscription.

   1. Note the topic Amazon Resource Name (ARN). The Amazon SNS topic you created is another resource in your AWS account, and it has a unique ARN. The ARN is in the following format:

      ```
      arn:aws:sns:aws-region:account-id:topic-name
      ```

   For more information about Amazon SNS, see the [Amazon SNS Developer Guide](https://docs.aws.amazon.com/sns/latest/dg/welcome.html).
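As a quick sanity check on the ARN you noted, you can split it into its components programmatically. The following helper is purely illustrative (it is not part of any AWS SDK) and assumes the standard `arn:aws:sns:aws-region:account-id:topic-name` format shown above:

```python
def parse_sns_topic_arn(arn):
    """Split an SNS topic ARN of the form
    arn:aws:sns:aws-region:account-id:topic-name
    into its region, account ID, and topic name."""
    parts = arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"Not an SNS topic ARN: {arn}")
    # parts[1] is the partition (usually "aws"); parts[3:] hold the rest.
    _, _, _, region, account_id, topic_name = parts
    return region, account_id, topic_name
```

You can use the returned account ID and topic name, for example, to confirm that the topic belongs to the expected AWS account before referencing it in an endpoint configuration.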

# How to create an Asynchronous Inference Endpoint
<a name="async-inference-create-endpoint"></a>

Create an asynchronous endpoint the same way you would create an endpoint using SageMaker AI hosting services:
+ Create a model in SageMaker AI with `CreateModel`.
+ Create an endpoint configuration with `CreateEndpointConfig`.
+ Create an HTTPS endpoint with `CreateEndpoint`.

To create an endpoint, you first create a model with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html), where you point to the model artifact and a Docker registry path (Image). You then create a configuration using [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where you specify one or more models that were created using the `CreateModel` API to deploy and the resources that you want SageMaker AI to provision. Create your endpoint with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) using the endpoint configuration specified in the request. You can update an asynchronous endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API. Send and receive inference requests from the model hosted at the endpoint with `InvokeEndpointAsync`. You can delete your endpoints with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API.

For a full list of the available SageMaker Images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). See [Containers with custom inference code](your-algorithms-inference-main.md) for information on how to create your Docker image.

**Topics**
+ [Create a Model](async-inference-create-endpoint-create-model.md)
+ [Create an Endpoint Configuration](async-inference-create-endpoint-create-endpoint-config.md)
+ [Create Endpoint](async-inference-create-endpoint-create-endpoint.md)

# Create a Model
<a name="async-inference-create-endpoint-create-model"></a>

The following example shows how to create a model using the AWS SDK for Python (Boto3). The first few lines define:
+ `sagemaker_client`: A low-level SageMaker AI client object that makes it easy to send and receive requests to AWS services.
+ `sagemaker_role`: A string variable with the SageMaker AI IAM role Amazon Resource Name (ARN).
+ `aws_region`: A string variable with the name of your AWS region.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Role to give SageMaker permission to access AWS services.
sagemaker_role= "arn:aws:iam::<account>:role/*"
```

Next, specify the location of the pre-trained model stored in Amazon S3. In this example, we use a pre-trained XGBoost model named `demo-xgboost-model.tar.gz`. The full Amazon S3 URI is stored in a string variable `model_url`:

```
#Create a variable w/ the model S3 URI
s3_bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
bucket_prefix='saved_models'
model_s3_key = f"{bucket_prefix}/demo-xgboost-model.tar.gz"

#Specify S3 bucket w/ model
model_url = f"s3://{s3_bucket}/{model_s3_key}"
```

Specify a primary container. For the primary container, you specify the Docker image that contains inference code, artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions.

 In this example, we specify an XGBoost built-in algorithm container image: 

```
from sagemaker import image_uris

# Specify an AWS container image. 
container = image_uris.retrieve(region=aws_region, framework='xgboost', version='0.90-1')
```

Create a model in Amazon SageMaker AI with `CreateModel`. Specify the following:
+ `ModelName`: A name for your model (in this example it is stored as a string variable called `model_name`).
+ `ExecutionRoleArn`: The Amazon Resource Name (ARN) of the IAM role that Amazon SageMaker AI can assume to access model artifacts and Docker images for deployment on ML compute instances or for batch transform jobs.
+ `PrimaryContainer`: The location of the primary Docker image containing inference code, associated artifacts, and custom environment maps that the inference code uses when the model is deployed for predictions.

```
model_name = '<The_name_of_the_model>'

#Create model
create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url,
    })
```

See [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) description in the SageMaker API Reference Guide for a full list of API parameters.

If you're using a SageMaker AI provided container, you can increase the model server timeout and payload sizes from the default values to the framework-supported maximums by setting environment variables in this step. If you don't explicitly set these variables, you might not be able to use the maximum timeout and payload sizes that Asynchronous Inference supports. The following example shows how you can set the environment variables for a PyTorch Inference container based on TorchServe.

```
model_name = '<The_name_of_the_model>'

#Create model
create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url,
        'Environment': {
            'TS_MAX_REQUEST_SIZE': '100000000',
            'TS_MAX_RESPONSE_SIZE': '100000000',
            'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'
        },
    })
```

After you finish creating your endpoint, you should test that you've set the environment variables correctly by printing them out from your `inference.py` script. The following table lists the environment variables for several frameworks that you can set to change the default values.


| Framework | Environment variables | 
| --- | --- | 
|  PyTorch 1.8 (based on TorchServe)  |  `'TS_MAX_REQUEST_SIZE': '100000000'`, `'TS_MAX_RESPONSE_SIZE': '100000000'`, `'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'`  | 
|  PyTorch 1.4 (based on MMS)  |  `'MMS_MAX_REQUEST_SIZE': '1000000000'`, `'MMS_MAX_RESPONSE_SIZE': '1000000000'`, `'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'`  | 
|  HuggingFace Inference Container (based on MMS)  |  `'MMS_MAX_REQUEST_SIZE': '2000000000'`, `'MMS_MAX_RESPONSE_SIZE': '2000000000'`, `'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'`  | 
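To confirm from inside the container that the variables took effect, you can log them from your `inference.py` script. The following sketch assumes the TorchServe-based PyTorch container; swap the `TS_` prefix for `MMS_` if your container is based on the Multi Model Server:

```python
import os

# Variable names for the TorchServe-based PyTorch container; use the
# MMS_ equivalents for MMS-based containers.
ASYNC_ENV_VARS = (
    "TS_MAX_REQUEST_SIZE",
    "TS_MAX_RESPONSE_SIZE",
    "TS_DEFAULT_RESPONSE_TIMEOUT",
)

def report_async_env(environ=None):
    """Return the effective value of each async-related variable,
    or '<not set>' if the container did not receive it."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<not set>") for name in ASYNC_ENV_VARS}

# In inference.py, print the result so it lands in the CloudWatch logs:
# print(report_async_env())
```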

# Create an Endpoint Configuration
<a name="async-inference-create-endpoint-create-endpoint-config"></a>

Once you have a model, create an endpoint configuration with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). Amazon SageMaker AI hosting services uses this configuration to deploy models. In the configuration, you identify one or more models created with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) to deploy, along with the resources that you want Amazon SageMaker AI to provision. Specify the `AsyncInferenceConfig` object and provide an output Amazon S3 location for `OutputConfig`. You can optionally specify [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) topics on which to send notifications about prediction results. For more information about Amazon SNS topics, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html).

The following example shows how to create an endpoint configuration using AWS SDK for Python (Boto3):

```
import datetime
from time import gmtime, strftime

# Create an endpoint config name. Here we create one based on the date
# so that we can search endpoints based on creation time.
endpoint_config_name = f"XGBoostEndpointConfig-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}"

# The name of the model that you want to host. This is the name that you specified when creating the model.
model_name='<The_name_of_your_model>'

create_endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, # You will specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": "ml.m5.xlarge", # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Location to upload response outputs when no location is provided in the request.
            "S3OutputPath": f"s3://{s3_bucket}/{bucket_prefix}/output",
            # (Optional) Specify Amazon SNS topics
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:aws-region:account-id:topic-name",
                "ErrorTopic": "arn:aws:sns:aws-region:account-id:topic-name",
            }
        },
        "ClientConfig": {
            # (Optional) Specify the max number of inflight invocations per instance
            # If no value is provided, Amazon SageMaker will choose an optimal value for you
            "MaxConcurrentInvocationsPerInstance": 4
        }
    }
)

print(f"Created EndpointConfig: {create_endpoint_config_response['EndpointConfigArn']}")
```

In the preceding example, you specify the following keys for `OutputConfig` in the `AsyncInferenceConfig` field:
+ `S3OutputPath`: Location to upload response outputs when no location is provided in the request.
+ `NotificationConfig`: (Optional) SNS topics that post notifications to you when an inference request is successful (`SuccessTopic`) or if it fails (`ErrorTopic`).

You can also specify the following optional argument for `ClientConfig` in the `AsyncInferenceConfig` field:
+ `MaxConcurrentInvocationsPerInstance`: (Optional) The maximum number of concurrent requests sent by the SageMaker AI client to the model container.

# Create Endpoint
<a name="async-inference-create-endpoint-create-endpoint"></a>

Once you have your model and endpoint configuration, use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account. 

The following creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources and deploy models.

```
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = '<endpoint-name>' 

# The name of the endpoint configuration associated with this endpoint.
endpoint_config_name='<endpoint-config-name>'

create_endpoint_response = sagemaker_client.create_endpoint(
                                            EndpointName=endpoint_name, 
                                            EndpointConfigName=endpoint_config_name)
```

When you call the `CreateEndpoint` API, Amazon SageMaker Asynchronous Inference sends a test notification to check that you have configured an Amazon SNS topic. Amazon SageMaker Asynchronous Inference also sends test notifications after calls to `UpdateEndpoint` and `UpdateEndpointWeightsAndCapacities`. This lets SageMaker AI check that you have the required permissions. You can safely ignore the test notification, which has the following form:

```
{
    "eventVersion":"1.0",
    "eventSource":"aws:sagemaker",
    "eventName":"TestNotification"
}
```

# Invoke an Asynchronous Endpoint
<a name="async-inference-invoke-endpoint"></a>

Get inferences from the model hosted at your asynchronous endpoint with `InvokeEndpointAsync`. 

**Note**  
If you have not done so already, upload your inference data (e.g., machine learning model, sample data) to Amazon S3.

Specify the following fields in your request:
+ For `InputLocation`, specify the location of your inference data.
+ For `EndpointName`, specify the name of your endpoint.
+ (Optional) For `InvocationTimeoutSeconds`, you can set the max timeout for the requests. You can set this value to a maximum of 3600 seconds (one hour) on a per-request basis. If you don't specify this field in your request, by default the request times out at 15 minutes.

```
# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='<aws_region>')

# Specify the location of the input. Here, a single SVM sample
input_location = "s3://bucket-name/test_point_0.libsvm"

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account. 
endpoint_name='<endpoint-name>'

# After you deploy a model into production using SageMaker AI hosting 
# services, your client applications use this API to get inferences 
# from the model hosted at the specified endpoint.
response = sagemaker_runtime.invoke_endpoint_async(
                            EndpointName=endpoint_name, 
                            InputLocation=input_location,
                            InvocationTimeoutSeconds=3600)
```

You receive a JSON response containing your request ID and the Amazon S3 output location where the response to the API call will be placed after it is processed.
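The output location in the response is an Amazon S3 URI. To download the result once it is ready, you can split that URI into a bucket and key and pass them to an S3 client. The following helper is illustrative:

```python
def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into a (bucket, key) pair."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key

# Usage with the InvokeEndpointAsync response from the earlier example
# (assumes `response` and boto3 are available):
# bucket, key = parse_s3_uri(response["OutputLocation"])
# result = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```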

# Update an Asynchronous Endpoint
<a name="async-inference-update-endpoint"></a>

Update an asynchronous endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API. When you update an endpoint, SageMaker AI first provisions and switches to the new endpoint configuration you specify before it deletes the resources that were provisioned in the previous endpoint configuration. Do not delete an `EndpointConfig` with an endpoint that is live or while the `UpdateEndpoint` or `CreateEndpoint` operations are being performed on the endpoint. 

```
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name='<endpoint-name>'

# The name of the endpoint configuration associated with this endpoint.
endpoint_config_name='<endpoint-config-name>'

sagemaker_client.update_endpoint(
                                EndpointConfigName=endpoint_config_name,
                                EndpointName=endpoint_name
                                )
```

When Amazon SageMaker AI receives the request, it sets the endpoint status to **Updating**. After updating the asynchronous endpoint, it sets the status to **InService**. To check the status of an endpoint, use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. For a full list of parameters you can specify when updating an endpoint, see the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API.
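A simple way to block until the update completes is to poll `DescribeEndpoint`. The following sketch takes the describe call as a parameter so it stays easy to test; in practice you would pass something like `lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name)` (the function itself is illustrative, not part of the SageMaker AI API):

```python
import time

def wait_for_endpoint_status(describe_fn, target="InService",
                             poll_seconds=15, max_polls=120):
    """Poll until DescribeEndpoint reports the target status.

    describe_fn: a zero-argument callable returning a DescribeEndpoint
    response dict (one containing an 'EndpointStatus' key).
    """
    for _ in range(max_polls):
        status = describe_fn()["EndpointStatus"]
        if status == target:
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint entered the Failed state")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint did not reach {target}")
```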

# Delete an Asynchronous Endpoint
<a name="async-inference-delete-endpoint"></a>

Delete an asynchronous endpoint in a similar manner to how you would delete a SageMaker AI hosted endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API. Specify the name of the asynchronous endpoint you want to delete. When you delete an endpoint, SageMaker AI frees up all of the resources that were deployed when the endpoint was created. Deleting a model does not delete model artifacts, inference code, or the IAM role that you specified when creating the model.

Delete your SageMaker AI model with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html) API or with the SageMaker AI console.

------
#### [ Boto3 ]

```
import boto3 

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name='<aws_region>')
sagemaker_client.delete_endpoint(EndpointName='<endpoint-name>')
```

------
#### [ SageMaker AI console ]

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Expand the **Inference** dropdown list.

1. Select **Endpoints**.

1.  Search for your endpoint in the **Search endpoints** search bar.

1. Select your endpoint.

1. Choose **Delete**.

------

In addition to deleting the asynchronous endpoint, you might want to clear up other resources that were used to create the endpoint, such as the Amazon ECR repository (if you created a custom inference image), the SageMaker AI model, and the asynchronous endpoint configuration itself. 

# Alarms and logs for tracking metrics from asynchronous endpoints
<a name="async-inference-monitor"></a>

You can monitor SageMaker AI using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. With Amazon CloudWatch, you can access historical information and gain a better perspective on how your web application or service is performing. For more information about Amazon CloudWatch, see [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)

## Monitoring with CloudWatch
<a name="async-inference-monitor-cloudwatch"></a>

The metrics below are an exhaustive list of the metrics for asynchronous endpoints and are in the `AWS/SageMaker` namespace. Any metric not listed below is not published when the endpoint is enabled for asynchronous inference. Such metrics include (but are not limited to):
+ OverheadLatency
+ Invocations
+ InvocationsPerInstance

### Common Endpoint Metrics
<a name="async-inference-monitor-cloudwatch-common"></a>

These metrics are the same as the metrics published for real-time endpoints today. For more information about other metrics in Amazon CloudWatch, see [Monitor SageMaker AI with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `Invocation4XXErrors` | The number of requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | Units: None. Valid statistics: Average, Sum | 
| `Invocation5XXErrors` | The number of `InvokeEndpoint` requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | Units: None. Valid statistics: Average, Sum | 
| `ModelLatency` | The interval of time taken by a model to respond as viewed from SageMaker AI. This interval includes the local communication time taken to send the request and to fetch the response from the model's container, and the time taken to complete the inference in the container. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 

### Asynchronous Inference Endpoint Metrics
<a name="async-inference-monitor-cloudwatch-async"></a>

These metrics are published for endpoints enabled for asynchronous inference. The following metrics are published with the `EndpointName` dimension:


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `ApproximateBacklogSize` | The number of items in the queue for an endpoint that are currently being processed or yet to be processed. | Units: Count. Valid statistics: Average, Max, Min | 
| `ApproximateBacklogSizePerInstance` | The number of items in the queue divided by the number of instances behind an endpoint. This metric is primarily used for setting up application autoscaling for an async-enabled endpoint. | Units: Count. Valid statistics: Average, Max, Min | 
| `ApproximateAgeOfOldestRequest` | The age of the oldest request in the queue. | Units: Seconds. Valid statistics: Average, Max, Min | 
| `HasBacklogWithoutCapacity` | The value of this metric is `1` when there are requests in the queue but zero instances behind the endpoint. The value is `0` at all other times. You can use this metric for autoscaling your endpoint up from zero instances upon receiving a new request in the queue. | Units: Count. Valid statistics: Average | 
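As a sketch of how you might pull one of these metrics programmatically, the following builds a CloudWatch `GetMetricStatistics` request for `ApproximateBacklogSizePerInstance` (the helper name and the 30-minute window are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def backlog_metric_request(endpoint_name, window_minutes=30):
    """Build a GetMetricStatistics request for the
    ApproximateBacklogSizePerInstance metric of an async endpoint."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ApproximateBacklogSizePerInstance",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "StartTime": now - timedelta(minutes=window_minutes),
        "EndTime": now,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }

# Usage (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**backlog_metric_request("<endpoint-name>"))
```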

The following metrics are published with the `EndpointName` and `VariantName` dimensions:


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `RequestDownloadFailures` | The number of inference failures caused by an issue downloading the request from Amazon S3. | Units: Count. Valid statistics: Sum | 
| `ResponseUploadFailures` | The number of inference failures caused by an issue uploading the response to Amazon S3. | Units: Count. Valid statistics: Sum | 
| `NotificationFailures` | The number of failures that occur while publishing notifications. | Units: Count. Valid statistics: Sum | 
| `RequestDownloadLatency` | The total time to download the request payload. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `ResponseUploadLatency` | The total time to upload the response payload. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `ExpiredRequests` | The number of requests in the queue that fail because they reached their specified request TTL. | Units: Count. Valid statistics: Sum | 
| `InvocationFailures` | The number of invocations that fail for any reason. | Units: Count. Valid statistics: Sum | 
| `InvocationsProcesssed` | The number of async invocations processed by the endpoint. | Units: Count. Valid statistics: Sum | 
| `TimeInBacklog` | The total time a request was queued before being processed. This does not include the actual processing time (that is, downloading time, uploading time, and model latency). | Units: Milliseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `TotalProcessingTime` | The time from when the inference request was received by SageMaker AI to when the request finished processing. This includes time in backlog and time to upload and send response notifications, if any. | Units: Milliseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 

Amazon SageMaker Asynchronous Inference also includes host-level metrics. For information on host-level metrics, see [SageMaker AI Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs).

## Logs
<a name="async-inference-monitor-logs"></a>

In addition to the [Model container logs](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html) that are published to Amazon CloudWatch in your account, you also get a new platform log for tracing and debugging inference requests.

The new logs are published under the Endpoint Log Group:

```
/aws/sagemaker/Endpoints/[EndpointName]
```

The log stream name consists of: 

```
[production-variant-name]/[instance-id]/data-log
```

Log lines contain the request’s inference ID so that errors can be easily mapped to a particular request.
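Because log lines carry the inference ID, you can search the endpoint's log group for a specific request. A minimal sketch, assuming the default variant name `variant1` (the helper name is illustrative):

```python
def data_log_filter(endpoint_name, inference_id, variant_name="variant1"):
    """Build a CloudWatch Logs filter_log_events request that searches the
    endpoint's data-log streams for lines mentioning one inference ID."""
    return {
        "logGroupName": f"/aws/sagemaker/Endpoints/{endpoint_name}",
        "logStreamNamePrefix": f"{variant_name}/",
        # Quoting the term makes CloudWatch Logs match it as an exact phrase.
        "filterPattern": f'"{inference_id}"',
    }

# Usage (requires AWS credentials):
# import boto3
# logs = boto3.client("logs")
# events = logs.filter_log_events(**data_log_filter("<endpoint-name>", "<inference-id>"))
```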

# Check prediction results
<a name="async-inference-check-predictions"></a>

There are several ways you can check prediction results from your asynchronous endpoint. Some options are:

1. Subscribe to an Amazon SNS topic.

1. Check for outputs in your Amazon S3 bucket.

## Amazon SNS Topics
<a name="async-inference-check-predictions-sns-topic"></a>

Amazon SNS is a notification service for messaging-oriented applications, with multiple subscribers requesting and receiving "push" notifications of time-critical messages via a choice of transport protocols, including HTTP, Amazon SQS, and email. Amazon SageMaker Asynchronous Inference posts notifications when you create an endpoint with [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) and specify an Amazon SNS topic.

**Note**  
In order to receive Amazon SNS notifications, your IAM role must have `sns:Publish` permissions. See [Complete the prerequisites](async-inference-create-endpoint-prerequisites.md) for information on requirements you must satisfy to use Asynchronous Inference.

To use Amazon SNS to check prediction results from your asynchronous endpoint, you first need to create a topic, subscribe to the topic, confirm your subscription to the topic, and note the Amazon Resource Name (ARN) of that topic. For detailed information on how to create, subscribe, and find the Amazon ARN of an Amazon SNS topic, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html).
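As a sketch of those setup steps with Boto3 (the helper name and topic name are illustrative), you might do:

```python
def setup_sns_topic(sns, topic_name, email_address):
    """Create an SNS topic, subscribe an email address to it, and return the
    topic ARN. The recipient must still confirm the subscription from the
    confirmation email they receive."""
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email_address)
    return topic_arn

# Usage (requires AWS credentials):
# import boto3
# topic_arn = setup_sns_topic(boto3.client("sns"),
#                             "async-success-topic", "you@example.com")
```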

Provide the Amazon SNS topic ARN(s) in the `AsyncInferenceConfig` field when you create an endpoint configuration with `CreateEndpointConfig`. You can specify both an Amazon SNS `ErrorTopic` and a `SuccessTopic`.

```
import boto3

sagemaker_client = boto3.client('sagemaker', region_name=<aws_region>)

sagemaker_client.create_endpoint_config(
    EndpointConfigName=<endpoint_config_name>, # You specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": "model_name", 
            "InstanceType": "ml.m5.xlarge", # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Location to upload response outputs when no location is provided in the request.
            "S3OutputPath": "s3://<bucket>/<output_directory>",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:aws-region:account-id:topic-name",
                "ErrorTopic": "arn:aws:sns:aws-region:account-id:topic-name",
            }
        }
    }
)
```

After creating your endpoint and invoking it, you receive a notification from your Amazon SNS topic. For example, if you subscribed to receive email notifications from your topic, you receive an email notification every time you invoke your endpoint. The following example shows the JSON content of a successful invocation email notification.

```
{
   "awsRegion":"us-east-1",
   "eventTime":"2022-01-25T22:46:00.608Z",
   "receivedTime":"2022-01-25T22:46:00.455Z",
   "invocationStatus":"Completed",
   "requestParameters":{
      "contentType":"text/csv",
      "endpointName":"<example-endpoint>",
      "inputLocation":"s3://<bucket>/<input-directory>/input-data.csv"
   },
   "responseParameters":{
      "contentType":"text/csv; charset=utf-8",
      "outputLocation":"s3://<bucket>/<output_directory>/prediction.out"
   },
   "inferenceId":"11111111-2222-3333-4444-555555555555", 
   "eventVersion":"1.0",
   "eventSource":"aws:sagemaker",
   "eventName":"InferenceResult"
}
```

## Check Your S3 Bucket
<a name="async-inference-check-predictions-s3-bucket"></a>

When you invoke an endpoint with `InvokeEndpointAsync`, it returns a response object. You can use the response object to get the Amazon S3 URI where your output is stored. With the output location, you can use the SageMaker Python SDK session class to programmatically check for an output.

The following example stores the output dictionary of `InvokeEndpointAsync` as a variable named `response`. With the `response` variable, you then get the Amazon S3 output URI and store it as a string variable called `output_location`. 

```
import uuid
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=<aws_region>)

# Specify the S3 URI of the input. Here, a single SVM sample
input_location = "s3://bucket-name/test_point_0.libsvm" 

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName='<endpoint-name>',
    InputLocation=input_location,
    InferenceId=str(uuid.uuid4()), 
    ContentType="text/libsvm" #Specify the content type of your data
)

output_location = response['OutputLocation']
print(f"OutputLocation: {output_location}")
```

For information about supported content types, see [Common data formats for inference](cdf-inference.md).

With the Amazon S3 output location, you can then use a [SageMaker Python SDK SageMaker AI Session Class](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=session) to read Amazon S3 files. The following code example shows how to create a function (`get_output`) that repeatedly attempts to read a file from the Amazon S3 output location:

```
import sagemaker
import urllib, time
from botocore.exceptions import ClientError

sagemaker_session = sagemaker.session.Session()

def get_output(output_location):
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]
    while True:
        try:
            return sagemaker_session.read_s3_file(
                bucket=bucket,
                key_prefix=key)
        except ClientError as e:
            if e.response['Error']['Code'] == 'NoSuchKey':
                print("waiting for output...")
                time.sleep(2)
                continue
            raise
            
output = get_output(output_location)
print(f"Output: {output}")
```

# Autoscale an asynchronous endpoint
<a name="async-inference-autoscale"></a>

Amazon SageMaker AI supports automatic scaling (autoscaling) for your asynchronous endpoint. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike other hosted models that Amazon SageMaker AI supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero. Requests that are received when there are zero instances are queued for processing once the endpoint scales up.

To autoscale your asynchronous endpoint, you must at a minimum:
+ Register a deployed model (production variant).
+ Define a scaling policy.
+ Apply the autoscaling policy.

Before you can use autoscaling, you must have already deployed a model to a SageMaker AI endpoint. Deployed models are referred to as a [production variant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). See [Deploy the Model to SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html#ex1-deploy-model) for more information about deploying a model to an endpoint. To specify the metrics and target values for a scaling policy, you configure a scaling policy. For information on how to define a scaling policy, see [Define a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-define.html). After registering your model and defining a scaling policy, apply the scaling policy to the registered model. For information on how to apply the scaling policy, see [Apply a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-apply.html).

For more information on how to define an optional additional scaling policy that scales up your endpoint upon receiving a request after your endpoint has been scaled down to zero, see [Optional: Define a scaling policy that scales up from zero for new requests](#async-inference-autoscale-scale-up). If you don’t specify this optional policy, then your endpoint only initiates scaling up from zero after the number of backlog requests exceeds the target tracking value.

For details on other prerequisites and components used with autoscaling, see the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-prerequisites.html) section in the SageMaker AI autoscaling documentation.

**Note**  
If you attach multiple scaling policies to the same autoscaling group, you might have scaling conflicts. When a conflict occurs, Amazon EC2 Auto Scaling chooses the policy that provisions the largest capacity for both scale out and scale in. For more information about this behavior, see [Multiple dynamic scaling policies](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html#multiple-scaling-policy-resolution) in the *Amazon EC2 Auto Scaling documentation*.

## Define a scaling policy
<a name="async-inference-autoscale-define-async"></a>

To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You use that text file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see [https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the Application Auto Scaling API Reference.

For asynchronous endpoints, SageMaker AI strongly recommends that you create a target-tracking scaling policy configuration for a variant. In this configuration example, we use a customized metric specification (`CustomizedMetricSpecification`) with the `ApproximateBacklogSizePerInstance` metric.

```
TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0, # The target value for the metric. Here the metric is: ApproximateBacklogSizePerInstance
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': <endpoint_name> }
            ],
            'Statistic': 'Average',
        }
    }
```

## Define a scaling policy that scales to zero
<a name="async-inference-autoscale-define-async-zero"></a>

The following shows you how to both define and register your endpoint variant with application autoscaling using the AWS SDK for Python (Boto3). After defining a low-level client object representing application autoscaling with Boto3, we use the [register_scalable_target](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) method to register the production variant. We set `MinCapacity` to 0 because Asynchronous Inference enables you to autoscale to 0 when there are no requests to process.

```
import boto3

# Low-level client representing Application Auto Scaling
client = boto3.client('application-autoscaling')

# This is the format in which application autoscaling references the endpoint
resource_id='endpoint/' + <endpoint_name> + '/variant/' + 'variant1' 

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=0,
    MaxCapacity=5
)
```

For a detailed description of the Application Auto Scaling API, see the [Application Auto Scaling Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) documentation.

## Optional: Define a scaling policy that scales up from zero for new requests
<a name="async-inference-autoscale-scale-up"></a>

You might have a use case where you have sporadic requests or periods with low numbers of requests. If your endpoint has been scaled down to zero instances during these periods, then your endpoint won’t scale up again until the number of requests in the queue exceeds the target specified in your scaling policy. This can result in long waiting times for requests in the queue. The following section shows you how to create an additional scaling policy that scales your endpoint up from zero instances after receiving any new request in the queue. Your endpoint will be able to respond to new requests more quickly instead of waiting for the queue size to exceed the target.

To create a scaling policy for your endpoint that scales up from zero instances, do the following:

1. Create a scaling policy that defines the desired behavior, which is to scale up your endpoint when it’s at zero instances but has requests in the queue. The following shows you how to define a scaling policy called `HasBacklogWithoutCapacity-ScalingPolicy` using the AWS SDK for Python (Boto3). When the queue size is greater than zero and the current instance count for your endpoint is zero, the policy scales your endpoint up. In all other cases, the policy does not affect scaling for your endpoint.

   ```
   response = client.put_scaling_policy(
       PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
       ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
       ResourceId=resource_id,  # Endpoint name
       ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
       PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
       StepScalingPolicyConfiguration={
           "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
           "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
           "Cooldown": 300, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
           "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
           [ 
               {
                 "MetricIntervalLowerBound": 0,
                 "ScalingAdjustment": 1
               }
             ]
       },    
   )
   ```

1. Create a CloudWatch alarm with the custom metric `HasBacklogWithoutCapacity`. When triggered, the alarm initiates the previously defined scaling policy. For more information about the `HasBacklogWithoutCapacity` metric, see [Asynchronous Inference Endpoint Metrics](async-inference-monitor.md#async-inference-monitor-cloudwatch-async).

   ```
   cw_client = boto3.client('cloudwatch')  # Amazon CloudWatch client

   response = cw_client.put_metric_alarm(
       AlarmName=step_scaling_policy_alarm_name,
       MetricName='HasBacklogWithoutCapacity',
       Namespace='AWS/SageMaker',
       Statistic='Average',
       EvaluationPeriods= 2,
       DatapointsToAlarm= 2,
       Threshold= 1,
       ComparisonOperator='GreaterThanOrEqualToThreshold',
       TreatMissingData='missing',
       Dimensions=[
           { 'Name':'EndpointName', 'Value':endpoint_name },
       ],
       Period= 60,
       AlarmActions=[step_scaling_policy_arn]
   )
   ```

You should now have a scaling policy and CloudWatch alarm that scale up your endpoint from zero instances whenever your queue has pending requests.

# Troubleshooting
<a name="async-inference-troubleshooting"></a>

The following FAQs can help you troubleshoot issues with your Amazon SageMaker Asynchronous Inference endpoints.

## Q: I have autoscaling enabled. How can I find the instance count behind the endpoint at any given point?
<a name="async-troubleshooting-q1"></a>

You can use the following methods to find the instance count behind your endpoint:
+ You can use the SageMaker AI [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API to describe the number of instances behind the endpoint at any given point in time.
+ You can get the instance count by viewing your Amazon CloudWatch metrics. View the [metrics for your endpoint instances](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs), such as `CPUUtilization` or `MemoryUtilization` and check the sample count statistic for a 1 minute period. The count should be equal to the number of active instances. The following screenshot shows the `CPUUtilization` metric graphed in the CloudWatch console, where the **Statistic** is set to `Sample count`, the **Period** is set to `1 minute`, and the resulting count is 5.

![\[CloudWatch console showing the graph of the count of active instances for an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/cloudwatch-sample-count.png)
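The `DescribeEndpoint` approach can be sketched as follows (the helper name is illustrative):

```python
def current_instance_counts(sm, endpoint_name):
    """Return the current instance count for each production variant
    behind an endpoint, as reported by DescribeEndpoint."""
    desc = sm.describe_endpoint(EndpointName=endpoint_name)
    return {v["VariantName"]: v["CurrentInstanceCount"]
            for v in desc["ProductionVariants"]}

# Usage (requires AWS credentials and an in-service endpoint):
# import boto3
# counts = current_instance_counts(boto3.client("sagemaker"), "<endpoint-name>")
```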


## Q: What are the common tunable environment variables for SageMaker AI containers?
<a name="async-troubleshooting-q2"></a>

The following tables outline the common tunable environment variables for SageMaker AI containers by framework type.

**TensorFlow**


| Environment variable | Description | 
| --- | --- | 
|  `SAGEMAKER_TFS_INSTANCE_COUNT`  |  For TensorFlow-based models, the `tensorflow_model_server` binary is the operational piece that is responsible for loading a model in memory, running inputs against a model graph, and deriving outputs. Typically, a single instance of this binary is launched to serve models in an endpoint. This binary is internally multi-threaded and spawns multiple threads to respond to an inference request. In certain instances, if you observe that the CPU is well utilized (over 30% utilization) but the memory is underutilized (less than 10% utilization), increasing this parameter might help. Increasing the number of `tensorflow_model_server` processes available to serve typically increases the throughput of an endpoint.  | 
|  `SAGEMAKER_TFS_FRACTIONAL_GPU_MEM_MARGIN`  |  This parameter governs the fraction of the available GPU memory to initialize CUDA/cuDNN and other GPU libraries. `0.2` means 20% of the available GPU memory is reserved for initializing CUDA/cuDNN and other GPU libraries, and 80% of the available GPU memory is allocated equally across the TF processes. GPU memory is pre-allocated unless the `allow_growth` option is enabled.  | 
| `SAGEMAKER_TFS_INTER_OP_PARALLELISM` | This ties back to the `inter_op_parallelism_threads` variable. This variable determines the number of threads used by independent non-blocking operations. `0` means that the system picks an appropriate number. | 
| `SAGEMAKER_TFS_INTRA_OP_PARALLELISM` | This ties back to the `intra_op_parallelism_threads` variable. This determines the number of threads that can be used for certain operations like matrix multiplication and reductions for speedups. A value of `0` means that the system picks an appropriate number. | 
| `SAGEMAKER_GUNICORN_WORKERS` | This governs the number of worker processes that Gunicorn is requested to spawn for handling requests. This value is used in combination with other parameters to derive a set that maximizes inference throughput. | 
| `SAGEMAKER_GUNICORN_WORKER_CLASS` | This governs the type of worker processes that Gunicorn spawns to handle requests, typically `async` or `gevent`. | 
| `OMP_NUM_THREADS` | Python internally uses OpenMP for implementing multithreading within processes. Typically, threads equivalent to the number of CPU cores are spawned. But when implemented on top of Simultaneous Multi-Threading (SMT), such as Intel's Hyper-Threading, a certain process might oversubscribe a particular core by spawning twice as many threads as the number of actual CPU cores. In certain cases, a Python binary might end up spawning up to four times as many threads as available processor cores. Therefore, an ideal setting for this parameter, if you have oversubscribed available cores using worker threads, is `1`, or half the number of CPU cores on a CPU with SMT turned on. | 
|  `TF_DISABLE_MKL` `TF_DISABLE_POOL_ALLOCATOR`  | In some cases, turning off MKL can speed up inference if `TF_DISABLE_MKL` and `TF_DISABLE_POOL_ALLOCATOR` are set to `1`. | 

**PyTorch**


| Environment variable | Description | 
| --- | --- | 
|  `SAGEMAKER_TS_MAX_BATCH_DELAY`  |  The maximum time TorchServe waits to receive a full batch of requests before sending the requests it has received to the model handler.  | 
|  `SAGEMAKER_TS_BATCH_SIZE`  |  If TorchServe doesn’t receive the number of requests specified in `batch_size` before the timer runs out, it sends the requests that were received to the model handler.  | 
|  `SAGEMAKER_TS_MIN_WORKERS`  |  The minimum number of workers to which TorchServe is allowed to scale down.  | 
|  `SAGEMAKER_TS_MAX_WORKERS`  |  The maximum number of workers to which TorchServe is allowed to scale up.  | 
|  `SAGEMAKER_TS_RESPONSE_TIMEOUT`  |  The time delay, after which inference times out in absence of a response.  | 
|  `SAGEMAKER_TS_MAX_REQUEST_SIZE`  |  The maximum payload size for TorchServe.  | 
|  `SAGEMAKER_TS_MAX_RESPONSE_SIZE`  |  The maximum response size for TorchServe.  | 

**Multi Model Server (MMS)**


| Environment variable | Description | 
| --- | --- | 
|  `job_queue_size`  |  This parameter is useful to tune when you have a scenario where the type of the inference request payload is large, and due to the size of payload being larger, you may have higher heap memory consumption of the JVM in which this queue is being maintained. Ideally you might want to keep the heap memory requirements of JVM lower and allow Python workers to allot more memory for actual model serving. JVM is only for receiving the HTTP requests, queuing them, and dispatching them to the Python-based workers for inference. If you increase the `job_queue_size`, you might end up increasing the heap memory consumption of the JVM and ultimately taking away memory from the host that could have been used by Python workers. Therefore, exercise caution when tuning this parameter as well.  | 
|  `default_workers_per_model`  |  This parameter is for the backend model serving and might be valuable to tune since this is the critical component of the overall model serving, based on which the Python processes spawn threads for each Model. If this component is slower (or not tuned properly), the front-end tuning might not be effective.  | 
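These environment variables reach the serving container through the `Environment` map in the model's container definition. A minimal sketch of building a `CreateModel` request that sets them (all names and values are placeholders):

```python
def model_with_env(model_name, image_uri, model_data_url, role_arn, env):
    """Build a CreateModel request that passes tunable environment
    variables to the serving container."""
    return {
        "ModelName": model_name,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": env,  # e.g. {"SAGEMAKER_GUNICORN_WORKERS": "4"}
        },
        "ExecutionRoleArn": role_arn,
    }

# Usage (requires AWS credentials and real resource ARNs/URIs):
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(**model_with_env(
#     "<model-name>", "<image-uri>", "s3://<bucket>/model.tar.gz",
#     "<execution-role-arn>", {"SAGEMAKER_GUNICORN_WORKERS": "4"}))
```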

## Q: How do I make sure my container supports Asynchronous Inference?
<a name="async-troubleshooting-q3"></a>

You can use the same container for Asynchronous Inference that you do for Real-Time Inference or Batch Transform. You should confirm that the timeouts and payload size limits on your container are set to handle larger payloads and longer timeouts.

## Q: What are the limits specific to Asynchronous Inference, and can they be adjusted?
<a name="async-troubleshooting-q4"></a>

Refer to the following limits for Asynchronous Inference:
+ Payload size limit: 1 GB
+ Timeout limit: A request can take up to 60 minutes.
+ Queue message TimeToLive (TTL): 6 hours
+ Number of messages that can be put inside Amazon SQS: Unlimited. However, there is a quota of 120,000 for the number of in-flight messages for a standard queue, and 20,000 for a FIFO queue.

## Q: What metrics are best to define for autoscaling on Asynchronous Inference? Can I have multiple scaling policies?
<a name="async-troubleshooting-q5"></a>

In general, with Asynchronous Inference, you can scale out based on invocations or instances. For invocation metrics, it's a good idea to look at your `ApproximateBacklogSize`, which is a metric that defines the number of items in your queue that have yet to be processed. You can use this metric or your `InvocationsPerInstance` metric to understand the TPS at which you may be getting throttled. At the instance level, check your instance type and its CPU/GPU utilization to decide when to scale out. If a single instance is above 60-70% capacity, this is often a good sign that you are saturating your hardware.

We don't recommend having multiple scaling policies, as these can conflict and lead to confusion at the hardware level, causing delays when scaling out.

## Q: Why is my asynchronous endpoint terminating an instance as `Unhealthy` and the update requests from autoscaling are failing?
<a name="async-troubleshooting-q6"></a>

Check whether your container is able to handle ping and invoke requests concurrently. SageMaker AI invoke requests can take approximately 3 minutes, and in this duration, multiple ping requests often fail due to the timeout, causing SageMaker AI to detect your container as `Unhealthy`.

## Q: Can `MaxConcurrentInvocationsPerInstance` work for my BYOC model container with the nginx/gunicorn/flask settings?
<a name="async-troubleshooting-q7"></a>

Yes. `MaxConcurrentInvocationsPerInstance` is a feature of asynchronous endpoints. This does not depend on the custom container implementation. `MaxConcurrentInvocationsPerInstance` controls the rate at which invoke requests are sent to the customer container. If this value is set as `1`, then only 1 request is sent to the container at a time, no matter how many workers are on the customer container.
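As a sketch, you set this value in the `ClientConfig` section of `AsyncInferenceConfig` when creating the endpoint configuration (the helper name is illustrative):

```python
def async_inference_config(s3_output_path, max_concurrent_per_instance=1):
    """Build an AsyncInferenceConfig that caps the number of concurrent
    invoke requests sent to each instance's container."""
    return {
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance,
        },
        "OutputConfig": {"S3OutputPath": s3_output_path},
    }

# Usage: pass the result as AsyncInferenceConfig in a
# create_endpoint_config call (requires AWS credentials):
# sagemaker_client.create_endpoint_config(
#     EndpointConfigName="<endpoint-config-name>",
#     ProductionVariants=[...],
#     AsyncInferenceConfig=async_inference_config("s3://<bucket>/<output_directory>"))
```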

## Q: How can I debug model server errors (500) on my asynchronous endpoint?
<a name="async-troubleshooting-q8"></a>

The error means that the customer container returned an error. SageMaker AI does not control the behavior of customer containers. SageMaker AI simply returns the response from the `ModelContainer` and does not retry. If you want, you can configure the invocation to retry on failure. We suggest that you turn on container logging and check your container logs to find the root cause of the 500 error from your model. Check the corresponding `CPUUtilization` and `MemoryUtilization` metrics at the point of failure as well. You can also configure the [S3FailurePath](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AsyncInferenceOutputConfig.html#sagemaker-Type-AsyncInferenceOutputConfig-S3FailurePath) and review the model response there, along with the Amazon SNS error notifications, to investigate failures.

## Q: How can I know if `MaxConcurrentInvocationsPerInstance=1` takes effect? Are there any metrics that I can check?
<a name="async-troubleshooting-q9"></a>

You can check the metric `InvocationsProcessed`, which should align with the number of invocations that you expect to be processed in a minute based on single concurrency.

## Q: How can I track the success and failures of my invocation requests? What are the best practices?
<a name="async-troubleshooting-q10"></a>

The best practice is to enable Amazon SNS, a notification service for messaging-oriented applications. With Amazon SNS, multiple subscribers can request and receive "push" notifications of time-critical messages through a choice of transport protocols, including HTTP, Amazon SQS, and email. Asynchronous Inference posts notifications when you create an endpoint with `CreateEndpointConfig` and specify an Amazon SNS topic.

To use Amazon SNS to check prediction results from your asynchronous endpoint, you first need to create a topic, subscribe to the topic, confirm your subscription to the topic, and note the Amazon Resource Name (ARN) of that topic. For detailed information on how to create, subscribe, and find the Amazon ARN of an Amazon SNS topic, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html) in the *Amazon SNS Developer Guide*. For more information about how to use Amazon SNS with Asynchronous Inference, see [Check prediction results](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-check-predictions.html).

## Q: Can I define a scaling policy that scales up from zero instances upon receiving a new request?
<a name="async-troubleshooting-q11"></a>

Yes. Asynchronous Inference provides a mechanism to scale down to zero instances when there are no requests. If your endpoint has scaled down to zero instances, it won't scale up again until the number of requests in the queue exceeds the target specified in your scaling policy. This can result in long waiting times for requests in the queue. In such cases, if you want to scale up from zero instances even when the number of queued requests is below the specified target, you can use an additional scaling policy called `HasBacklogWithoutCapacity`. For more information about how to define this scaling policy, see [Autoscale an asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-autoscale.html#async-inference-autoscale-scale-up).
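As a rough sketch, a step-scaling policy driven by the `HasBacklogWithoutCapacity` metric, registered through Application Auto Scaling's `PutScalingPolicy`, might look like the following. The field values here are illustrative assumptions, not the authoritative example; see the linked page for the exact configuration and the companion CloudWatch alarm.

```
{
    "PolicyName": "HasBacklogWithoutCapacity-ScalingPolicy",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [
            { "MetricIntervalLowerBound": 0, "ScalingAdjustment": 1 }
        ]
    }
}
```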

## Q: I’m getting an error that the instance type is not supported for Asynchronous Inference. What are the instance types Asynchronous Inference supports?
<a name="async-troubleshooting-q12"></a>

For an exhaustive list of instances supported by Asynchronous Inference per region, see [SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/). Check if the required instance is available in your region before proceeding.

# Batch transform for inference with Amazon SageMaker AI
<a name="batch-transform"></a>

Use batch transform when you need to do the following: 
+ Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
+ Get inferences from large datasets.
+ Run inference when you don't need a persistent endpoint.
+ Associate input records with inferences to help with the interpretation of results.

To filter input data before performing inferences or to associate input records with inferences about those records, see [Associate Prediction Results with Input Records](batch-transform-data-processing.md). For example, you can filter input data to provide context for creating and interpreting reports about the output data.

**Topics**
+ [Use batch transform to get inferences from large datasets](#batch-transform-large-datasets)
+ [Speed up a batch transform job](#batch-transform-reduce-time)
+ [Use batch transform to test production variants](#batch-transform-test-variants)
+ [Batch transform sample notebooks](#batch-transform-notebooks)
+ [Associate Prediction Results with Input Records](batch-transform-data-processing.md)
+ [Storage in Batch Transform](batch-transform-storage.md)
+ [Troubleshooting](batch-transform-errors.md)

## Use batch transform to get inferences from large datasets
<a name="batch-transform-large-datasets"></a>

Batch transform automatically manages the processing of large datasets within the limits of specified parameters. For example, suppose you have a dataset file, `input1.csv`, stored in an S3 bucket. The content of the input file might look like the following example.

```
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
Record2-Attribute1, Record2-Attribute2, Record2-Attribute3, ..., Record2-AttributeM
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
...
RecordN-Attribute1, RecordN-Attribute2, RecordN-Attribute3, ..., RecordN-AttributeM
```

When a batch transform job starts, SageMaker AI starts compute instances and distributes the inference or preprocessing workload between them. Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. When you have multiple files, one instance might process `input1.csv`, and another instance might process the file named `input2.csv`. If you have one input file but initialize multiple compute instances, only one instance processes the input file. The rest of the instances are idle.

You can also split input files into mini-batches. For example, you might create a mini-batch from `input1.csv` by including only two of the records.

```
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
Record4-Attribute1, Record4-Attribute2, Record4-Attribute3, ..., Record4-AttributeM
```
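As an illustration only (SageMaker AI performs this partitioning server-side), line-based mini-batching under a payload cap can be sketched in Python. Here `max_payload_bytes` is a stand-in for the `MaxPayloadInMB` limit:

```python
def split_into_mini_batches(lines, max_payload_bytes):
    """Group newline-delimited records into mini-batches whose total size
    stays under max_payload_bytes (mimics SplitType=Line on one input file)."""
    batch, batch_size = [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline delimiter
        if batch and batch_size + size > max_payload_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(line)
        batch_size += size
    if batch:
        yield batch

records = [f"Record{i}-Attribute1, Record{i}-Attribute2" for i in range(1, 7)]
batches = list(split_into_mini_batches(records, max_payload_bytes=90))
```

With a 90-byte cap, each mini-batch holds two of the 39-byte records, so the six records above split into three mini-batches.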

**Note**  
SageMaker AI processes each input file separately. It doesn't combine mini-batches from different input files to comply with the [MaxPayloadInMB](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB) limit.

To split input files into mini-batches when you create a batch transform job, set the [SplitType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) parameter value to `Line`. SageMaker AI uses the entire input file in a single request when:
+ `SplitType` is set to `None`.
+ An input file can't be split into mini-batches.

Note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. You can control the size of the mini-batches by using the [BatchStrategy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-BatchStrategy) and [MaxPayloadInMB](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxPayloadInMB) parameters. `MaxPayloadInMB` must not be greater than 100 MB. If you specify the optional [MaxConcurrentTransforms](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxConcurrentTransforms) parameter, then the value of `(MaxConcurrentTransforms * MaxPayloadInMB)` must also not exceed 100 MB.
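These limits can be sanity-checked on the client side before submitting a job. The following is a hypothetical helper (not a SageMaker API) whose names mirror the `CreateTransformJob` request fields:

```python
def validate_transform_limits(max_payload_in_mb, max_concurrent_transforms=1):
    """Check the documented CreateTransformJob constraints:
    MaxPayloadInMB <= 100 and MaxConcurrentTransforms * MaxPayloadInMB <= 100."""
    if max_payload_in_mb > 100:
        raise ValueError("MaxPayloadInMB must not be greater than 100 MB")
    if max_concurrent_transforms * max_payload_in_mb > 100:
        raise ValueError(
            "MaxConcurrentTransforms * MaxPayloadInMB must not exceed 100 MB"
        )
    return True
```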

If the batch transform job successfully processes all of the records in an input file, it creates an output file. The output file has the same name and the `.out` file extension. For multiple input files, such as `input1.csv` and `input2.csv`, the output files are named `input1.csv.out` and `input2.csv.out`. The batch transform job stores the output files in the specified location in Amazon S3, such as `s3://amzn-s3-demo-bucket/output/`. 
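The naming convention can be sketched with a hypothetical helper (the bucket and prefix are the example values above):

```python
def output_key_for(input_filename, output_prefix="s3://amzn-s3-demo-bucket/output/"):
    """Map a batch transform input file name to its output object name:
    the job writes <input name>.out under the configured output location."""
    name = input_filename.rsplit("/", 1)[-1]
    return output_prefix + name + ".out"

print(output_key_for("input1.csv"))  # -> s3://amzn-s3-demo-bucket/output/input1.csv.out
```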

The predictions in an output file are listed in the same order as the corresponding records in the input file. The output file `input1.csv.out`, based on the input file shown earlier, would look like the following.

```
Inference1-Attribute1, Inference1-Attribute2, Inference1-Attribute3, ..., Inference1-AttributeM
Inference2-Attribute1, Inference2-Attribute2, Inference2-Attribute3, ..., Inference2-AttributeM
Inference3-Attribute1, Inference3-Attribute2, Inference3-Attribute3, ..., Inference3-AttributeM
...
InferenceN-Attribute1, InferenceN-Attribute2, InferenceN-Attribute3, ..., InferenceN-AttributeM
```

If you set [SplitType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) to `Line`, you can set the [AssembleWith](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-AssembleWith) parameter to `Line` to concatenate the output records with a line delimiter. This does not change the number of output files. The number of output files is equal to the number of input files, and using `AssembleWith` does not merge files. If you don't specify the `AssembleWith` parameter, the output records are concatenated in a binary format by default.

When the input data is very large and is transmitted using HTTP chunked encoding, set [MaxPayloadInMB](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB) to `0` to stream the data to the algorithm. Amazon SageMaker AI built-in algorithms don't support this feature.

For information about using the API to create a batch transform job, see the [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API. For more information about the relationship between batch transform input and output objects, see [OutputDataConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputDataConfig.html). For an example of how to use batch transform, see [(Optional) Make Prediction with Batch Transform](ex1-model-deployment.md#ex1-batch-transform).

## Speed up a batch transform job
<a name="batch-transform-reduce-time"></a>

If you are using the [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as [MaxPayloadInMB](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB), [MaxConcurrentTransforms](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxConcurrentTransforms), or [BatchStrategy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-BatchStrategy). The ideal value for `MaxConcurrentTransforms` is equal to the number of compute workers in the batch transform job.

If you are using the SageMaker AI console, specify these optimal parameter values in the **Additional configuration** section of the **Batch transform job configuration** page. SageMaker AI automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an [execution-parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containe-serves-requests) endpoint.

## Use batch transform to test production variants
<a name="batch-transform-test-variants"></a>

To test different models or hyperparameter settings, create a separate transform job for each new model variant and use a validation dataset. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use [Inference Pipeline Logs and Metrics](inference-pipeline-logs-metrics.md).

## Batch transform sample notebooks
<a name="batch-transform-notebooks"></a>

For a sample notebook that uses batch transform, see [Batch Transform with PCA and DBSCAN Movie Clusters](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.html). This notebook uses batch transform with a principal component analysis (PCA) model to reduce data in a user-item review matrix. It then shows the application of a density-based spatial clustering of applications with noise (DBSCAN) algorithm to cluster movies.

For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker Examples** tab to see a list of all the SageMaker AI examples. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# Associate Prediction Results with Input Records
<a name="batch-transform-data-processing"></a>

When making predictions on a large dataset, you can exclude attributes that aren't needed for prediction. After the predictions have been made, you can associate some of the excluded attributes with those predictions or with other input data in your report. By using batch transform to perform these data processing steps, you can often eliminate additional preprocessing or postprocessing. You can use input files in JSON and CSV format only. 

**Topics**
+ [Workflow for Associating Inferences with Input Records](#batch-transform-data-processing-workflow)
+ [Use Data Processing in Batch Transform Jobs](#batch-transform-data-processing-steps)
+ [Supported JSONPath Operators](#data-processing-operators)
+ [Batch Transform Examples](#batch-transform-data-processing-examples)

## Workflow for Associating Inferences with Input Records
<a name="batch-transform-data-processing-workflow"></a>

The following diagram shows the workflow for associating inferences with input records.

![\[The workflow for associating inferences with input records.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/batch-transform-data-processing.png)


To associate inferences with input data, there are three main steps:

1. Filter the input data that is not needed for inference before passing the input data to the batch transform job. Use the [InputFilter](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-InputFilter) parameter to determine which attributes to use as input for the model.

1. Associate the input data with the inference results. Use the [JoinSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-JoinSource) parameter to combine the input data with the inference.

1. Filter the joined data to retain the inputs that are needed to provide context for interpreting the predictions in the reports. Use [OutputFilter](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-OutputFilter) to store the specified portion of the joined dataset in the output file.

## Use Data Processing in Batch Transform Jobs
<a name="batch-transform-data-processing-steps"></a>

When creating a batch transform job with [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) to process data:

1. Specify the portion of the input to pass to the model with the `InputFilter` parameter in the `DataProcessing` data structure. 

1. Join the raw input data with the transformed data with the `JoinSource` parameter.

1. Specify which portion of the joined input and transformed data from the batch transform job to include in the output file with the `OutputFilter` parameter.

1.  Choose either JSON- or CSV-formatted files for input: 
   + For JSON- or JSON Lines-formatted input files, SageMaker AI either adds the `SageMakerOutput` attribute to the input file or creates a new JSON output file with the `SageMakerInput` and `SageMakerOutput` attributes. For more information, see [DataProcessing](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DataProcessing.html). 
   + For CSV-formatted input files, the joined input data is followed by the transformed data and the output is a CSV file.

If you use an algorithm with the `DataProcessing` structure, it must support your chosen format for *both* input and output files. For example, with the [TransformOutput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html) field of the `CreateTransformJob` API, you must set both the [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) and [Accept](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-Accept) parameters to one of the following values: `text/csv`, `application/json`, or `application/jsonlines`. The syntax for specifying columns in a CSV file and for specifying attributes in a JSON file is different. Using the wrong syntax causes an error. For more information, see [Batch Transform Examples](#batch-transform-data-processing-examples). For more information about input and output file formats for built-in algorithms, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

The record delimiters for the input and output must also be consistent with your chosen file format. The [SplitType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) parameter indicates how to split the records in the input dataset. The [AssembleWith](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-AssembleWith) parameter indicates how to reassemble the records for the output. If you set the input and output formats to `text/csv`, you must also set the `SplitType` and `AssembleWith` parameters to `Line`. If you set the input and output formats to `application/jsonlines`, you can set both `SplitType` and `AssembleWith` to `Line`.

For CSV files, you cannot use embedded newline characters. For JSON files, the attribute name `SageMakerOutput` is reserved for output. The JSON input file can't have an attribute with this name. If it does, the data in the input file might be overwritten. 

## Supported JSONPath Operators
<a name="data-processing-operators"></a>

To filter and join the input data and inference, use a JSONPath subexpression. SageMaker AI supports only a subset of the defined JSONPath operators. The following table lists the supported JSONPath operators. For CSV data, each row is treated as a JSON array, so only index-based JSONPaths can be applied, for example `$[0]` or `$[1:]`. CSV data must also follow the [RFC 4180](https://tools.ietf.org/html/rfc4180) format.


| JSONPath Operator | Description | Example | 
| --- | --- | --- | 
| `$` |  The root element of a query. This operator is required at the beginning of all path expressions.  | `$` | 
| `.<name>` |  A dot-notated child element.  |  `$.id`  | 
| `*` |  A wildcard. Use in place of an attribute name or numeric value.  |  `$.id.*`  | 
| `['<name>' (,'<name>')]` |  A bracket-notated element or multiple child elements.  |  `$['id','SageMakerOutput']`  | 
| `[<number> (,<number>)]` |  An index or array of indexes. Negative index values are also supported. A `-1` index refers to the last element in an array.  |  `$[1]`, `$[1,3,5]`  | 
| `[<start>:<end>]` |  An array slice operator. If you omit *<start>*, SageMaker AI uses the first element of the array. If you omit *<end>*, SageMaker AI uses the last element of the array.  |  `$[2:5]`, `$[:5]`, `$[2:]`  | 

When using the bracket-notation to specify multiple child elements of a given field, additional nesting of children within brackets is not supported. For example, `$.field1.['child1','child2']` is supported while `$.field1.['child1','child2.grandchild']` is not. 

For more information about JSONPath operators, see [JsonPath](https://github.com/json-path/JsonPath) on GitHub.
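For intuition, the index-based subset of these operators, as applied to a CSV row treated as a JSON array, can be approximated in Python. This is a sketch, not SageMaker AI's actual implementation, and it assumes the inclusive slice end described later on this page:

```python
import re

def apply_csv_jsonpath(row, expr):
    """Evaluate the index-based JSONPath subset on a CSV row (a list).
    Supports "$", "$[n,m,...]", and "$[start:end]", where the slice end
    is inclusive (unlike Python's slicing)."""
    if expr == "$":
        return list(row)
    match = re.fullmatch(r"\$\[([^\]]+)\]", expr)
    if not match:
        raise ValueError(f"unsupported expression: {expr}")
    body = match.group(1)
    if ":" in body:
        start_s, end_s = body.split(":")
        start = int(start_s) if start_s else 0
        end = int(end_s) + 1 if end_s else len(row)  # inclusive end
        return row[start:end]
    return [row[int(i)] for i in body.split(",")]

row = ["id-1", "f1", "f2", "f3", "f4"]
print(apply_csv_jsonpath(row, "$[1:]"))    # -> ['f1', 'f2', 'f3', 'f4']
print(apply_csv_jsonpath(row, "$[0,-1]"))  # -> ['id-1', 'f4']
```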

## Batch Transform Examples
<a name="batch-transform-data-processing-examples"></a>

The following examples show some common ways to join input data with prediction results.

**Topics**
+ [Example: Output Only Inferences](#batch-transform-data-processing-example-default)
+ [Example: Output Inferences Joined with Input Data](#batch-transform-data-processing-example-all)
+ [Example: Output Inferences Joined with Input Data and Exclude the ID Column from the Input (CSV)](#batch-transform-data-processing-example-select-csv)
+ [Example: Output Inferences Joined with an ID Column and Exclude the ID Column from the Input (CSV)](#batch-transform-data-processing-example-select-json)

### Example: Output Only Inferences
<a name="batch-transform-data-processing-example-default"></a>

By default, the [DataProcessing](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-DataProcessing) parameter doesn't join inference results with input. It outputs only the inference results.

If you want to explicitly specify that results are not joined with the input, use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and specify the following settings in a transformer call.

```
sm_transformer = sagemaker.transformer.Transformer(…)
sm_transformer.transform(…, input_filter="$", join_source="None", output_filter="$")
```

To output inferences using the AWS SDK for Python (Boto3), add the following code to your `CreateTransformJob` request. The following code mimics the default behavior.

```
{
    "DataProcessing": {
        "InputFilter": "$",
        "JoinSource": "None",
        "OutputFilter": "$"
    }
}
```

### Example: Output Inferences Joined with Input Data
<a name="batch-transform-data-processing-example-all"></a>

If you're using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to combine the input data with the inferences in the output file, specify the `assemble_with` and `accept` parameters when initializing the transformer object. When you use the transform call, specify `Input` for the `join_source` parameter, and specify the `split_type` and `content_type` parameters as well. The `split_type` parameter must have the same value as `assemble_with`, and the `content_type` parameter must have the same value as `accept`. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, join_source="Input", split_type="Line", content_type="text/csv")
```

If you're using the AWS SDK for Python (Boto 3), join all input data with the inference by adding the following code to your [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request. The values for `Accept` and `ContentType` must match, and the values for `AssembleWith` and `SplitType` must also match.

```
{
    "DataProcessing": {
        "JoinSource": "Input"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

For JSON or JSON Lines input files, the results are placed under the `SageMakerOutput` key. For example, if the input is a JSON file that contains the key-value pair `{"key":1}`, the data transform result might be `{"label":1}`. SageMaker AI stores both in the output file, with the inference nested under the `SageMakerOutput` key.

```
{
    "key":1,
    "SageMakerOutput":{"label":1}
}
```

**Note**  
The joined result for JSON must be a key-value pair object. If the input isn't a key-value pair object, SageMaker AI creates a new JSON file. In the new JSON file, the input data is stored in the `SageMakerInput` key and the results are stored as the `SageMakerOutput` value.
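The behavior in this note can be sketched with a hypothetical helper (this is an illustration, not SageMaker AI's implementation):

```python
def join_json_record(input_obj, inference):
    """Sketch of JoinSource=Input for a JSON record: a key-value input
    gains a SageMakerOutput attribute; any other input is wrapped in a
    new object under SageMakerInput."""
    if isinstance(input_obj, dict):
        joined = dict(input_obj)
        joined["SageMakerOutput"] = inference
        return joined
    return {"SageMakerInput": input_obj, "SageMakerOutput": inference}

joined = join_json_record({"key": 1}, {"label": 1})
print(joined)  # -> {'key': 1, 'SageMakerOutput': {'label': 1}}
```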

For a CSV file, for example, if the record is `[1,2,3]`, and the label result is `[1]`, then the output file would contain `[1,2,3,1]`.
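That CSV join behavior can be sketched end-to-end with a stand-in model (a hypothetical helper; `predict` plays the role of the model container):

```python
def transform_csv_record(record, predict, input_filter_from=0):
    """Mimic one record of a CSV batch transform with JoinSource=Input:
    the model sees only the filtered attributes, and the output row is
    the full input record followed by the inference values."""
    features = record[input_filter_from:]  # e.g. InputFilter="$[1:]" -> 1
    inference = predict(features)          # stand-in for the model container
    return record + inference              # the joined output row

# The record [1,2,3] with label result [1] yields [1,2,3,1], as above.
joined = transform_csv_record([1, 2, 3], predict=lambda features: [1])
print(joined)  # -> [1, 2, 3, 1]
```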

### Example: Output Inferences Joined with Input Data and Exclude the ID Column from the Input (CSV)
<a name="batch-transform-data-processing-example-select-csv"></a>

If you are using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to join your input data with the inference output while excluding an ID column from the transformer input, specify the same parameters from the preceding example as well as a JSONPath subexpression for the `input_filter` in your transformer call. For example, if your input data includes five columns and the first one is the ID column, use the following transform request to select all columns except the ID column as features. The transformer still outputs all of the input columns joined with the inferences. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, split_type="Line", content_type="text/csv", input_filter="$[1:]", join_source="Input")
```

If you are using the AWS SDK for Python (Boto 3), add the following code to your [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request.

```
{
    "DataProcessing": {
        "InputFilter": "$[1:]",
        "JoinSource": "Input"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

To specify columns in SageMaker AI, use the index of the array elements. The first column is index 0, the second column is index 1, and the sixth column is index 5.

To exclude the first column from the input, set [InputFilter](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-InputFilter) to `"$[1:]"`. The colon (`:`) tells SageMaker AI to include all of the elements between two values, inclusive. For example, `$[1:4]` specifies the second through fifth columns.

If you omit the number after the colon, for example, `$[5:]`, the subset includes all columns from the sixth column through the last column. If you omit the number before the colon, for example, `$[:5]`, the subset includes all columns from the first column (index 0) through the sixth column.

### Example: Output Inferences Joined with an ID Column and Exclude the ID Column from the Input (CSV)
<a name="batch-transform-data-processing-example-select-json"></a>

If you are using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), you can specify the output to join only specific input columns (such as the ID column) with the inferences by specifying the `output_filter` in the transformer call. The `output_filter` uses a JSONPath subexpression to specify which columns to return as output after joining the input data with the inference results. The following request shows how you can make predictions while excluding an ID column and then join the ID column with the inferences. Note that in the following example, the last column (`-1`) of the output contains the inferences. If you are using JSON files, SageMaker AI stores the inference results in the attribute `SageMakerOutput`. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, split_type="Line", content_type="text/csv", input_filter="$[1:]", join_source="Input", output_filter="$[0,-1]")
```

If you are using the AWS SDK for Python (Boto3), join only the ID column with the inferences by adding the following code to your [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request.

```
{
    "DataProcessing": {
        "InputFilter": "$[1:]",
        "JoinSource": "Input",
        "OutputFilter": "$[0,-1]"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

**Warning**  
If you are using a JSON-formatted input file, the file can't contain the attribute name `SageMakerOutput`. This attribute name is reserved for the inferences in the output file. If your JSON-formatted input file contains an attribute with this name, values in the input file might be overwritten with the inference.

# Storage in Batch Transform
<a name="batch-transform-storage"></a>

When you run a batch transform job, Amazon SageMaker AI attaches an Amazon Elastic Block Store storage volume to Amazon EC2 instances that process your job. The volume stores your model, and the size of the storage volume is fixed at 30 GB. You have the option to encrypt your model at rest in the storage volume.

**Note**  
If you have a large model, you may encounter an `InternalServerError`.

For more information about Amazon EBS storage and features, see the following pages:
+ [Amazon EBS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) in the Amazon EC2 User Guide
+ [Amazon EBS volumes](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes.html) in the Amazon EC2 User Guide

**Note**  
G4dn instances come with their own local SSD storage. To learn more about G4dn instances, see the [Amazon EC2 G4 Instances](https://aws.amazon.com/ec2/instance-types/g4/) page.

# Troubleshooting
<a name="batch-transform-errors"></a>

If you encounter errors in Amazon SageMaker AI Batch Transform, refer to the following troubleshooting tips.

## Max timeout errors
<a name="batch-transform-errors-max-timeout"></a>

If you are getting max timeout errors when running batch transform jobs, try the following:
+ Begin with the single-record `[BatchStrategy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-BatchStrategy)`, a batch size at the default (6 MB) or smaller, which you specify in the `[MaxPayloadInMB](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxPayloadInMB)` parameter, and a small sample dataset. Tune the maximum timeout parameter `[InvocationsTimeoutInSeconds](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelClientConfig.html#sagemaker-Type-ModelClientConfig-InvocationsTimeoutInSeconds)` (which has a maximum of 1 hour) until you receive a successful invocation response.
+ After you receive a successful invocation response, increase the `MaxPayloadInMB` (which has a maximum of 100 MB) and the `InvocationsTimeoutInSeconds` parameters together to find the maximum batch size that can support your desired model timeout. You can use either the single-record or multi-record `BatchStrategy` in this step.
**Note**  
Exceeding the `MaxPayloadInMB` limit causes an error. This might happen with a large dataset if it can't be split, the `SplitType` parameter is set to `None`, or individual records within the dataset exceed the limit.
+ (Optional) Tune the `[MaxConcurrentTransforms](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxConcurrentTransforms)` parameter, which specifies the maximum number of parallel requests that can be sent to each instance in a batch transform job. However, the value of `MaxConcurrentTransforms * MaxPayloadInMB` must not exceed 100 MB.
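The tuning steps above can be sketched as a single `CreateTransformJob` request. This is a hedged example: the job name, model name, S3 URIs, and instance type are placeholders, and only the batching and timeout parameters are the point.

```
# A hedged sketch of a CreateTransformJob request tuned per the steps above.
# All names and S3 URIs below are placeholders.
request = {
    "TransformJobName": "my-transform-job",
    "ModelName": "my-model",
    "BatchStrategy": "SingleRecord",  # start with single-record batches
    "MaxPayloadInMB": 6,              # the default; 100 MB maximum
    "MaxConcurrentTransforms": 1,     # product with MaxPayloadInMB must stay <= 100 MB
    "ModelClientConfig": {
        "InvocationsTimeoutInSeconds": 3600,  # 1 hour maximum
    },
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://amzn-s3-demo-bucket/input",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",
    },
    "TransformOutput": {"S3OutputPath": "s3://amzn-s3-demo-bucket/output"},
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
# boto3.client("sagemaker").create_transform_job(**request)
```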

## Incomplete output
<a name="batch-transform-errors-incomplete"></a>

SageMaker AI uses the Amazon S3 [Multipart Upload API](https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html) to upload results from a batch transform job to Amazon S3. If an error occurs, the uploaded results are removed from Amazon S3. In some cases, such as when a network outage occurs, an incomplete multipart upload might remain in Amazon S3. An incomplete upload might also occur if you have multiple input files but some of the files can’t be processed by SageMaker AI Batch Transform. The input files that couldn’t be processed won’t have corresponding output files in Amazon S3.

To avoid incurring storage charges, we recommend that you add the [S3 bucket policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config) to the S3 bucket lifecycle rules. This policy deletes incomplete multipart uploads that might be stored in the S3 bucket. For more information, see [Object Lifecycle Management](https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html).
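As a sketch, such a lifecycle rule can be applied with Boto3. The bucket name and the seven-day window below are illustrative choices, not required values:

```
# A hedged sketch of a lifecycle rule that deletes incomplete multipart
# uploads seven days after initiation. The bucket name is a placeholder.
lifecycle_config = {
    "Rules": [{
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]
}
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="amzn-s3-demo-bucket",
#     LifecycleConfiguration=lifecycle_config,
# )
```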

## Job shows as `failed`
<a name="batch-transform-errors-failed"></a>

If a batch transform job fails to process an input file because of a problem with the dataset, SageMaker AI marks the job as `failed`. If an input file contains a bad record, the transform job doesn't create an output file for that input file because doing so prevents it from maintaining the same order in the transformed data as in the input file. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate usable results.

If you are using your own algorithms, you can use placeholder text, such as `ERROR`, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm places the placeholder text for that record in the output file.
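If your algorithm writes such a placeholder, downstream consumers can filter it out when reading the output. A minimal sketch, assuming `ERROR` is the placeholder text:

```
# A minimal sketch: drop placeholder rows from a batch transform output file.
# Assumes the algorithm wrote the literal string "ERROR" for bad records.
output_rows = ["0.91", "0.13", "ERROR", "0.77"]
predictions = [row for row in output_rows if row != "ERROR"]
print(predictions)  # ['0.91', '0.13', '0.77']
```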

# Model parallelism and large model inference
<a name="large-model-inference"></a>

 Amazon SageMaker AI includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI). In the following sections, you can find resources to get started with LMI on SageMaker AI. 

**Topics**
+ [The large model inference (LMI) container documentation](large-model-inference-container-docs.md)
+ [SageMaker AI endpoint parameters for large model inference](large-model-inference-hosting.md)
+ [Deploying uncompressed models](large-model-inference-uncompressed.md)
+ [Deploy large models for inference with TorchServe](large-model-inference-tutorials-torchserve.md)

# The large model inference (LMI) container documentation
<a name="large-model-inference-container-docs"></a>

The [Large Model Inference (LMI) container documentation](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html) is provided on the Deep Java Library documentation site. 

The documentation is written for developers, data scientists, and machine learning engineers who need to deploy and optimize large language models (LLMs) on Amazon SageMaker AI. It helps you use LMI containers, which are specialized Docker containers for LLM inference, provided by AWS. It provides an overview, deployment guides, user guides for supported inference libraries, and advanced tutorials.

By using the LMI container documentation, you can:
+ Understand the components and architecture of LMI containers
+ Learn how to select the appropriate instance type and backend for your use case
+ Configure and deploy LLMs on SageMaker AI using LMI containers
+ Optimize performance by using features like quantization, tensor parallelism, and continuous batching
+ Benchmark and tune your SageMaker AI endpoints for optimal throughput and latency

# SageMaker AI endpoint parameters for large model inference
<a name="large-model-inference-hosting"></a>

 You can customize the following parameters to facilitate low-latency large model inference (LMI) with SageMaker AI: 
+  **Maximum Amazon EBS volume size on the instance (`VolumeSizeInGB`)** – If the size of the model is larger than 30 GB and you are using an instance without a local disk, you should increase this parameter to be slightly larger than the size of your model. 
+  **Health check timeout quota (`ContainerStartupHealthCheckTimeoutInSeconds`)** – If your container is correctly set up and the CloudWatch logs indicate a health check timeout, you should increase this quota so the container has enough time to respond to health checks. 
+  **Model download timeout quota (`ModelDataDownloadTimeoutInSeconds`)** – If the size of your model is larger than 40 GB, then you should increase this quota to provide sufficient time to download the model from Amazon S3 to the instance. 

The following code snippet demonstrates how to programmatically configure the aforementioned parameters. Replace the *italicized placeholder text* in the example with your own information. 

```
import boto3

aws_region = "aws-region"
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = "endpoint-name"

# Create an endpoint config name.
endpoint_config_name = "endpoint-config-name"

# The name of the model that you want to host.
model_name = "the-name-of-your-model"

instance_type = "instance-type"

sagemaker_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name,
            "InstanceType": instance_type, # Specify the compute instance type.
            "InitialInstanceCount": 1, # Number of instances to launch initially.
            "VolumeSizeInGB": 256, # Specify the size of the Amazon EBS volume.
            "ModelDataDownloadTimeoutInSeconds": 1800, # Specify the model download timeout in seconds.
            "ContainerStartupHealthCheckTimeoutInSeconds": 1800, # Specify the health checkup timeout in seconds
        },
    ],
)

sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
```

 For more information about the keys for `ProductionVariants`, see [ProductionVariant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). 

For examples that demonstrate how to achieve low latency inference with large models, see [Generative AI Inference Examples on Amazon SageMaker AI](https://github.com/aws-samples/sagemaker-genai-hosting-examples/tree/main) in the aws-samples GitHub repository. 

# Deploying uncompressed models
<a name="large-model-inference-uncompressed"></a>

 When deploying ML models, one option is to archive and compress the model artifacts into a `tar.gz` format. Although this method works well for small models, compressing a large model artifact with hundreds of billions of parameters and then decompressing it on an endpoint can take a significant amount of time. For large model inference, we recommend that you deploy uncompressed ML models. This guide shows how you can deploy an uncompressed ML model. 

 To deploy uncompressed ML models, upload all model artifacts to Amazon S3 and organize them under a common Amazon S3 prefix. An Amazon S3 prefix is a string of characters at the beginning of an Amazon S3 object key name, separated from the rest of the name by a delimiter. For more information about Amazon S3 prefixes, see [Organizing objects using prefixes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html). 

 For deploying with SageMaker AI, you must use a slash (/) as the delimiter. Ensure that only the artifacts associated with your ML model are organized under the prefix. For ML models with a single uncompressed artifact, the prefix is identical to the key name. You can check which objects are associated with your prefix with the AWS CLI: 

```
aws s3 ls --recursive s3://bucket/prefix
```

 After uploading the model artifacts to Amazon S3 and organizing them under a common prefix, you can specify their location as part of the [ModelDataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelDataSource.html) field when you invoke the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request. SageMaker AI will automatically download the uncompressed model artifacts to `/opt/ml/model` for inference. For more information about the rules that SageMaker AI uses when downloading the artifacts, see [S3ModelDataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3ModelDataSource.html). 

 The following code snippet shows how you can invoke the `CreateModel` API when deploying an uncompressed model. Replace the *italicized user text* with your own information. 

```
model_name = "model-name"
sagemaker_role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
container = "123456789012.dkr.ecr.us-west-2.amazonaws.com/inference-image:latest"

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        "Image": container,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": "s3://amzn-s3-demo-bucket/prefix/to/model/data/", 
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
    },
)
```

 The aforementioned example assumes that your model artifacts are organized under a common prefix. If instead your model artifact is a single uncompressed Amazon S3 object, then change `"S3Uri"` to point to the Amazon S3 object, and change `"S3DataType"` to `"S3Object"`. 
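For reference, a hedged sketch of the single-object variant of `ModelDataSource`; the URI below is a placeholder:

```
# A hedged sketch of ModelDataSource for a single uncompressed S3 object,
# in contrast to the S3Prefix example above. The URI is a placeholder.
model_data_source = {
    "S3DataSource": {
        "S3Uri": "s3://amzn-s3-demo-bucket/path/to/model.bin",
        "S3DataType": "S3Object",
        "CompressionType": "None",
    },
}
```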

**Note**  
 Currently you cannot use `ModelDataSource` with AWS Marketplace, SageMaker AI batch transform, SageMaker Serverless Inference endpoints, and SageMaker multi-model endpoints. 

# Deploy large models for inference with TorchServe
<a name="large-model-inference-tutorials-torchserve"></a>

This tutorial demonstrates how to deploy large models and serve inference in Amazon SageMaker AI with TorchServe on GPUs. This example deploys the [OPT-30b](https://huggingface.co/facebook/opt-30b) model to an `ml.g5` instance. You can modify this to work with other models and instance types. Replace the `italicized placeholder text` in the examples with your own information.

TorchServe is a powerful open platform for large distributed model inference. By supporting popular libraries like PyTorch, native PiPPy, DeepSpeed, and HuggingFace Accelerate, it offers uniform handler APIs that remain consistent across distributed large model and non-distributed model inference scenarios. For more information, see [TorchServe’s large model inference documentation](https://pytorch.org/serve/large_model_inference.html#).

## Deep learning containers with TorchServe
<a name="large-model-inference-tutorials-torchserve-dlcs"></a>

To deploy a large model with TorchServe on SageMaker AI, you can use one of the SageMaker AI deep learning containers (DLCs). By default, TorchServe is installed in all AWS PyTorch DLCs. During model loading, TorchServe can install specialized libraries tailored for large models such as PiPPy, Deepspeed, and Accelerate.

The following table lists all of the [SageMaker AI DLCs with TorchServe](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).


| DLC category | Framework | Hardware | Example URL | 
| --- | --- | --- | --- | 
| [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) |  PyTorch 2.0.1  | CPU, GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [SageMaker AI Framework Graviton Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-graviton-containers-sm-support-only) |  PyTorch 2.0.1  | CPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.1-cpu-py310-ubuntu20.04-sagemaker | 
| [StabilityAI Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#stabilityai-inference-containers) |  PyTorch 2.0.1  | GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/stabilityai-pytorch-inference:2.0.1-sgm0.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) | PyTorch 1.13.1 | Neuronx | 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04 | 

## Getting started
<a name="large-model-inference-tutorials-torchserve-getting-started"></a>

Before deploying your model, complete the prerequisites. You can also configure your model parameters and customize the handler code.

### Prerequisites
<a name="large-model-inference-tutorials-torchserve-getting-started-prereqs"></a>

To get started, ensure that you have the following prerequisites:

1. Ensure you have access to an AWS account. [Set up your environment](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AWSServiceRoleForAmazonEKSNodegroup](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSServiceRoleForAmazonEKSNodegroup)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

   For more information about attaching IAM policies to a role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS IAM User Guide*.

1. Locally configure your dependencies, as shown in the following examples.

   1. Install version 2 of the AWS CLI:

      ```
      # Install the latest AWS CLI v2 if it is not installed
      !curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      !unzip awscliv2.zip
      # Follow the instructions to install v2 on the terminal
      !cat aws/README.md
      ```

   1. Install SageMaker AI and the Boto3 client:

      ```
      # If already installed, update your client
      #%pip install sagemaker pip --upgrade --quiet
      !pip install -U sagemaker
      !pip install -U boto
      !pip install -U botocore
      !pip install -U boto3
      ```

### Configure model settings and parameters
<a name="large-model-inference-tutorials-torchserve-getting-started-config"></a>

TorchServe uses [torchrun](https://pytorch.org/docs/stable/elastic/run.html) to set up the distributed environment for model parallel processing. TorchServe can support multiple workers for a large model. By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. In the case of large model inference, the number of GPUs assigned to each worker is automatically calculated based on the number of GPUs specified in the `model_config.yaml` file. The environment variable `CUDA_VISIBLE_DEVICES`, which specifies the GPU device IDs that are visible at a given time, is set based on this number.

For example, suppose there are 8 GPUs on a node and one worker needs 4 GPUs (`nproc_per_node=4`). In this case, TorchServe assigns four GPUs to the first worker (`CUDA_VISIBLE_DEVICES="0,1,2,3"`) and four GPUs to the second worker (`CUDA_VISIBLE_DEVICES="4,5,6,7"`).

In addition to this default behavior, TorchServe provides the flexibility for users to specify GPUs for a worker. For instance, if you set the variable `deviceIds: [2,3,4,5]` in the [model config YAML file](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/model-archiver/README.md?plain=1#L164), and set `nproc_per_node=2`, then TorchServe assigns `CUDA_VISIBLE_DEVICES="2,3"` to the first worker and `CUDA_VISIBLE_DEVICES="4,5"` to the second worker.
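The default round-robin assignment can be sketched in a few lines. This is an illustration of the behavior described above, not TorchServe source code:

```
# An illustrative sketch (not TorchServe source) of round-robin GPU
# assignment: each worker gets nproc_per_node consecutive device IDs.
def assign_gpus(total_gpus, nproc_per_node):
    return [
        ",".join(str(i) for i in range(start, start + nproc_per_node))
        for start in range(0, total_gpus, nproc_per_node)
    ]

# One CUDA_VISIBLE_DEVICES value per worker:
print(assign_gpus(8, 4))  # ['0,1,2,3', '4,5,6,7']
```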

In the following `model_config.yaml` example, we configure both front-end and back-end parameters for the [OPT-30b](https://huggingface.co/facebook/opt-30b) model. The configured front-end parameters are `parallelType`, `deviceType`, `deviceIds`, and `torchrun`. For more detailed information about the front-end parameters you can configure, see the [PyTorch GitHub documentation](https://github.com/pytorch/serve/blob/2bf505bae3046b0f7d0900727ec36e611bb5dca3/docs/configuration.md?plain=1#L267). The back-end configuration is based on a YAML map that allows for free-style customization. For the back-end parameters, we define the DeepSpeed configuration and additional parameters used by custom handler code.

```
# TorchServe front-end parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # sets CUDA_VISIBLE_DEVICES

torchrun:
    nproc-per-node: 4

# TorchServe back-end parameters
deepspeed:
    config: ds-config.json
    checkpoint: checkpoints.json

handler: # parameters for custom handler code
    model_name: "facebook/opt-30b"
    model_path: "model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    max_length: 50
    max_new_tokens: 10
    manual_seed: 40
```

### Customize handlers
<a name="large-model-inference-tutorials-torchserve-getting-started-handlers"></a>

TorchServe offers [base handlers](https://github.com/pytorch/serve/tree/master/ts/torch_handler/distributed) and [handler utilities](https://github.com/pytorch/serve/tree/master/ts/handler_utils) for large model inference built with popular libraries. The following example demonstrates how the custom handler class [TransformersSeqClassifierHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/examples/large_models/deepspeed/custom_handler.py#L16C7-L16C39) extends [BaseDeepSpeedHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/ts/torch_handler/distributed/base_deepspeed_handler.py#L8) and uses the [handler utilities](https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py). For a full code example, see the [`custom_handler.py` code on the PyTorch GitHub documentation](https://github.com/pytorch/serve/blob/master/examples/large_models/deepspeed/custom_handler.py).

```
class TransformersSeqClassifierHandler(BaseDeepSpeedHandler, ABC):
    """
    Transformers handler class for sequence, token classification and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.max_length = None
        self.max_new_tokens = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the HF large model is loaded and
        partitioned using DeepSpeed.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        super().initialize(ctx)
        model_dir = ctx.system_properties.get("model_dir")
        self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        logger.info("Model %s loading tokenizer", ctx.model_name)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            self.model = AutoModelForCausalLM.from_config(
                config, torch_dtype=torch.float16
            )
        self.model = self.model.eval()

        ds_engine = get_ds_engine(self.model, ctx)
        self.model = ds_engine.module
        logger.info("Model %s loaded successfully", ctx.model_name)
        self.initialized = True

    def preprocess(self, requests):
        """
        Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (list): A list of dictionaries with a "data" or "body" field, each
                            containing the input text to be processed.
        Returns:
            tuple: A tuple with two tensors: the batch of input ids and the batch of
                attention masks.
        """

    def inference(self, input_batch):
        """
        Predicts the class (or classes) of the received text using the serialized transformers
        checkpoint.
        Args:
            input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
                                of attention masks, as returned by the preprocess function.
        Returns:
            list: A list of strings with the predicted values for each input text in the batch.
        """
        
    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the Predictions and Explanations.
        """
```

## Prepare your model artifacts
<a name="large-model-inference-tutorials-torchserve-artifacts"></a>

Before deploying your model on SageMaker AI, you must package your model artifacts. For large models, we recommend that you use the PyTorch [torch-model-archiver](https://github.com/pytorch/serve/blob/master/model-archiver/README.md) tool with the argument `--archive-format no-archive`, which skips compressing model artifacts. The following example saves all of the model artifacts to a new folder named `opt/`.

```
torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format no-archive
```

Once the `opt/` folder is created, download the OPT-30b model to the folder using the PyTorch [Download_model.py](https://github.com/pytorch/serve/blob/master/examples/large_models/utils/Download_model.py) tool.

```
cd opt
python path_to/Download_model.py --model_path model --model_name facebook/opt-30b --revision main
```

Lastly, upload the model artifacts to an Amazon S3 bucket. 

```
aws s3 cp opt {your_s3_bucket}/opt --recursive
```

You should now have model artifacts stored in Amazon S3 that are ready to deploy to a SageMaker AI endpoint.

## Deploy the model using the SageMaker Python SDK
<a name="large-model-inference-tutorials-torchserve-deploy"></a>

After preparing your model artifacts, you can deploy your model to a SageMaker AI Hosting endpoint. This section describes how to deploy a single large model to an endpoint and make streaming response predictions. For more information about streaming responses from endpoints, see [Invoke real-time endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

To deploy your model, complete the following steps:

1. Create a SageMaker AI session, as shown in the following example.

   ```
   import boto3
   import sagemaker
   from sagemaker import Model, image_uris, serializers, deserializers
   
   boto3_session=boto3.session.Session(region_name="us-west-2")
   smr = boto3.client('sagemaker-runtime')
   sm = boto3.client('sagemaker')
   role = sagemaker.get_execution_role()  # execution role for the endpoint
   sess= sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # SageMaker AI session for interacting with different AWS APIs
   region = sess._region_name  # region name of the current SageMaker Studio Classic environment
   account = sess.account_id()  # account_id of the current SageMaker Studio Classic environment
   
   # Configuration:
   bucket_name = sess.default_bucket()
   prefix = "torchserve"
   output_path = f"s3://{bucket_name}/{prefix}"
   print(f'account={account}, region={region}, role={role}, output_path={output_path}')
   ```

1. Create an uncompressed model in SageMaker AI, as shown in the following example.

   ```
   from datetime import datetime
   
   instance_type = "ml.g5.24xlarge"
   endpoint_name = sagemaker.utils.name_from_base("ts-opt-30b")
   s3_uri = "{your_s3_bucket}/opt"
   container = "{your_container_image_uri}"  # a TorchServe DLC image URI, such as one from the table above
   
   model = Model(
       name="torchserve-opt-30b" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
       # Enable SageMaker uncompressed model artifacts
       model_data={
           "S3DataSource": {
                   "S3Uri": s3_uri,
                   "S3DataType": "S3Prefix",
                   "CompressionType": "None",
           }
       },
       image_uri=container,
       role=role,
       sagemaker_session=sess,
       env={"TS_INSTALL_PY_DEP_PER_MODEL": "true"},
   )
   print(model)
   ```

1. Deploy the model to an Amazon EC2 instance, as shown in the following example.

   ```
   model.deploy(
       initial_instance_count=1,
       instance_type=instance_type,
       endpoint_name=endpoint_name,
       volume_size=512, # increase the size to store large model
       model_data_download_timeout=3600, # increase the timeout to download large model
       container_startup_health_check_timeout=600, # increase the timeout to load large model
   )
   ```

1. Initialize a class to process the streaming response, as shown in the following example.

   ```
   import io
   
   class Parser:
       """
       A helper class for parsing the byte stream input.

       The output of the model is a stream of newline-delimited JSON objects,
       for example:

           b'{"outputs": [" a"]}\n'
           b'{"outputs": [" challenging"]}\n'
           b'{"outputs": [" problem"]}\n'

       Each PayloadPart event from the event stream usually contains a byte
       array with a full JSON object, but this is not guaranteed: a JSON
       object may be split across PayloadPart events. For example:

           {'PayloadPart': {'Bytes': b'{"outputs": '}}
           {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}

       This class accounts for this by concatenating bytes written via the
       'write' method and exposing the 'scan_lines' method, which yields only
       complete lines (ending with a '\n' character) from the buffer. It
       tracks the position of the last read so that previous bytes are not
       yielded again.
       """

       def __init__(self):
           self.buff = io.BytesIO()
           self.read_pos = 0

       def write(self, content):
           self.buff.seek(0, io.SEEK_END)
           self.buff.write(content)

       def scan_lines(self):
           self.buff.seek(self.read_pos)
           for line in self.buff.readlines():
               # Yield only complete lines; an incomplete trailing line stays
               # buffered until the rest of it arrives.
               if line[-1:] == b'\n':
                   self.read_pos += len(line)
                   yield line[:-1]

       def reset(self):
           self.read_pos = 0
   ```

1. Test a streaming response prediction, as shown in the following example.

   ```
   import json
   import boto3

   smr = boto3.client("sagemaker-runtime")  # SageMaker Runtime client, if not created earlier
   
   body = "Today the weather is really nice and I am planning on".encode('utf-8')
   resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=body, ContentType="application/json")
   event_stream = resp['Body']
   parser = Parser()
   for event in event_stream:
       parser.write(event['PayloadPart']['Bytes'])
       for line in parser.scan_lines():
           print(line.decode("utf-8"), end=' ')
   ```

You have now deployed your model to a SageMaker AI endpoint and should be able to invoke it for responses. For more information about SageMaker AI real-time endpoints, see [Single-model endpoints](realtime-single-model.md).

# Deployment guardrails for updating models in production
<a name="deployment-guardrails"></a>

Deployment guardrails are a set of model deployment options in Amazon SageMaker AI Inference to update your machine learning models in production. Using the fully managed deployment options, you can control the switch from the current model in production to a new one. Traffic shifting modes in blue/green deployments, such as canary and linear, give you granular control over the traffic shifting process from your current model to the new one during the course of the update. There are also built-in safeguards such as auto-rollbacks that help you catch issues early and automatically take corrective action before they significantly impact production.

Deployment guardrails provide the following benefits:
+ **Deployment safety while updating production environments.** A regressive update to a production environment can cause unplanned downtime and business impact, such as increased model latency and high error rates. Deployment guardrails help you mitigate those risks by providing best practices and built-in operational safety guardrails.
+ **Fully managed deployment.** SageMaker AI takes care of setting up and orchestrating these deployments and integrates them with endpoint update mechanisms. You do not need to build and maintain orchestration, monitoring, or rollback mechanisms. You can leverage SageMaker AI to set up and orchestrate these deployments and focus on leveraging ML for your applications.
+ **Visibility.** You can track the progress of your deployment through the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API or through Amazon CloudWatch Events (for [supported endpoints](deployment-guardrails-exclusions.md)). To learn more about events in SageMaker AI, see the Endpoint deployment state change section in [Events that Amazon SageMaker AI sends to Amazon EventBridge](automating-sagemaker-with-eventbridge.md). Note that if your endpoint uses any of the features in the [Exclusions](deployment-guardrails-exclusions.md) page, you cannot use CloudWatch Events.

**Note**  
Deployment guardrails only apply to [Asynchronous inference](async-inference.md) and [Real-time inference](realtime-endpoints.md) endpoint types.
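
The deployment-progress tracking described above can be sketched in code. The following Python snippet is illustrative, not an official utility: it polls the `DescribeEndpoint` API until the endpoint leaves a transitional status, using the `EndpointStatus` values from the API reference.

```python
import time

# Endpoint statuses that indicate a deployment is still in progress
TRANSITIONAL_STATUSES = {"Creating", "Updating", "SystemUpdating", "RollingBack"}

def is_terminal(status):
    """Return True once the endpoint has settled (e.g. InService or Failed)."""
    return status not in TRANSITIONAL_STATUSES

def wait_for_deployment(endpoint_name, poll_seconds=30):
    """Poll DescribeEndpoint until the update completes or rolls back.
    Requires AWS credentials and the boto3 SDK at call time."""
    import boto3
    sm = boto3.client("sagemaker")
    while True:
        desc = sm.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        print(status, desc.get("FailureReason", ""))
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

For supported endpoints, you can subscribe to the same state changes through Amazon EventBridge instead of polling.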

## How to get started
<a name="deployment-guardrails-get-started"></a>

We support two types of deployments to update models in production: blue/green deployments and rolling deployments.
+ [Blue/Green Deployments](deployment-guardrails-blue-green.md): You can shift traffic from your old fleet (the blue fleet) to a new fleet (green fleet) with the updates. Blue/green deployments offer [multiple traffic shifting modes](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-blue-green.html). A traffic shifting mode is a configuration that specifies how SageMaker AI routes endpoint traffic to a new fleet containing your updates. The following traffic shifting modes provide you with different levels of control over the endpoint update process:
  + [Use all at once traffic shifting](deployment-guardrails-blue-green-all-at-once.md) shifts all of your endpoint traffic from the blue fleet to the green fleet. Once the traffic shifts to the green fleet, your pre-specified Amazon CloudWatch alarms begin monitoring the green fleet for a set amount of time (the *baking period*). If no alarms trip during the baking period, then SageMaker AI terminates the blue fleet.
  + [Use canary traffic shifting](deployment-guardrails-blue-green-canary.md) shifts one small portion of your traffic (a *canary*) to the green fleet and monitor it for a baking period. If the canary succeeds on the green fleet, then SageMaker AI shifts the rest of the traffic from the blue fleet to the green fleet before terminating the blue fleet.
  + [Use linear traffic shifting](deployment-guardrails-blue-green-linear.md) provides even more customization over the number of traffic-shifting steps and the percentage of traffic to shift for each step. While canary shifting lets you shift traffic in two steps, linear shifting extends this to *n* linearly spaced steps.
+ [Use rolling deployments](deployment-guardrails-rolling.md): You can update your endpoint as SageMaker AI incrementally provisions capacity and shifts traffic to a new fleet in steps of a batch size that you specify. Instances on the new fleet are updated with the new deployment configuration, and if no CloudWatch alarms trip during the baking period, then SageMaker AI cleans up instances on the old fleet. This option gives you granular control over the instance count or capacity percentage shifted during each step.

You can create and manage your deployment through the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) and [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) SageMaker API and AWS Command Line Interface commands. See the individual deployment pages for more details on how to set up your deployment. Note that if your endpoint uses any of the features listed in the [Exclusions](deployment-guardrails-exclusions.md) page, you cannot use deployment guardrails.

To follow guided examples that show how to use deployment guardrails, see our example [Jupyter notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-inference-deployment-guardrails) for the canary and linear traffic shifting modes.

# Auto-Rollback Configuration and Monitoring
<a name="deployment-guardrails-configuration"></a>

Amazon CloudWatch alarms are a prerequisite for using baking periods in deployment guardrails. You can only use the auto-rollback functionality in deployment guardrails if you set up CloudWatch alarms that can monitor an endpoint. If any of your alarms trip during the specified monitoring period, SageMaker AI initiates a complete rollback to the old endpoint to protect your application. If you do not have any CloudWatch alarms set up to monitor your endpoint, then the auto-rollback functionality does not work during your deployment.

To learn more about Amazon CloudWatch, see [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) in the *Amazon CloudWatch User Guide*.

**Note**  
Ensure that your IAM execution role has permission to perform the `cloudwatch:DescribeAlarms` action on the auto-rollback alarms you specify.

## Alarm Examples
<a name="deployment-guardrails-configuration-alarm-examples"></a>

To help you get started, we provide the following examples to demonstrate the capabilities of CloudWatch alarms. In addition to using or modifying the following examples, you can create your own alarms and configure the alarms to monitor various metrics on the specified fleets for a certain period of time. To see more SageMaker AI metrics and dimensions you can add to your alarms, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md).

**Topics**
+ [Monitor invocation errors on both old and new fleets](#deployment-guardrails-configuration-alarm-examples-errors-both)
+ [Monitor model latency on the new fleet](#deployment-guardrails-configuration-alarm-examples-latency-new)

### Monitor invocation errors on both old and new fleets
<a name="deployment-guardrails-configuration-alarm-examples-errors-both"></a>

The following CloudWatch alarm monitors an endpoint's average error rate. You can use this alarm with any deployment guardrails traffic shifting type to provide overall monitoring on both the old and new fleets. If the alarm trips, then SageMaker AI initiates a rollback to the old fleet.

Invocation errors coming from both the old fleet and new fleet contribute to the average error rate. If the average error rate exceeds the specified threshold, then the alarm trips. This particular example monitors the 4xx errors (client errors) on both the old and new fleets for the duration of a deployment. You can also monitor the 5xx errors (server errors) by using the metric `Invocation5XXErrors`.

**Note**  
For this alarm type, if your old fleet trips the alarm during the deployment, SageMaker AI terminates your deployment. Therefore, if your current production fleet already produces errors, consider creating an alarm that monitors only the new fleet, such as a modified version of the following model latency example.

```
#Applied deployment type: all types
{
    "AlarmName": "EndToEndDeploymentHighErrorRateAlarm",
    "AlarmDescription": "Monitors the error rate of 4xx errors",
    "MetricName": "Invocation4XXErrors",
    "Namespace": "AWS/SageMaker",
    "Statistic": "Average",
    "Dimensions": [
        {
            "Name": "EndpointName",
            "Value": "<your-endpoint-name>"
        },
        {
            "Name": "VariantName",
            "Value": "AllTraffic"
        }
    ],
    "Period": 600,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching"
}
```

In the previous example, note the values for the following fields:
+ For `AlarmName` and `AlarmDescription`, enter a name and description you choose for the alarm.
+ For `MetricName`, use the value `Invocation4XXErrors` to monitor for 4xx errors on the endpoint.
+ For `Namespace`, use the value `AWS/SageMaker`. You can also specify your own custom metric, if applicable.
+ For `Statistic`, use `Average`. This means that the alarm takes the average error rate over the evaluation periods when calculating whether the error rate has exceeded the threshold.
+ For the dimension `EndpointName`, use the name of the endpoint you are updating as the value.
+ For the dimension `VariantName`, use the value `AllTraffic` to specify all endpoint traffic.
+ For `Period`, use `600`. This sets the alarm’s evaluation periods to 10 minutes long.
+ For `EvaluationPeriods`, use `2`. This value tells the alarm to consider the two most recent evaluation periods when determining the alarm status.
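
The alarm definition above can also be created programmatically. The following boto3 sketch mirrors those fields; the endpoint name is a placeholder, and `put_metric_alarm` requires AWS credentials when called.

```python
# Fields mirror the example alarm above; "my-endpoint" is a placeholder.
alarm_kwargs = {
    "AlarmName": "EndToEndDeploymentHighErrorRateAlarm",
    "AlarmDescription": "Monitors the error rate of 4xx errors",
    "MetricName": "Invocation4XXErrors",
    "Namespace": "AWS/SageMaker",
    "Statistic": "Average",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "Period": 600,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
}

def create_alarm(kwargs):
    """Create the alarm in CloudWatch (requires AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Remember that your execution role needs `cloudwatch:DescribeAlarms` permission on any alarm you reference in an auto-rollback configuration.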

### Monitor model latency on the new fleet
<a name="deployment-guardrails-configuration-alarm-examples-latency-new"></a>

The following CloudWatch alarm example monitors the new fleet’s model latency during your deployment. You can use this alarm to monitor only the new fleet and exclude the old fleet. The alarm lasts for the entire deployment. This example gives you comprehensive, end-to-end monitoring of the new fleet and initiates a rollback to the old fleet if the new fleet has any response time issues.

CloudWatch publishes the metrics with the dimension `EndpointConfigName:{New-Ep-Config}` after the new fleet starts receiving traffic, and these metrics last even after the deployment is complete.

You can use the following alarm example with any deployment type.

```
#Applied deployment type: all types
{
    "AlarmName": "NewEndpointConfigVersionHighModelLatencyAlarm",
    "AlarmDescription": "Monitors the model latency on new fleet",
    "MetricName": "ModelLatency",
    "Namespace": "AWS/SageMaker",
    "Statistic": "Average",
    "Dimensions": [
        {
            "Name": "EndpointName",
            "Value": "<your-endpoint-name>"
        },
        {
            "Name": "VariantName",
            "Value": "AllTraffic"
        },
        {
            "Name": "EndpointConfigName",
            "Value": "<your-config-name>"
        }
    ],
    "Period": 300,
    "EvaluationPeriods": 2,
    "Threshold": 100000, # ModelLatency is in microseconds; 100000 = 100 ms
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching"
}
```

In the previous example, note the values for the following fields:
+ For `MetricName`, use the value `ModelLatency` to monitor the model’s response time.
+ For `Namespace`, use the value `AWS/SageMaker`. You can also specify your own custom metric, if applicable.
+ For the dimension `EndpointName`, use the name of the endpoint you are updating as the value.
+ For the dimension `VariantName`, use the value `AllTraffic` to specify all endpoint traffic.
+ For the dimension `EndpointConfigName`, the value should refer to the endpoint configuration name for your new or updated endpoint.
+ For `Threshold`, use `100000`. `ModelLatency` is measured in microseconds, so this threshold equals 100 milliseconds.

**Note**  
If you want to monitor your old fleet instead of the new fleet, you can change the dimension `EndpointConfigName` to specify the name of your old fleet’s configuration.

# Blue/Green Deployments
<a name="deployment-guardrails-blue-green"></a>

When you update your endpoint, Amazon SageMaker AI automatically uses a blue/green deployment to maximize the availability of your endpoints. In a blue/green deployment, SageMaker AI provisions a new fleet with the updates (the green fleet). Then, SageMaker AI shifts traffic from the old fleet (the blue fleet) to the green fleet. Once the green fleet operates smoothly for a set evaluation period (called the baking period), SageMaker AI terminates the blue fleet. With the additional capabilities in blue/green deployments, you can utilize traffic shifting modes and auto-rollback monitoring to protect your endpoint from significant production impact.

The following list describes the key features of blue/green deployments in SageMaker AI:
+ **Traffic shifting modes.** The traffic shifting modes for deployment guardrails let you control the volume of traffic and number of traffic-shifting steps between the blue fleet and the green fleet. This capability gives you the ability to progressively evaluate the performance of the green fleet without fully committing to a 100% traffic shift.
+ **Baking period.** The baking period is a set amount of time to monitor the green fleet before proceeding to the next deployment stage. If any of the pre-specified alarms trip during any baking period, then all endpoint traffic rolls back to the blue fleet. The baking period helps you to build confidence in your update before making the traffic shift permanent.
+ **Auto-rollbacks.** You can specify Amazon CloudWatch alarms that SageMaker AI uses to monitor the green fleet. If an issue with the updated code trips any of the alarms, SageMaker AI initiates an auto-rollback to the blue fleet to maintain availability and minimize risk.

## Traffic Shifting Modes
<a name="deployment-guardrails-blue-green-traffic-modes"></a>

The various traffic shifting modes in blue/green deployments give you more granular control over traffic shifting between the blue fleet and the green fleet. The available traffic shifting modes for blue/green deployments are all at once, canary, and linear. The following table shows a comparison of the options.

**Important**  
For blue/green deployments that involve multiple-stage traffic shifting or baking periods, you are billed for both fleets for the duration of the update, regardless of the traffic each fleet receives. This is in contrast to blue/green deployments with all at once traffic shifting and no baking periods, where you are only billed for one fleet during the course of the update.


| Name | What is it? | Pros | Cons | Recommendation | 
| --- | --- | --- | --- | --- | 
| All at once | Shifts all of the traffic to the new fleet in a single step. | Minimizes the overall update duration. | Regressive updates affect 100% of the traffic. | Use this option to minimize update time and cost. | 
| Canary | Traffic shifts in two steps. The first (canary) step shifts a small portion of the traffic followed by the second step, which shifts the remainder of the traffic. | Confines the blast radius of regressive updates to only the canary fleet. | Both fleets are operational in parallel for the entire deployment. | Use this option to balance between minimizing the blast radius of regressive updates and minimizing the time that two fleets are operational. | 
| Linear | A fixed portion of the traffic shifts in a pre-specified number of equally spaced steps. | Minimizes the risk of regressive updates by shifting traffic over several steps. | The update duration and cost are proportional to the number of steps. | Use this option to minimize risk by spreading out deployment across multiple steps. | 
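
As a rough sketch, the three modes map to the following `TrafficRoutingConfig` shapes in the `UpdateEndpoint` API; the sizes and wait intervals below are illustrative, not recommendations.

```python
# Illustrative TrafficRoutingConfig payloads for each blue/green mode.
all_at_once = {"Type": "ALL_AT_ONCE"}

canary = {
    "Type": "CANARY",
    # Shift 30% of capacity first, bake, then shift the remainder.
    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30},
    "WaitIntervalInSeconds": 600,
}

linear = {
    "Type": "LINEAR",
    # Shift 20% of capacity per step, baking between steps.
    "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 20},
    "WaitIntervalInSeconds": 300,
}
```

Both `CanarySize` and `LinearStepSize` also accept `"Type": "INSTANCE_COUNT"` if you prefer to express step sizes in instances rather than capacity percentage.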

## Get Started
<a name="deployment-guardrails-blue-green-get-started"></a>

Once you specify your desired deployment configuration, SageMaker AI handles provisioning new instances, terminating old instances, and shifting traffic for you. You can create and manage your deployment through the existing [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) and [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) SageMaker API and AWS Command Line Interface commands. Note that if your endpoint uses any of the features listed in the [Exclusions](deployment-guardrails-exclusions.md) page, you cannot use deployment guardrails. See the individual deployment pages for more details on how to set up your deployment:
+ [ Blue/Green Update with All At Once Traffic Shifting](deployment-guardrails-blue-green-all-at-once.md)
+ [ Blue/Green Update with Canary Traffic Shifting](deployment-guardrails-blue-green-canary.md)
+ [ Blue/Green Update with Linear Traffic Shifting](deployment-guardrails-blue-green-linear.md)

To follow guided examples that show how to use deployment guardrails, see our example [Jupyter notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-inference-deployment-guardrails) for the canary and linear traffic shifting modes.

# Use all at once traffic shifting
<a name="deployment-guardrails-blue-green-all-at-once"></a>

With all at once traffic shifting, you can quickly roll out an endpoint update using the safety guardrails of a blue/green deployment. You can use this traffic shifting option to minimize the update duration while still taking advantage of the availability guarantees of blue/green deployments. The baking period feature helps you to monitor the performance and functionality of your new instances before terminating your old instances, ensuring that your new fleet is fully operational.

The following diagram shows how all at once traffic shifting manages the old and new fleets.

![\[A successful 100% traffic shift from the old fleet to the new fleet.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deployment-guardrails-blue-green-all-at-once.png)


When you use all at once traffic shifting, SageMaker AI routes 100% of the traffic to the new fleet (green fleet). Once the green fleet starts receiving traffic, the baking period begins. The baking period is a set amount of time in which pre-specified Amazon CloudWatch alarms monitor the performance of the green fleet. If no alarms trip during the baking period, SageMaker AI terminates the old fleet (blue fleet). If any alarms trip during the baking period, then an auto-rollback initiates and 100% of the traffic shifts back to the blue fleet.

## Prerequisites
<a name="deployment-guardrails-blue-green-all-at-once-prereqs"></a>

Before setting up a deployment with all at once traffic shifting, you must create Amazon CloudWatch alarms to watch metrics from your endpoint. If any of the alarms trip during the baking period, then the traffic rolls back to your blue fleet. To learn how to set up CloudWatch alarms on an endpoint, see the prerequisite page [Auto-Rollback Configuration and Monitoring](deployment-guardrails-configuration.md). To learn more about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

## Configure All At Once Traffic Shifting
<a name="deployment-guardrails-blue-green-all-at-once-configure"></a>

Once you are ready for your deployment and have set up CloudWatch alarms for your endpoint, you can use either the SageMaker AI [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API or the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command in the AWS Command Line Interface to initiate the deployment.

**Topics**
+ [How to update an endpoint (API)](#deployment-guardrails-blue-green-all-at-once-configure-api-update)
+ [How to update an endpoint with an existing blue/green update policy (API)](#deployment-guardrails-blue-green-all-at-once-configure-api-existing)
+ [How to update an endpoint (CLI)](#deployment-guardrails-blue-green-all-at-once-configure-cli-update)

### How to update an endpoint (API)
<a name="deployment-guardrails-blue-green-all-at-once-configure-api-update"></a>

The following example shows how you can update your endpoint with all at once traffic shifting using [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) in the Amazon SageMaker API.

```
import boto3
client = boto3.client("sagemaker")

response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "ALL_AT_ONCE"
            },
            "TerminationWaitInSeconds": 600,
            "MaximumExecutionTimeoutInSeconds": 1800
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "<your-cw-alarm>"
                },
            ]
        }
    }
)
```

To configure the all at once traffic shifting option, do the following:
+ For `EndpointName`, use the name of the existing endpoint you want to update.
+ For `EndpointConfigName`, use the name of the endpoint configuration you want to use.
+ Under `DeploymentConfig` and `BlueGreenUpdatePolicy`, in `TrafficRoutingConfiguration`, set the `Type` parameter to `ALL_AT_ONCE`. This specifies that the deployment uses the all at once traffic shifting mode.
+ For `TerminationWaitInSeconds`, use `600`. This parameter tells SageMaker AI to wait for the specified amount of time (in seconds) after your green fleet is fully active before terminating the instances in the blue fleet. In this example, SageMaker AI waits for 10 minutes after the final baking period before terminating the blue fleet.
+ For `MaximumExecutionTimeoutInSeconds`, use `1800`. This parameter sets the maximum amount of time that the deployment can run before it times out. In the preceding example, your deployment has a limit of 30 minutes to finish.
+ In `AutoRollbackConfiguration`, within the `Alarms` field, you can add your CloudWatch alarms by name. Create one `AlarmName: <your-cw-alarm>` entry for each alarm you want to use.
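
Note that `MaximumExecutionTimeoutInSeconds` needs to exceed the total waiting time from any traffic-shifting intervals plus `TerminationWaitInSeconds`. The following hypothetical helper sketches that back-of-the-envelope check:

```python
def timeout_covers_waits(wait_intervals, termination_wait, max_execution):
    """Hypothetical sanity check: the maximum execution timeout should
    exceed the sum of all traffic-shifting wait intervals plus the
    termination wait."""
    return sum(wait_intervals) + termination_wait < max_execution

# The all at once example above has no traffic-shifting waits, a 600-second
# termination wait, and a 1800-second maximum execution timeout.
```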

### How to update an endpoint with an existing blue/green update policy (API)
<a name="deployment-guardrails-blue-green-all-at-once-configure-api-existing"></a>

When you use the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create an endpoint, you can optionally specify a deployment configuration to reuse for future endpoint updates. You can use the same `DeploymentConfig` options as the previous UpdateEndpoint API example. There are no changes to the CreateEndpoint API behavior. Specifying the deployment configuration does not automatically perform a blue/green update on your endpoint.

The option to reuse a previous deployment configuration applies when you call the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API to update your endpoint. When updating your endpoint, you can use the `RetainDeploymentConfig` option to keep the deployment configuration you specified when you created the endpoint.

When calling the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, set `RetainDeploymentConfig` to `True` to keep the `DeploymentConfig` options from your original endpoint configuration.

```
response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    RetainDeploymentConfig=True
)
```

### How to update an endpoint (CLI)
<a name="deployment-guardrails-blue-green-all-at-once-configure-cli-update"></a>

If you are using the AWS CLI, the following example shows how to start a blue/green all at once deployment using the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command.

```
aws sagemaker update-endpoint \
    --endpoint-name <your-endpoint-name> \
    --endpoint-config-name <your-config-name> \
    --deployment-config '{"BlueGreenUpdatePolicy": {"TrafficRoutingConfiguration": {"Type": "ALL_AT_ONCE"},
    "TerminationWaitInSeconds": 600, "MaximumExecutionTimeoutInSeconds": 1800},
    "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": "<your-alarm>"}]}}'
```

To configure the all at once traffic shifting option, do the following:
+ For `endpoint-name`, use the name of the endpoint you want to update.
+ For `endpoint-config-name`, use the name of the endpoint configuration you want to use.
+ For `deployment-config`, use a [BlueGreenUpdatePolicy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BlueGreenUpdatePolicy.html) JSON object.

**Note**  
If you would rather save your JSON object in a file, see [Generating AWS CLI skeleton and input parameters](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-skeleton.html) in the *AWS CLI User Guide*.
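
For readability, you can keep the deployment configuration in a JSON file and pass it to the CLI with the `file://` prefix. A sketch, where the endpoint, configuration, and alarm names are placeholders:

```shell
# Write the deployment configuration to a file.
cat > deployment-config.json <<'EOF'
{
  "BlueGreenUpdatePolicy": {
    "TrafficRoutingConfiguration": {"Type": "ALL_AT_ONCE"},
    "TerminationWaitInSeconds": 600,
    "MaximumExecutionTimeoutInSeconds": 1800
  },
  "AutoRollbackConfiguration": {
    "Alarms": [{"AlarmName": "my-alarm"}]
  }
}
EOF

# Validate the JSON before calling the API.
python3 -m json.tool deployment-config.json > /dev/null && echo "valid"

# Then reference the file in the update call (requires AWS credentials):
# aws sagemaker update-endpoint \
#     --endpoint-name my-endpoint \
#     --endpoint-config-name my-config \
#     --deployment-config file://deployment-config.json
```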

# Use canary traffic shifting
<a name="deployment-guardrails-blue-green-canary"></a>

With canary traffic shifting, you can test a portion of your endpoint traffic on the new fleet while the old fleet serves the remainder of the traffic. This testing step is a safety guardrail that validates the new fleet’s functionality before shifting all of your traffic to the new fleet. You still have the benefits of a blue/green deployment, and the added canary feature lets you ensure that your new (green) fleet can serve inference before letting it handle 100% of the traffic.

The portion of your green fleet that turns on to receive traffic is called the canary, and you can choose the size of this canary. Note that the canary size should be less than or equal to 50% of the new fleet's capacity. Once the baking period finishes and no pre-specified Amazon CloudWatch alarms trip, the rest of the traffic shifts from the old (blue) fleet to the green fleet. Canary traffic shifting provides you with more safety during your deployment since any issues with the updated model only impact the canary.

The following diagram shows how canary traffic shifting manages the distribution of traffic between the blue and green fleets.

![\[A successful two step canary traffic shift from the old fleet to the new fleet.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deployment-guardrails-blue-green-canary.png)


Once SageMaker AI provisions the green fleet, SageMaker AI routes a portion of the incoming traffic (for example, 25%) to the canary. Then the baking period begins, during which your CloudWatch alarms monitor the performance of the green fleet. During this time, both the blue fleet and green fleet are partially active and receiving traffic. If any of the alarms trip during the baking period, then SageMaker AI initiates a rollback and all traffic returns to the blue fleet. If none of the alarms trip, then all of the traffic shifts to the green fleet and there is a final baking period. If the final baking period finishes without tripping any alarms, then the green fleet serves all traffic and SageMaker AI terminates the blue fleet.

## Prerequisites
<a name="deployment-guardrails-blue-green-canary-prereqs"></a>

Before setting up a deployment with canary traffic shifting, you must create Amazon CloudWatch alarms to monitor metrics from your endpoint. The alarms are active during the baking period, and if any alarms trip, then all endpoint traffic rolls back to the blue fleet. To learn how to set up CloudWatch alarms on an endpoint, see the prerequisite page [Auto-Rollback Configuration and Monitoring](deployment-guardrails-configuration.md). To learn more about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

## Configure Canary Traffic Shifting
<a name="deployment-guardrails-blue-green-canary-configure"></a>

Once you are ready for your deployment and have set up Amazon CloudWatch alarms for your endpoint, you can use either the Amazon SageMaker AI [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API or the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command in the AWS CLI to initiate the deployment.

**Topics**
+ [How to update an endpoint (API)](#deployment-guardrails-blue-green-canary-configure-api-update)
+ [How to update an endpoint with an existing blue/green update policy (API)](#deployment-guardrails-blue-green-canary-configure-api-existing)
+ [How to update an endpoint (CLI)](#deployment-guardrails-blue-green-canary-configure-cli-update)

### How to update an endpoint (API)
<a name="deployment-guardrails-blue-green-canary-configure-api-update"></a>

The following example of the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API shows how you can update an endpoint with canary traffic shifting.

```
import boto3
client = boto3.client("sagemaker")

response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {
                    "Type": "CAPACITY_PERCENT",
                    "Value": 30
                },
                "WaitIntervalInSeconds": 600
            },
            "TerminationWaitInSeconds": 600,
            "MaximumExecutionTimeoutInSeconds": 1800
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "<your-cw-alarm>"
                }
            ]
        }
    }
)
```

To configure the canary traffic shifting option, do the following:
+ For `EndpointName`, use the name of the existing endpoint you want to update.
+ For `EndpointConfigName`, use the name of the endpoint configuration you want to use.
+ Under `DeploymentConfig` and `BlueGreenUpdatePolicy`, in `TrafficRoutingConfiguration`, set the `Type` parameter to `CANARY`. This specifies that the deployment uses canary traffic shifting.
+ In the `CanarySize` field, you can change the size of the canary by modifying the `Type` and `Value` parameters. For `Type`, use `CAPACITY_PERCENT`, meaning the percentage of your green fleet you want to use as the canary, and then set `Value` to `30`. In this example, you use 30% of the green fleet’s capacity as the canary. Note that the canary size should be equal to or less than 50% of the green fleet's capacity.
+ For `WaitIntervalInSeconds`, use `600`. The parameter tells SageMaker AI to wait for the specified amount of time (in seconds) between each interval shift. This interval is the duration of the canary baking period. In the preceding example, SageMaker AI waits for 10 minutes after the canary shift and then completes the second and final traffic shift.
+ For `TerminationWaitInSeconds`, use `600`. This parameter tells SageMaker AI to wait for the specified amount of time (in seconds) after your green fleet is fully active before terminating the instances in the blue fleet. In this example, SageMaker AI waits for 10 minutes after the final baking period before terminating the blue fleet.
+ For `MaximumExecutionTimeoutInSeconds`, use `1800`. This parameter sets the maximum amount of time that the deployment can run before it times out. In the preceding example, your deployment has a limit of 30 minutes to finish.
+ In `AutoRollbackConfiguration`, within the `Alarms` field, you can add your CloudWatch alarms by name. Create one `AlarmName: <your-cw-alarm>` entry for each alarm you want to use.
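As a rough sanity check on these values, the arithmetic below (plain Python, with instance provisioning time ignored) confirms that the example's canary baking period and termination wait fit within the execution timeout. Whether the termination wait counts toward the timeout is an assumption here, so treat the result as a lower bound.

```
wait_interval = 600            # canary baking period, in seconds
termination_wait = 600         # wait after the green fleet is fully active
max_execution_timeout = 1800   # overall deployment time limit

# Lower bound on deployment time: canary bake, then the final shift,
# then the termination wait before the blue fleet is cleaned up.
lower_bound = wait_interval + termination_wait
print(lower_bound)  # 1200

assert lower_bound <= max_execution_timeout
```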

### How to update an endpoint with an existing blue/green update policy (API)
<a name="deployment-guardrails-blue-green-canary-configure-api-existing"></a>

When you use the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create an endpoint, you can optionally specify a deployment configuration to reuse for future endpoint updates. You can use the same `DeploymentConfig` options as the previous UpdateEndpoint API example. There are no changes to the CreateEndpoint API behavior. Specifying the deployment configuration does not automatically perform a blue/green update on your endpoint.
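A minimal sketch of such a request follows, assuming placeholder names. The `DeploymentConfig` mirrors the canary example above; it is stored with the endpoint for reuse rather than applied at creation time.

```
# Hypothetical CreateEndpoint request that stores a deployment configuration
# for later updates. Endpoint and configuration names are placeholders.
create_endpoint_request = {
    "EndpointName": "<your-endpoint-name>",
    "EndpointConfigName": "<your-config-name>",
    "DeploymentConfig": {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30},
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 600,
            "MaximumExecutionTimeoutInSeconds": 1800,
        },
    },
}

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_endpoint(**create_endpoint_request)
```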

The option to use a previous deployment configuration happens when using the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API to update your endpoint. When updating your endpoint, you can use the `RetainDeploymentConfig` option to keep the deployment configuration you specified when you created the endpoint.

When calling the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, set `RetainDeploymentConfig` to `True` to keep the `DeploymentConfig` options from your original endpoint configuration.

```
response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    RetainDeploymentConfig=True
)
```

### How to update an endpoint (CLI)
<a name="deployment-guardrails-blue-green-canary-configure-cli-update"></a>

If you are using the AWS CLI, the following example shows how to start a blue/green canary deployment using the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command.

```
update-endpoint
--endpoint-name <your-endpoint-name>
--endpoint-config-name <your-config-name> 
--deployment-config '{"BlueGreenUpdatePolicy": {"TrafficRoutingConfiguration": {"Type": "CANARY",
    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30}, "WaitIntervalInSeconds": 600},
    "TerminationWaitInSeconds": 600, "MaximumExecutionTimeoutInSeconds": 1800},
    "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": "<your-alarm>"}]}}'
```

To configure the canary traffic shifting option, do the following:
+ For `endpoint-name`, use the name of the endpoint you want to update.
+ For `endpoint-config-name`, use the name of the endpoint configuration you want to use.
+ For `deployment-config`, use a [BlueGreenUpdatePolicy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BlueGreenUpdatePolicy.html) JSON object.

**Note**  
If you would rather save your JSON object in a file, see [Generating AWS CLI skeleton and input parameters](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-skeleton.html) in the *AWS CLI User Guide*.

# Use linear traffic shifting
<a name="deployment-guardrails-blue-green-linear"></a>

Linear traffic shifting enables you to gradually shift traffic from your old fleet (blue fleet) to your new fleet (green fleet). With linear traffic shifting, you can shift traffic in multiple steps, minimizing the chance of a disruption to your endpoint. This blue/green deployment option gives you the most granular control over traffic shifting.

You can choose either the number of instances or the percentage of the green fleet’s capacity to activate during each step. Each linear step should only be between 10-50% of the green fleet's capacity. For each step, there is a baking period during which your pre-specified Amazon CloudWatch alarms monitor metrics on the green fleet. Once the baking period finishes and no alarms trip, the active portion of your green fleet continues receiving traffic and a new step begins. If alarms trip during any of the baking periods, 100% of the endpoint traffic rolls back to the blue fleet.

The following diagram shows how linear traffic shifting routes traffic to the blue and green fleets.

![\[A successful three step linear traffic shift from the old fleet to the new fleet.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deployment-guardrails-blue-green-linear.png)


Once SageMaker AI provisions the new fleet, the first portion of the green fleet turns on and receives traffic. SageMaker AI deactivates the same size portion of the blue fleet, and the baking period begins. If any alarms trip, all of the endpoint traffic rolls back to the blue fleet. If the baking period finishes, then the next step begins. Another portion of the green fleet activates and receives traffic, part of the blue fleet deactivates, and another baking period begins. The same process repeats until the blue fleet is fully deactivated and the green fleet is fully active and receiving all traffic. If an alarm goes off at any point, SageMaker AI terminates the shifting process and 100% of the traffic rolls back to the blue fleet.

## Prerequisites
<a name="deployment-guardrails-blue-green-linear-prereqs"></a>

Before setting up a deployment with linear traffic shifting, you must create CloudWatch alarms to monitor metrics from your endpoint. The alarms are active during the baking period, and if any alarms trip, then all endpoint traffic rolls back to the blue fleet. To learn how to set up CloudWatch alarms on an endpoint, see the prerequisite page [Auto-Rollback Configuration and Monitoring](deployment-guardrails-configuration.md). To learn more about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

## Configure Linear Traffic Shifting
<a name="deployment-guardrails-blue-green-linear-configure"></a>

Once you are ready for your deployment and have set up CloudWatch alarms for your endpoint, you can use either the Amazon SageMaker AI [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API or the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command in the AWS CLI to initiate the deployment.

**Topics**
+ [How to update an endpoint (API)](#deployment-guardrails-blue-green-linear-configure-api-update)
+ [How to update an endpoint with an existing blue/green update policy (API)](#deployment-guardrails-blue-green-linear-configure-api-existing)
+ [How to update an endpoint (CLI)](#deployment-guardrails-blue-green-linear-configure-cli-update)

### How to update an endpoint (API)
<a name="deployment-guardrails-blue-green-linear-configure-api-update"></a>

The following example of the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API shows how you can update an endpoint with linear traffic shifting.

```
import boto3
client = boto3.client("sagemaker")

response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {
                    "Type": "CAPACITY_PERCENT",
                    "Value": 20
                },
                "WaitIntervalInSeconds": 300
            },
            "TerminationWaitInSeconds": 300,
            "MaximumExecutionTimeoutInSeconds": 3600
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "<your-cw-alarm>"
                }
            ]
        }
    }
)
```

To configure the linear traffic shifting option, do the following:
+ For `EndpointName`, use the name of the existing endpoint you want to update.
+ For `EndpointConfigName`, use the name of the endpoint configuration you want to use.
+ Under `DeploymentConfig` and `BlueGreenUpdatePolicy`, in `TrafficRoutingConfiguration`, set the `Type` parameter to `LINEAR`. This specifies that the deployment uses linear traffic shifting.
+ In the `LinearStepSize` field, you can change the size of the steps by modifying the `Type` and `Value` parameters. For `Type`, use `CAPACITY_PERCENT`, meaning the percentage of your green fleet you want to use as the step size, and then set `Value` to `20`. In this example, you turn on 20% of the green fleet’s capacity for each traffic shifting step. Note that when customizing your linear step size, you should only use steps that are 10-50% of the green fleet's capacity.
+ For `WaitIntervalInSeconds`, use `300`. The parameter tells SageMaker AI to wait for the specified amount of time (in seconds) between each traffic shift. This interval is the duration of the baking period between each linear step. In the preceding example, SageMaker AI waits for 5 minutes between each traffic shift.
+ For `TerminationWaitInSeconds`, use `300`. This parameter tells SageMaker AI to wait for the specified amount of time (in seconds) after your green fleet is fully active before terminating the instances in the blue fleet. In this example, SageMaker AI waits for 5 minutes after the final baking period before terminating the blue fleet.
+ For `MaximumExecutionTimeoutInSeconds`, use `3600`. This parameter sets the maximum amount of time that the deployment can run before it times out. In the preceding example, your deployment has a limit of 1 hour to finish.
+ In `AutoRollbackConfiguration`, within the `Alarms` field, you can add your CloudWatch alarms by name. Create one `AlarmName: <your-cw-alarm>` entry for each alarm you want to use.
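Putting the example's numbers together, the plain-arithmetic sketch below (provisioning time ignored) checks that the five 20% steps and their baking periods fit within the execution timeout.

```
import math

# With 20% linear steps, five traffic shifts are needed to reach 100%
# of the green fleet's capacity.
step_percent = 20
wait_interval = 300            # baking period per step, in seconds
max_execution_timeout = 3600   # overall deployment time limit

steps = math.ceil(100 / step_percent)
total_baking = steps * wait_interval
print(steps, total_baking)  # 5 1500

# The baking periods alone must fit inside the execution timeout;
# real deployments also spend time provisioning instances.
assert total_baking <= max_execution_timeout
```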

### How to update an endpoint with an existing blue/green update policy (API)
<a name="deployment-guardrails-blue-green-linear-configure-api-existing"></a>

When you use the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create an endpoint, you can optionally specify a deployment configuration to reuse for future endpoint updates. You can use the same `DeploymentConfig` options as the previous UpdateEndpoint API example. There are no changes to the CreateEndpoint API behavior. Specifying the deployment configuration does not automatically perform a blue/green update on your endpoint.

The option to use a previous deployment configuration happens when using the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API to update your endpoint. When updating your endpoint, you can use the `RetainDeploymentConfig` option to keep the deployment configuration you specified when you created the endpoint.

When calling the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, set `RetainDeploymentConfig` to `True` to keep the `DeploymentConfig` options from your original endpoint configuration.

```
response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    RetainDeploymentConfig=True
)
```

### How to update an endpoint (CLI)
<a name="deployment-guardrails-blue-green-linear-configure-cli-update"></a>

If you are using the AWS CLI, the following example shows how to start a blue/green linear deployment using the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command.

```
update-endpoint
--endpoint-name <your-endpoint-name>
--endpoint-config-name <your-config-name> 
--deployment-config '{"BlueGreenUpdatePolicy": {"TrafficRoutingConfiguration": {"Type": "LINEAR",
    "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 20}, "WaitIntervalInSeconds": 300},
    "TerminationWaitInSeconds": 300, "MaximumExecutionTimeoutInSeconds": 3600},
    "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": "<your-alarm>"}]}}'
```

To configure the linear traffic shifting option, do the following:
+ For `endpoint-name`, use the name of the endpoint you want to update.
+ For `endpoint-config-name`, use the name of the endpoint configuration you want to use.
+ For `deployment-config`, use a [BlueGreenUpdatePolicy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BlueGreenUpdatePolicy.html) JSON object.

**Note**  
If you would rather save your JSON object in a file, see [Generating AWS CLI skeleton and input parameters](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-skeleton.html) in the *AWS CLI User Guide*.

# Use rolling deployments
<a name="deployment-guardrails-rolling"></a>

When you update your endpoint, you can specify a rolling deployment to gradually shift traffic from your old fleet to a new fleet. You can control the size of the traffic shifting steps, as well as specify an evaluation period to monitor the new instances for issues before terminating instances from the old fleet. With rolling deployments, instances on the old fleet are cleaned up after each traffic shift to the new fleet, reducing the number of additional instances needed to update your endpoint. This is especially useful for accelerated instances that are in high demand.

Rolling deployments gradually replace the previous deployment of your model version with the new version by updating your endpoint in configurable batch sizes. The traffic shifting behavior of rolling deployments is similar to the [linear traffic shifting mode](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-blue-green-linear.html) in blue/green deployments, but rolling deployments provide you with the benefit of reduced capacity requirements when compared to blue/green deployments. With rolling deployments, fewer instances are active at a time, and you have more granular control over how many instances you want to update in the new fleet. You should consider using a rolling deployment instead of a blue/green deployment if you have large models or a large endpoint with many instances.

The following list describes the key features of rolling deployments in Amazon SageMaker AI:
+ **Baking period.** The baking period is a set amount of time to monitor the new fleet before proceeding to the next deployment stage. If any of the pre-specified alarms trip during any baking period, then all endpoint traffic rolls back to the old fleet. The baking period helps you to build confidence in your update before making the traffic shift permanent.
+ **Rolling batch size.** You have granular control over the size of each batch for traffic shifting, or the number of instances you want to update in each batch. This number can range from 5–50% of the size of your fleet. You can specify the batch size as a number of instances or as the overall percentage of your fleet.
+ **Auto-rollbacks.** You can specify Amazon CloudWatch alarms that SageMaker AI uses to monitor the new fleet. If an issue with the updated code trips any of the alarms, SageMaker AI initiates an auto-rollback to the old fleet in order to maintain availability, thereby minimizing risk.

**Note**  
If your endpoint uses any of the features listed in the [Exclusions](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-exclusions.html) page, you cannot use rolling deployments.

## How it works
<a name="deployment-guardrails-rolling-how-it-works"></a>

During a rolling deployment, SageMaker AI provides the infrastructure to shift traffic from the old fleet to the new fleet without having to provision all of the new instances at once. SageMaker AI uses the following steps to shift traffic:

1. SageMaker AI provisions the first batch of instances in the new fleet.

1. A portion of traffic is shifted from the old instances to the first batch of new instances.

1. After the baking period, if no Amazon CloudWatch alarms are tripped, then SageMaker AI cleans up a batch of old instances.

1. SageMaker AI continues to provision, shift, and clean up instances in batches until the deployment is complete.

If an alarm is tripped during one of the baking periods, then traffic is rolled back to the old fleet in batches of a size that you specify. Alternatively, you can specify the rolling deployment to shift 100% of the traffic back to the old fleet if an alarm is tripped.
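The steps above can be sketched as a toy loop (plain Python, no AWS calls): each round shifts one batch of traffic to the new fleet and, after the baking period, cleans up a matching batch of old instances.

```
# Toy simulation of a successful rolling deployment. Returns the
# (old, new) instance counts after each batch completes its bake.
def rolling_deployment(fleet_size, batch_size):
    old, new = fleet_size, 0
    events = []
    while old > 0:
        batch = min(batch_size, old)
        new += batch   # provision the batch and shift traffic to it
        old -= batch   # after the bake, clean up a batch of old instances
        events.append((old, new))
    return events

print(rolling_deployment(10, 4))  # [(6, 4), (2, 8), (0, 10)]
```

Note how at any moment only one extra batch is in flight, which is the capacity saving over blue/green deployments.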

The following diagram shows the progression of a successful rolling deployment, as described in the previous steps.

![\[The steps of a rolling deployment's traffic shifting successfully from the old to the new fleet.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deployment-guardrails-rolling-diagram.png)


To create a rolling deployment, you only have to specify your desired deployment configuration. Then SageMaker AI handles provisioning new instances, terminating old instances, and shifting traffic for you. You can create and manage your deployment through the existing [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) and [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) SageMaker API and AWS Command Line Interface commands.

## Prerequisites
<a name="deployment-guardrails-prereqs"></a>

Before setting up a rolling deployment, you must create Amazon CloudWatch alarms to watch metrics from your endpoint. If any of the alarms trip during the baking period, then the traffic begins rolling back to your old fleet. To learn how to set up CloudWatch alarms on an endpoint, see the prerequisite page [Auto-Rollback Configuration and Monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-configuration.html). To learn more about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.

Also, review the [Exclusions](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-exclusions.html) page to make sure that your endpoint meets the requirements for a rolling deployment.

## Determine the rolling batch size
<a name="deployment-guardrails-rolling-batch-size"></a>

Before updating your endpoint, determine the batch size that you want to use for incrementally shifting traffic to the new fleet.

For rolling deployments, you can specify a batch size that is 5–50% of the capacity of your fleet. If you choose a large batch size, the deployment completes more quickly. However, keep in mind that the endpoint requires more capacity while updating: roughly one additional batch of instances. If you choose a smaller batch size, the deployment takes longer, but you use less capacity during the deployment.
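To make the trade-off concrete, the following sketch compares batch sizes for a hypothetical 20-instance fleet (the fleet size and percentages are illustrative, not prescribed values):

```
import math

# Hypothetical 20-instance fleet: larger batches mean fewer traffic-shifting
# rounds but more spare capacity in flight during the update.
fleet_size = 20
results = {}
for batch_percent in (10, 25, 50):
    batch = math.ceil(fleet_size * batch_percent / 100)  # instances per batch
    rounds = math.ceil(fleet_size / batch)               # traffic-shifting rounds
    results[batch_percent] = (batch, rounds)

print(results)  # {10: (2, 10), 25: (5, 4), 50: (10, 2)}
```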

## Configure a rolling deployment
<a name="deployment-guardrails-rolling-configure"></a>

Once you are ready for your deployment and have set up CloudWatch alarms for your endpoint, you can use the SageMaker AI [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API or the [update-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-endpoint.html) command in the AWS Command Line Interface to initiate the deployment.

**How to update an endpoint**

The following example shows how you can update your endpoint with a rolling deployment using the [update_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_endpoint.html) method of the Boto3 SageMaker AI client.

To configure a rolling deployment, use the following example and fields:
+ For `EndpointName`, use the name of the existing endpoint you want to update.
+ For `EndpointConfigName`, use the name of the endpoint configuration you want to use.
+ In the `AutoRollbackConfiguration` object, within the `Alarms` field, you can add your CloudWatch alarms by name. Create one `AlarmName: <your-cw-alarm>` entry for each alarm you want to use.
+ Under `DeploymentConfig`, for the `RollingUpdatePolicy` object, specify the following fields:
  + `MaximumExecutionTimeoutInSeconds` — The time limit for the total deployment. Exceeding this limit causes a timeout. The maximum value you can specify for this field is 28800 seconds, or 8 hours.
  + `WaitIntervalInSeconds` — The length of the baking period, during which SageMaker AI monitors alarms for each batch on the new fleet.
  + `MaximumBatchSize` — Specify the `Type` of batch you want to use (either instance count or overall percentage of your fleet) and the `Value`, or the size of each batch.
  + `RollbackMaximumBatchSize` — Use this object to specify the rollback strategy in case an alarm trips. Specify the `Type` of batch you want to use (either instance count or overall percentage of your fleet), and the `Value`, or the size of each batch. If you don’t specify these fields, or if you set the value to 100% of your endpoint, then SageMaker AI uses a blue/green rollback strategy and rolls all traffic back to the old fleet when an alarm trips.

```
import boto3
client = boto3.client("sagemaker")

response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-config-name>",
    DeploymentConfig={
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "<your-cw-alarm>"
                },
            ]
        },
        "RollingUpdatePolicy": { 
            "MaximumExecutionTimeoutInSeconds": number,
            "WaitIntervalInSeconds": number,
            "MaximumBatchSize": {
                "Type": "INSTANCE_COUNT" | "CAPACITY_PERCENT" (default),
                "Value": number
            },
            "RollbackMaximumBatchSize": {
                "Type": "INSTANCE_COUNT" | "CAPACITY_PERCENT" (default),
                "Value": number
            },
        }  
    }
)
```

After updating your endpoint, you might want to check the status of your rolling deployment and check the health of your endpoint. You can review your endpoint’s status in the SageMaker AI console, or you can review the status of your endpoint by using the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API.

In the `VariantStatus` object returned by the `DescribeEndpoint` API, the `Status` field tells you the current deployment or operational status of your endpoint. For more information about the possible statuses and what they mean, see [ProductionVariantStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariantStatus.html).
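A small sketch of reading those statuses follows. The helper and the sample response are illustrative (the sample is shaped like a `DescribeEndpoint` result, not real output); the live call at the end requires AWS credentials.

```
# Map each production variant name to its current list of statuses from a
# DescribeEndpoint-shaped response.
def variant_statuses(response):
    return {
        variant["VariantName"]: [s["Status"] for s in variant.get("VariantStatus", [])]
        for variant in response.get("ProductionVariants", [])
    }

# Illustrative sample response for an endpoint mid-deployment.
sample = {
    "EndpointName": "<your-endpoint-name>",
    "EndpointStatus": "Updating",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "VariantStatus": [{"Status": "Updating", "StatusMessage": "..."}],
        },
    ],
}
print(variant_statuses(sample))  # {'AllTraffic': ['Updating']}

# Live call (requires credentials):
#   import boto3
#   resp = boto3.client("sagemaker").describe_endpoint(EndpointName="<your-endpoint-name>")
#   print(variant_statuses(resp))
```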

If you attempted to do a rolling deployment and the status of your endpoint is `UpdateRollbackFailed`, see the following section for troubleshooting help.

## Failure handling
<a name="deployment-guardrails-rolling-failures"></a>

If your rolling deployment fails and the auto-rollback also fails, your endpoint can be left with a status of `UpdateRollbackFailed`. This status means that different endpoint configurations are deployed to the instances behind your endpoint, and your endpoint is in service with a mix of old and new endpoint configurations.

You can make another call to the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API to return your endpoint to a healthy state. Specify your desired endpoint configuration and deployment configuration (either as a rolling deployment, a blue/green deployment, or neither) to update your endpoint.

You can call the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API to check the health of your endpoint again, which is returned in the `VariantStatus` object as the `Status` field. If your update is successful, your endpoint’s `Status` returns to `InService`.
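A minimal recovery sketch follows, assuming a known-good endpoint configuration name (a placeholder). Omitting `DeploymentConfig` here implies the default update behavior; you could equally supply a rolling or blue/green configuration.

```
# Hypothetical recovery request for an endpoint stuck in UpdateRollbackFailed:
# issue a fresh update pointing at a known-good endpoint configuration.
recovery_request = {
    "EndpointName": "<your-endpoint-name>",
    "EndpointConfigName": "<known-good-config-name>",
}

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").update_endpoint(**recovery_request)
```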

# Exclusions
<a name="deployment-guardrails-exclusions"></a>

When doing a blue/green or rolling deployment, your new endpoint configuration must have the same variant name as the old endpoint configuration. There are also feature-based exclusions that make your endpoint incompatible with deployment guardrails at this time. If your endpoint uses any of the following features, you cannot use deployment guardrails on your endpoint, and your endpoint will fall back to using a blue/green deployment with all at once traffic shifting and no final baking period:
+ Marketplace containers
+ Endpoints that use Inf1 (Inferentia-based) instances

If you're doing a rolling deployment, there are additional feature-based exclusions:
+ Serverless inference endpoints
+ Multi-variant inference endpoints

# Shadow tests
<a name="shadow-tests"></a>

 With Amazon SageMaker AI you can evaluate any changes to your model serving infrastructure by comparing its performance against the currently deployed infrastructure. This practice is known as shadow testing. Shadow testing can help you catch potential configuration errors and performance issues before they impact end users. With SageMaker AI, you don't need to invest in building your shadow testing infrastructure, so you can focus on model development. 

 You can use this capability to validate changes to any component of your production variant, namely the model, the container, or the instance, without any end user impact. It is useful in situations including but not limited to the following: 
+  You are considering promoting a new model that has been validated offline to production, but want to evaluate operational performance metrics such as latency and error rate before making this decision. 
+  You are considering changes to your serving infrastructure container, such as patching vulnerabilities or upgrading to newer versions, and want to assess the impact of these changes prior to promotion to production. 
+  You are considering changing your ML instance and want to evaluate how the new instance would perform with live inference requests. 

 The SageMaker AI console provides a guided experience to manage the workflow of shadow testing. You can set up shadow tests for a predefined duration of time, monitor the progress of the test through a live dashboard, clean up upon completion, and act on the results. Select a production variant you want to test against, and SageMaker AI automatically deploys the new variant in shadow mode and routes a copy of the inference requests to it in real time within the same endpoint. Only the responses of the production variant are returned to the calling application. You can choose to discard or log the responses of the shadow variant for offline comparison. For more information on production and shadow variants, see [Validation of models in production](model-validation.md). 

 See [Create a shadow test](shadow-tests-create.md) for instructions on creating a shadow test. 

**Note**  
 Certain endpoint features may make your endpoint incompatible with shadow tests. If your endpoint uses any of the following features, you cannot use shadow tests on your endpoint, and your request to set up shadow tests will lead to validation errors:
+ Serverless inference
+ Asynchronous inference
+ Marketplace containers
+ Multiple-container endpoints
+ Multi-model endpoints
+ Endpoints that use Inf1 (Inferentia-based) instances

# Create a shadow test
<a name="shadow-tests-create"></a>

 You can create a shadow test to compare the performance of a shadow variant against a production variant. You can run the test on an existing endpoint that is serving inference requests or you can create a new endpoint on which to run the test. 

 To create a shadow test you need to specify the following: 
+  A *production variant* that receives and responds to 100 percent of the incoming inference requests. 
+  A *shadow variant* that receives a percentage of the incoming requests, replicated from the production variant, but does not return any responses. 

 For each variant, you can use SageMaker AI to control the model, instance type, and instance count. You can configure the percentage of incoming requests, known as the traffic sampling percentage, that you want replicated to your shadow variant. SageMaker AI manages the replication of requests to your shadow variant and you can modify the traffic sampling percentage when your test is scheduled or running. You can also optionally turn on Data Capture to log requests and responses of your production and shadow variants. 
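Outside the console, an endpoint configuration with a shadow variant can be sketched as follows. All names, models, and instance settings are placeholders, and the assumption that the shadow variant's weight relative to the production variant's weight acts as the traffic sampling fraction should be verified against the CreateEndpointConfig API reference before relying on it.

```
# Hypothetical endpoint configuration: one production variant plus one
# shadow variant sampling 50% of the replicated traffic.
endpoint_config_request = {
    "EndpointConfigName": "<your-config-name>",
    "ProductionVariants": [
        {
            "VariantName": "production-variant",
            "ModelName": "<your-production-model>",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
        }
    ],
    "ShadowProductionVariants": [
        {
            "VariantName": "shadow-variant",
            "ModelName": "<your-shadow-model>",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,  # assumed: 50% traffic sampling
        }
    ],
}

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**endpoint_config_request)
```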

**Note**  
 SageMaker AI supports a maximum of one shadow variant per endpoint. For an endpoint with a shadow variant, there can be a maximum of one production variant. 

 You can schedule the test to start at any time and continue for a specified duration. The default duration is 7 days and the maximum is 30 days. After the test is complete, the endpoint reverts to the state it was in prior to starting the test. This ensures that you do not have to manually clean up resources upon the completion of the test. 

 You can monitor a test that is running through a dashboard in the SageMaker AI console. The dashboard provides a side by side comparison of invocation metrics and instance metrics between the production and shadow variants, along with a tabular view with relevant metric statistics. This dashboard is also available for completed tests. Once you have reviewed the metrics, you can either choose to promote the shadow variant to be the new production variant or retain the existing production variant. Once you promote the shadow variant, it responds to all incoming requests. For more information, see [Promote a shadow variant](shadow-tests-complete.md#shadow-tests-complete-promote). 

 The following procedure describes how to create a shadow test through the SageMaker AI console. There are variations in the workflow depending on whether you want to use an existing endpoint or to create a new endpoint for the shadow test. 

**Topics**
+ [Prerequisites](#shadow-tests-create-prerequisites)
+ [Enter shadow test details](#shadow-tests-create-console-shadow-test-details)
+ [Enter shadow test settings](#shadow-tests-create-console-shadow-test-settings)

## Prerequisites
<a name="shadow-tests-create-prerequisites"></a>

 Before creating a shadow test with the SageMaker AI console, you must have a SageMaker AI model ready to use. For more information about how to create a SageMaker AI model, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md). 

 You can get started with shadow tests with an existing endpoint with a production variant and a shadow variant, an existing endpoint with only a production variant, or just the SageMaker AI models you'd like to compare. Shadow tests support creating an endpoint and adding variants before your test begins. 

**Note**  
 Certain endpoint features may make your endpoint incompatible with shadow tests. If your endpoint uses any of the following features, you cannot use shadow tests on your endpoint, and your request to set up shadow tests will lead to validation errors:
+ Serverless inference
+ Asynchronous inference
+ Marketplace containers
+ Multiple-container endpoints
+ Multi-model endpoints
+ Endpoints that use Inf1 (Inferentia-based) instances

## Enter shadow test details
<a name="shadow-tests-create-console-shadow-test-details"></a>

 To start creating your shadow test, fill out the **Enter shadow test details** page by doing the following: 

1.  Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the left navigation panel, choose **Inference**, and then choose **Shadow tests**. 

1.  Choose **Create shadow test**. 

1.  Under **Name**, enter a name for the test. 

1.  (Optional) Under **Description**, enter a description for the test. 

1.  (Optional) Specify **Tags** using **Key** and **Value** pairs. 

1.  Choose **Next**. 

## Enter shadow test settings
<a name="shadow-tests-create-console-shadow-test-settings"></a>

 After filling out the **Enter shadow test details** page, fill out the **Enter shadow test settings** page. If you already have a SageMaker AI Inference endpoint and a production variant, follow the **Use an existing endpoint** workflow. If you don't already have an endpoint, follow the **Create a new endpoint** workflow. 

------
#### [ Use an existing endpoint ]

 If you want to use an existing endpoint for your test, fill out the **Enter shadow test settings** page by doing the following: 

1.  Choose a role that has the `AmazonSageMakerFullAccess` IAM policy attached. 

1.  Choose **Use an existing endpoint**, and then choose one of the available endpoints. 

1.  (Optional) To encrypt the storage volume on your endpoint, either choose an existing KMS key or choose **Enter a KMS key ARN** from the dropdown list under **Encryption key**. If you choose the second option, a field to enter the KMS key ARN appears. Enter the KMS key ARN in that field. 

1.  If you have multiple production variants behind that endpoint, remove the ones you don't want to use for the test. You can remove a model variant by selecting it and then choosing **Remove**. 

1.  If you do not already have a shadow variant, add a shadow variant. To add a shadow variant, do the following: 

   1.  Choose **Add**. 

   1.  Choose **Shadow variant**. 

   1.  In the **Add model** dialog box, choose the model you want to use for your shadow variant. 

   1.  Choose **Save**. 

1.  (Optional) In the preceding step, the shadow variant is added with the default settings. To modify these settings, select the shadow variant and choose **Edit**. The **Edit shadow variant** dialog box appears. For more information on filling out this dialog box, see [Edit a shadow test](shadow-tests-view-monitor-edit-individual.md). 

1.  In the **Schedule** section, enter the duration of the test by doing the following: 

   1.  Choose the box under **Duration**. A pop-up calendar appears. 

   1.  Select the start and end dates from the calendar, or enter the start and end dates in the fields for **Start date** and **End date**, respectively. 

   1.  (Optional) For the fields **Start time** and **End time**, enter the start and end times, respectively, in 24-hour format. 

   1.  Choose **Apply**. 

    The minimum duration is 1 hour, and the maximum duration is 30 days. 

1.  (Optional) Turn on **Enable data capture** to save inference request and response information from your endpoint to an Amazon S3 bucket, and then enter the location of the Amazon S3 bucket. 

1.  Choose **Create shadow test**. 

------
#### [ Create a new endpoint ]

 If you don't have an existing endpoint, or you want to create a new endpoint for your test, fill out the **Enter shadow test settings** page by doing the following: 

1.  Choose a role that has the `AmazonSageMakerFullAccess` IAM policy attached. 

1.  Choose **Create a new endpoint**. 

1.  Under **Name**, enter a name for the endpoint. 

1.  Add one production variant and one shadow variant to the endpoint: 
   +  To add a production variant choose **Add**, and then choose **Production variant**. In the **Add model** dialog box, choose the model you want to use for your production variant, and then choose **Save**. 
   +  To add a shadow variant choose **Add**, and then choose **Shadow variant**. In the **Add model** dialog box, choose the model you want to use for your shadow variant, and then choose **Save**. 

1.  (Optional) In the preceding step, the shadow variant is added with the default settings. To modify these settings, select the shadow variant and choose **Edit**. The **Edit shadow variant** dialog box appears. For more information on filling out this dialog box, see [Edit a shadow test](shadow-tests-view-monitor-edit-individual.md). 

1.  In the **Schedule** section, enter the duration of the test by doing the following: 

   1.  Choose the box under **Duration**. A pop-up calendar appears. 

   1.  Select the start and end dates from the calendar, or enter the start and end dates under **Start date** and **End date**, respectively. 

   1.  (Optional) Under **Start time** and **End time**, enter the start and end times, respectively, in 24-hour format. 

   1.  Choose **Apply**. 

    The minimum duration is 1 hour, and the maximum duration is 30 days. 

1.  (Optional) Turn on **Enable data capture** to save inference request and response information from your endpoint to an Amazon S3 bucket, and then enter the location of the Amazon S3 bucket. 

1.  Choose **Create shadow test**. 

------

 After completing the preceding procedures, you should now have a test scheduled to begin at your specified start date and time. You can view the progress of the test from a dashboard. For more information about viewing your test and the actions you can take, see [How to view, monitor, and edit shadow tests](shadow-tests-view-monitor-edit.md). 

# How to view, monitor, and edit shadow tests
<a name="shadow-tests-view-monitor-edit"></a>

 You can view the statuses of your shadow tests, monitor their progress from a dashboard, and perform actions, such as starting or stopping a test early or deleting a test. The following topics show how you can view and modify your shadow tests using the SageMaker AI console. 

**Topics**
+ [View shadow tests](shadow-tests-view-monitor-edit-list.md)
+ [Monitor a shadow test](shadow-tests-view-monitor-edit-dashboard.md)
+ [Start a shadow test early](shadow-tests-view-monitor-edit-start.md)
+ [Delete a shadow test](shadow-tests-view-monitor-edit-delete.md)
+ [Edit a shadow test](shadow-tests-view-monitor-edit-individual.md)

# View shadow tests
<a name="shadow-tests-view-monitor-edit-list"></a>

 You can view the statuses of all of your shadow tests on the **Shadow tests** page on the SageMaker AI console. 

 To view your tests in the console, do the following: 

1.  Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the navigation panel, choose **Inference**. 

1.  Choose **Shadow tests** to view the page that lists all of your shadow tests. The page should look like the following screenshot, with all the tests listed under the **Shadow test** section.   
![\[List of all shadow tests.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-test-landing-page.png)

 You can see the status of a test in the console on the **Shadow tests** page by checking the **Status** field for the test. 

 The following are the possible statuses for a test: 
+  `Creating` – SageMaker AI is creating your test. 
+  `Created` – SageMaker AI has finished creating your test, and it will begin at the scheduled time. 
+  `Updating` – When you make changes to your test, your test shows as updating. 
+  `Starting` – SageMaker AI is beginning your test. 
+  `Running` – Your test is in progress. 
+  `Stopping` – SageMaker AI is stopping your test. 
+  `Completed` – Your test has completed. 
+  `Cancelled` – When you conclude your test early, it shows as cancelled. 

# Monitor a shadow test
<a name="shadow-tests-view-monitor-edit-dashboard"></a>

 You can view the details of a shadow test and monitor it while it is in progress or after it has completed. SageMaker AI presents a live dashboard that compares operational metrics, such as model latency and error rate, aggregated for the production and shadow variants. 

 To view the details of an individual test in the console, do the following: 

1.  Select the test you want to monitor from the **Shadow test** section on the **Shadow tests** page. 

1.  From the **Actions** dropdown list, choose **View**. An overview page with the details of the test and a metrics dashboard appears. 

The overview page has the following three sections.

**Summary**  
 This section summarizes the progress and status of the test. It also shows the summary statistics of the metric chosen from the **Select metric** dropdown list in the **Metrics** subsection. The following screenshot shows this section.   

![\[Summary section of the overview page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-test-summary.png)

 In the preceding screenshot, the **Settings** and **Details** tabs show the settings that you selected and the details that you entered when creating the test. 

**Analysis**  
 This section shows a metrics dashboard with separate graphs for the following metrics:   
+ `Invocations`
+ `InvocationsPerInstance`
+ `ModelLatency`
+ `Invocation4XXErrors`
+ `Invocation5XXErrors`
+ `InvocationModelErrors`
+ `CPUUtilization`
+ `MemoryUtilization`
+ `DiskUtilization`
 The last three metrics monitor the model container runtime resource usage. The rest are CloudWatch metrics that you can use to analyze the performance of your variant. In general, fewer errors indicate a more stable model. A lower latency indicates either a faster model or a faster infrastructure. For more information about CloudWatch metrics, see [SageMaker AI endpoint invocation metrics](monitoring-cloudwatch.md#cloudwatch-metrics-endpoint-invocation). The following screenshot shows the metrics dashboard.   

![\[Metrics analysis dashboard.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-test-analysis.png)
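You can also retrieve the same invocation metrics programmatically from CloudWatch to compare the variants outside the console. The following sketch builds the request parameters for `GetMetricStatistics`; the endpoint and variant names are placeholders.

```python
import datetime

def latency_stats_params(endpoint_name, variant_name, hours=3):
    """Parameters for CloudWatch GetMetricStatistics on a variant's ModelLatency."""
    end = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - datetime.timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }

# With AWS credentials configured, compare the two variants:
# import boto3
# cw = boto3.client("cloudwatch")
# prod = cw.get_metric_statistics(**latency_stats_params("my-endpoint", "production-variant"))
# shadow = cw.get_metric_statistics(**latency_stats_params("my-endpoint", "shadow-variant"))
```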


**Environment**  
 This section shows the variants that you compared in the test. If you are satisfied with the performance of the shadow variant based on the preceding metrics, you can promote the shadow variant to production by choosing **Deploy shadow variant**. For more details about deploying a shadow variant, see [Promote a shadow variant](shadow-tests-complete.md#shadow-tests-complete-promote). You can also change the traffic sampling percentage and continue testing by choosing **Edit traffic**. For more details about editing a shadow variant, see [Edit a shadow test](shadow-tests-view-monitor-edit-individual.md). The following screenshot shows this section.   

![\[Environment section of the overview page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-test-environment.png)


# Start a shadow test early
<a name="shadow-tests-view-monitor-edit-start"></a>

 You can start your test before its scheduled start time. If the new duration of the test exceeds 30 days, SageMaker AI automatically sets the end of the test to 30 days after the new start time. This action starts the test immediately. If you want to change the start or end time of the test, see [Edit a shadow test](shadow-tests-view-monitor-edit-individual.md). 

 To immediately start your test before its scheduled start time through the console, do the following: 

1.  Select the test you want to start immediately from the **Shadow test** section on the **Shadow tests** page. 

1.  From the **Actions** dropdown list, choose **Start**. The **Start shadow test?** dialog box appears. 

1.  Choose **Start now**. 

# Delete a shadow test
<a name="shadow-tests-view-monitor-edit-delete"></a>

 You can delete a test that you no longer need. Deleting your test only deletes the test metadata, not your endpoint, variants, or data captured in Amazon S3. If you want your endpoint to stop running, you must delete your endpoint. For more information about deleting an endpoint, see [Delete Endpoints and Resources](realtime-endpoints-delete-resources.md). 

 To delete a test through the console, do the following: 

1.  Select the test you want to delete from the **Shadow test** section on the **Shadow tests** page. 

1.  From the **Actions** dropdown list, choose **Delete**. The **Delete shadow test** dialog box appears. 

1.  In the **To confirm deletion, type *delete* in the field.** text box, enter **delete**. 

1.  Choose **Delete**. 

# Edit a shadow test
<a name="shadow-tests-view-monitor-edit-individual"></a>

 You can modify both scheduled and in-progress tests. Before your test starts, you can change the description, the shadow variant configuration, the start date, and the end date of the test. You can also turn on or turn off data capture. 

 After your test starts, you can only change the description, the traffic sampling percentage for the shadow variant, and the end date. 

 To edit the details of your test through the console, do the following: 

1.  Select the test you want to edit from the **Shadow test** section on the **Shadow tests** page. 

1.  From the **Actions** dropdown list, choose **Edit**. The **Enter shadow test details** page appears. 

1.  (Optional) Under **Description**, enter a description of your test. 

1.  Choose **Next**. The **Enter shadow test settings** page appears. 

1.  (Optional) To edit your shadow variant, do the following: 

   1.  Select the shadow variant and choose **Edit**. The **Edit shadow variant** dialog box appears. If your test has already started, then you can only change the traffic sampling percentage. 

   1.  (Optional) Under **Name**, enter the new name to replace the old name. 

   1.  (Optional) Under **Traffic sample**, enter the new traffic sampling percentage to replace the old traffic sampling percentage. 

   1.  (Optional) Under **Instance type**, select the new instance type from the dropdown list. 

   1.  (Optional) Under **Instance count**, enter the new instance count to replace the old instance count. 

   1.  Choose **Apply**. 

     You cannot change the model in your shadow variant using the preceding procedure. If you want to change the model, first remove the shadow variant by selecting it and choosing **Remove**. Then add a new shadow variant. 

1.  (Optional) To edit the duration of the test, do the following: 

   1.  Choose the box under **Duration** in the **Schedule** section. A pop-up calendar appears. 

   1.  If your test is yet to start, you can change both the start and end dates. Select the new start and end dates from the calendar, or enter the new start and end dates under **Start date** and **End date**, respectively. 

       If your test has already started, you can only change the end date. Enter the new end date under **End date**. 

   1.  (Optional) If your test is yet to start, you can change both the start and end times. Enter the new start and end times under **Start time** and **End time**, respectively, in 24-hour format. 

       If your test has already started, you can only change the end time. Enter the new end time under **End time** in 24-hour format. 

   1.  Choose **Apply**. 

1.  (Optional) Turn on or turn off **Enable data capture**. 

1.  Choose **Update shadow test**. 

# Complete a shadow test
<a name="shadow-tests-complete"></a>

 Your test automatically completes at the end of the scheduled duration, or you can stop an in-progress test early. After your test has completed, the test’s status in the **Shadow tests** section on the **Shadow tests** page shows as **Completed**. Then you can review and analyze the final metrics of your test. 

 You can use the metrics dashboard to decide whether to promote the shadow variant to production. For more information about analyzing the metrics dashboard of your test, see [Monitor a shadow test](shadow-tests-view-monitor-edit-dashboard.md). 

 For instructions on how to complete your test before the end of its scheduled completion time, see [Complete a shadow test early](#shadow-tests-complete-early). 

 For instructions on promoting your shadow variant to production, see [Promote a shadow variant](#shadow-tests-complete-promote). 

## Complete a shadow test early
<a name="shadow-tests-complete-early"></a>

 One reason you might want to complete an in-progress shadow test early is that you’ve decided that the metrics for your shadow variant look good and you want to promote it to production. You might also decide to complete the test early if one or more of the variants aren’t performing well. 

 To complete your test before its scheduled end date, do the following: 

1.  Select the test you want to mark complete from the **Shadow tests** section on the **Shadow tests** page. 

1.  From the **Actions** dropdown list, choose **Complete**. The **Complete shadow test** dialog box appears. 

1.  In the dialog box, choose one of the following options: 
   + **Yes, deploy shadow variant**
   + **No, remove shadow variant**

1.  (Optional) In the **Comment** text box, enter your reason for completing the test before its scheduled end time. 

1.  Do one of the following: 

   1.  If you decided to deploy the shadow variant, choose **Complete and proceed to deploy**. The **Deploy shadow variant** page appears. For instructions on how to fill out this page, see [Promote a shadow variant](#shadow-tests-complete-promote). 

   1.  If you decided to remove the shadow variant, choose **Confirm**. 
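The same promote-or-remove decision can also be made programmatically with the `StopInferenceExperiment` API. The following is a minimal Boto3 sketch; the test and variant names and the reason text are placeholder assumptions.

```python
def complete_test_request(test_name, shadow_variant_name, promote):
    """Request body for sagemaker.stop_inference_experiment: end the test now
    and either promote or remove the shadow variant."""
    return {
        "Name": test_name,
        # Valid actions per variant include "Promote", "Remove", and "Retain".
        "ModelVariantActions": {shadow_variant_name: "Promote" if promote else "Remove"},
        "DesiredState": "Completed",
        "Reason": "Metrics reviewed; completing ahead of schedule.",
    }

# With AWS credentials configured:
# import boto3
# boto3.client("sagemaker").stop_inference_experiment(
#     **complete_test_request("my-shadow-test", "shadow-variant", promote=True))
```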

## Promote a shadow variant
<a name="shadow-tests-complete-promote"></a>

 If you’ve decided that you want to replace your production variant with your shadow variant, you can update your endpoint and promote your shadow variant to respond to inference requests. This removes your current production variant from production and replaces it with your shadow variant. 

 If your shadow test is still in-progress, you must first complete your test. To complete your shadow test before its scheduled end, follow the instructions in [Complete a shadow test early](#shadow-tests-complete-early) before continuing with this section. 

 When you promote a shadow variant to production, you have the following options for the instance count and type of the shadow variant. 
+  You can retain the instance count and type from the production variant. If you select this option, then your shadow variant launches in production with the current instance count, ensuring that your model can continue to process request traffic at the same scale. 
+  You can retain the instance count and type of your shadow variant. If you want to use this option, we recommend that you shadow test with 100 percent traffic sampling to ensure that the shadow variant can process request traffic at the current scale. 
+  You can use custom values for the instance count and type. If you want to use this option, we recommend that you shadow test with 100 percent traffic sampling to ensure that the shadow variant can process request traffic at the current scale. 

 Unless you are validating the instance type, the instance count, or both for the shadow variant, we highly recommend that you retain the instance count and type from the production variant when promoting your shadow variant. 

 To promote your shadow variant, do the following: 

1.  If your test has completed, do the following: 

   1.  Select the test from the **Shadow test** section on the **Shadow tests** page. 

   1.  From the **Actions** dropdown list, choose **View**. The dashboard appears. 

   1.  Choose **Deploy shadow variant** in the **Environment** section. The **Deploy shadow variant** page appears. 

    If your test has not completed, see [Complete a shadow test early](#shadow-tests-complete-early) to complete it. 

1.  In the **Variant settings** section, select one of the following options: 
   + **Retain production settings**
   + **Retain shadow settings**
   + **Custom instance settings**

    If you selected **Custom instance settings**, do the following: 

   1.  Select the instance type from the **Instance type** dropdown list. 

   1.  Under **Instance count**, enter the number of instances. 

1.  In the **Enter 'deploy' to confirm deployment** text box, enter **deploy**. 

1.  Choose **Deploy shadow variant**. 

 Your SageMaker AI Inference endpoint is now using the shadow variant as your production variant, and your production variant has been removed from the endpoint. 

# Best practices
<a name="shadow-tests-best-practices"></a>

 When creating an inference experiment, keep the following information in mind: 
+  **Traffic sampling percentage** – Sampling 100 percent of the inference requests lets you validate that your shadow variant can handle production traffic when promoted. You can start with a lower traffic sampling percentage and increase it as you gain confidence in your variant, but it is a best practice to increase the traffic to 100 percent before promotion. 
+  **Instance type** – Unless you are using shadow variants to evaluate alternate instance types or sizes, we recommend that you use the same instance type, size, and count so that you can be certain that your shadow variant can handle the volume of inference requests after you promote it. 
+  **Auto scaling** – To ensure that your shadow variant can respond to spikes in the number of inference requests or changes in inference requests patterns, we highly recommend that you configure autoscaling on your shadow variants. To learn how to configure autoscaling, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md). If you have configured autoscaling, you can also validate changes to autoscaling policies without causing impact to users. 
+  **Metrics monitoring** – After you initiate a shadow experiment and have sufficient invocations, monitor the metrics dashboard to ensure that the metrics such as latency and error rate are within acceptable bounds. This helps you catch misconfigurations early and take corrective action. For information about how to monitor the metrics of an in-progress inference experiment, see [How to view, monitor, and edit shadow tests](shadow-tests-view-monitor-edit.md). 
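As a sketch of the autoscaling best practice above, the following shows one way to attach a target-tracking policy to a variant with Application Auto Scaling. The endpoint and variant names, capacity bounds, and target value are placeholder assumptions; tune them for your own traffic.

```python
def variant_scaling(endpoint_name, variant_name, min_capacity=1, max_capacity=4):
    """Target-tracking autoscaling configuration for one endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Invocations per instance per minute; a placeholder value.
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

# With AWS credentials configured:
# import boto3
# aas = boto3.client("application-autoscaling")
# t, p = variant_scaling("my-endpoint", "shadow-variant")
# aas.register_scalable_target(**t)
# aas.put_scaling_policy(**p)
```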

# Access containers through SSM
<a name="ssm-access"></a>

 Amazon SageMaker AI allows you to securely connect to the Docker containers on which your models are deployed for inference using AWS Systems Manager (SSM). This gives you shell-level access to the container so that you can debug the processes running within it and log commands and responses with Amazon CloudWatch. You can also set up an AWS PrivateLink connection to the ML instances that host your containers to access the containers through SSM privately. 

**Warning**  
 Enabling SSM access can impact the performance of your endpoint. We recommend using this feature with your development or test endpoints, not with endpoints in production. Also, SageMaker AI automatically applies security patches and replaces or terminates faulty endpoint instances within 10 minutes. However, for endpoints with SSM-enabled production variants, SageMaker AI delays security patching and the replacement or termination of faulty endpoint instances by one day to allow you to debug. 

 The following sections detail how you can use this feature. 

## Allowlist
<a name="ssm-access-allowlist"></a>

 To use this feature, you must contact customer support and get your account allowlisted. If your account is not allowlisted, you cannot create an endpoint with SSM access enabled. 

## Enable SSM access
<a name="ssm-access-enable"></a>

 To enable SSM access for an existing container on an endpoint, update the endpoint with a new endpoint configuration that has the `EnableSSMAccess` parameter set to `true`. The following example provides a sample endpoint configuration. 

```
{
    "EndpointConfigName": "endpoint-config-name",
    "ProductionVariants": [
        {
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
            "InstanceType": "ml.t2.medium",
            "ModelName": "model-name",
            "VariantName": "variant-name",
            "EnableSSMAccess": true
        }
    ]
}
```

 For more information on enabling SSM access, see [EnableSSMAccess](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html#API_EnableSSMAccess). 
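As a sketch, the following shows one way to apply this configuration with the AWS SDK for Python (Boto3). The endpoint, configuration, model, and variant names are placeholders, and the account must be allowlisted for SSM access.

```python
def ssm_enabled_variant(model_name, variant_name, instance_type="ml.t2.medium"):
    """Production variant definition with SSM access turned on."""
    return {
        "ModelName": model_name,
        "VariantName": variant_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
        "EnableSSMAccess": True,  # requires an allowlisted account
    }

def enable_ssm_access(endpoint_name, config_name, model_name, variant_name):
    """Create a new endpoint config and point the endpoint at it.
    Requires AWS credentials; all names are placeholders."""
    import boto3  # imported here so the helper above works without boto3 installed
    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[ssm_enabled_variant(model_name, variant_name)],
    )
    # Updating rather than recreating keeps the endpoint name and avoids downtime.
    sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
```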

## IAM configuration
<a name="ssm-access-iam"></a>

### Endpoint IAM permissions
<a name="ssm-access-iam-endpoint"></a>

 If you have enabled SSM access for an endpoint instance, SageMaker AI starts and manages the [SSM agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html) when it initiates the endpoint instance. To allow the SSM agent to communicate with the SSM services, add the following policy to the execution role that the endpoint runs under. 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"    
        }
    ]
 }
```

------

### User IAM permissions
<a name="ssm-access-iam-user"></a>

 Add the following policy to give an IAM user SSM session permissions to connect to an SSM target. 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

 You can restrict the endpoints that an IAM user can connect to, with the following policy. Replace the *italicized placeholder text* with your own information. 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession"
            ],
            "Resource": "arn:aws:sagemaker:us-east-2:111122223333:endpoint/endpoint-name"    
        }
    ]
}
```

------

## SSM access with AWS PrivateLink
<a name="ssm-access-privatelink"></a>

 If your endpoints run within a virtual private cloud (VPC) that is not connected to the public internet, you can use AWS PrivateLink to enable SSM. AWS PrivateLink restricts all network traffic between your endpoint instances, SSM, and Amazon EC2 to the Amazon network. For more information on how to set up SSM access with AWS PrivateLink, see [Set up a VPC endpoint for Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html). 

## Logging with Amazon CloudWatch Logs
<a name="ssm-access-logging"></a>

 For SSM access enabled endpoints, you can log errors from the SSM agent with Amazon CloudWatch Logs. For more information on how to log errors with CloudWatch Logs, see [Logging session activity](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-logging.html). The log is available at the SSM log stream, `variant-name/ec2-instance-id/ssm`, under the endpoint log group `/aws/sagemaker/endpoints/endpoint-name`. For more information on how to view the log, see [View log data sent to CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#ViewingLogData). 

 Production variants behind your endpoint can have multiple model containers. The log for each model container is recorded in the log stream. Each log entry is preceded by `[sagemaker ssm logs][container-name]`, where `container-name` is either the name that you gave to the container or a default name, such as `container_0` or `container_1`. 
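The log group and stream names described above can be assembled as follows, after which `get_log_events` retrieves the entries. The endpoint, variant, and instance IDs are placeholders.

```python
def ssm_log_location(endpoint_name, variant_name, instance_id):
    """Log group and stream where SageMaker AI writes the SSM agent log."""
    return {
        "logGroupName": f"/aws/sagemaker/endpoints/{endpoint_name}",
        "logStreamName": f"{variant_name}/{instance_id}/ssm",
    }

# With AWS credentials configured:
# import boto3
# events = boto3.client("logs").get_log_events(
#     **ssm_log_location("my-endpoint", "variant1", "i-0123456789abcdef0"))
```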

## Accessing model containers
<a name="ssm-access-container"></a>

 To access a model container on your endpoint instance, you need its target ID. The target ID is in one of the following formats: 
+  `sagemaker-endpoint:endpoint-name_variant-name_ec2-instance-id` for containers on single container endpoints 
+  `sagemaker-endpoint:endpoint-name_variant-name_ec2-instance-id_container-name` for containers on multi-container endpoints 

 The following example shows how you can use the AWS CLI to access a model container using its target ID. 

```
aws ssm start-session --target sagemaker-endpoint:prod-image-classifier_variant1_i-003a121c1b21a90a9_container_1
```
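Because the target ID is an underscore-joined convention of these names, you can assemble it with a small hypothetical helper like the following.

```python
def ssm_target_id(endpoint_name, variant_name, instance_id, container_name=None):
    """Build the SSM target ID for a model container on an endpoint instance."""
    target = f"sagemaker-endpoint:{endpoint_name}_{variant_name}_{instance_id}"
    if container_name:  # only multi-container endpoints include the container name
        target += f"_{container_name}"
    return target

# Reproduces the target ID used in the preceding CLI example:
print(ssm_target_id("prod-image-classifier", "variant1",
                    "i-003a121c1b21a90a9", "container_1"))
# → sagemaker-endpoint:prod-image-classifier_variant1_i-003a121c1b21a90a9_container_1
```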

 If you enable logging, as mentioned in [Logging with Amazon CloudWatch Logs](#ssm-access-logging), you can find the target IDs for all the containers listed at the beginning of the SSM log stream. 

**Note**  
 You cannot connect with SSM to 1P algorithm containers or to containers of models obtained from AWS Marketplace. However, you can connect to Deep Learning Containers (DLCs) provided by AWS or to any custom container that you own.  
 If you have enabled network isolation for a model container, which prevents it from making outbound network calls, you cannot start an SSM session for that container.  
 You can access only one container from one SSM session. To access another container, even if it is behind the same endpoint, start a new SSM session with the target ID of that container. 

# Model servers for model deployment with Amazon SageMaker AI
<a name="deploy-model-frameworks"></a>

You can use popular model servers, such as TorchServe, DJL Serving, and Triton Inference Server, to deploy your models on SageMaker AI. The following topics explain how.

**Topics**
+ [Deploy models with TorchServe](deploy-models-frameworks-torchserve.md)
+ [Deploy models with DJL Serving](deploy-models-frameworks-djl-serving.md)
+ [Model deployment with Triton Inference Server](deploy-models-frameworks-triton.md)

# Deploy models with TorchServe
<a name="deploy-models-frameworks-torchserve"></a>

TorchServe is the recommended model server for PyTorch, preinstalled in the AWS PyTorch Deep Learning Container (DLC). This powerful tool offers customers a consistent and user-friendly experience, delivering high performance in deploying multiple PyTorch models across various AWS instances, including CPU, GPU, Neuron, and Graviton, regardless of the model size or distribution.

TorchServe supports a wide array of advanced features, including dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. It also integrates PiPPy, PyTorch's large model solution, enabling efficient handling of large models, and extends support to popular open-source libraries like DeepSpeed, Accelerate, Fast Transformers, and more. With TorchServe, AWS users can confidently deploy and serve their PyTorch models, taking advantage of its versatility and optimized performance across various hardware configurations and model types. For more detailed information, see the [PyTorch documentation](https://pytorch.org/serve/) and [TorchServe on GitHub](https://github.com/pytorch/serve).

The following table lists the AWS PyTorch DLCs supported by TorchServe.


| Instance type | SageMaker AI PyTorch DLC link | 
| --- | --- | 
| CPU and GPU | [SageMaker AI PyTorch containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) | 
| Neuron | [PyTorch Neuron containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) | 
| Graviton | [SageMaker AI PyTorch Graviton containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-graviton-containers-sm-support-only) | 

The following sections describe the setup to build and test PyTorch DLCs on Amazon SageMaker AI.

## Getting started
<a name="deploy-models-frameworks-torchserve-prereqs"></a>

To get started, ensure that you have the following prerequisites:

1. Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AWSServiceRoleForAmazonEKSNodegroup](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSServiceRoleForAmazonEKSNodegroup)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

1. Locally configure your dependencies, as shown in the following example:

   ```
   from datetime import datetime
   import os
   import json
   import logging
   import time

   # External Dependencies:
   import boto3
   from botocore.exceptions import ClientError
   import sagemaker

   sess = boto3.Session()
   sm = sess.client("sagemaker")
   region = sess.region_name
   account = boto3.client("sts").get_caller_identity().get("Account")

   smsess = sagemaker.Session(boto_session=sess)
   role = sagemaker.get_execution_role()

   # Configuration:
   bucket_name = smsess.default_bucket()
   prefix = "torchserve"
   output_path = f"s3://{bucket_name}/{prefix}/models"
   print(f"account={account}, region={region}, role={role}")
   ```

1. Retrieve the PyTorch DLC image, as shown in the following example.

   SageMaker AI PyTorch DLC images are available in all AWS regions. For more information, see the [list of DLC container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).

   ```
   baseimage = sagemaker.image_uris.retrieve(
       framework="pytorch",
       region="<region>",
       py_version="py310",
       image_scope="inference",
       version="2.0.1",
       instance_type="ml.g4dn.16xlarge",
   )
   ```

1. Create a local workspace.

   ```
   mkdir -p workspace/
   ```

## Adding a package
<a name="deploy-models-frameworks-torchserve-package"></a>

The following sections describe how to add and preinstall packages to your PyTorch DLC image.

**BYOC use cases**

The following steps outline how to add a package to your PyTorch DLC image. For more information about customizing your container, see [Building AWS Deep Learning Containers Custom Images](https://github.com/aws/deep-learning-containers/blob/master/custom_images.md).

1. Suppose you want to add a package to the PyTorch DLC docker image. Create a Dockerfile under the `docker` directory, as shown in the following example:

   ```
   mkdir -p workspace/docker
   cat workspace/docker/Dockerfile

   ARG BASE_IMAGE

   FROM $BASE_IMAGE

   # Install any additional libraries
   RUN pip install transformers==4.28.1
   ```

1. Build and publish the customized docker image by using the following [build_and_push.sh](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/workspace/docker/build_and_push.sh) script.

   ```
   # Download the build_and_push.sh script to workspace/docker
   ls workspace/docker
   build_and_push.sh  Dockerfile

   # Build and publish your docker image
   reponame="torchserve"
   versiontag="demo-0.1"

   ./build_and_push.sh $reponame $versiontag $baseimage $region $account
   ```

**SageMaker AI preinstall use cases**

The following example shows you how to preinstall a package to your PyTorch DLC container. You must create a `requirements.txt` file locally under the directory `workspace/code`.

```
mkdir -p workspace/code
cat workspace/code/requirements.txt

transformers==4.28.1
```

## Create TorchServe model artifacts
<a name="deploy-models-frameworks-torchserve-artifacts"></a>

In the following example, we use the pre-trained [MNIST model](https://github.com/pytorch/serve/tree/master/examples/image_classifier/mnist). We create a directory `workspace/mnist-dev`, implement [mnist_handler.py](https://github.com/pytorch/serve/blob/master/examples/image_classifier/mnist/mnist_handler.py) by following the [TorchServe custom service instructions](https://github.com/pytorch/serve/blob/master/docs/custom_service.md#custom-service), and [configure the model parameters](https://github.com/pytorch/serve/tree/master/model-archiver#config-file) (such as batch size and workers) in [model-config.yaml](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/workspace/lama/model-config.yaml). Then, we use the TorchServe tool `torch-model-archiver` to build the model artifacts and upload them to Amazon S3.

1. Configure the model parameters in `model-config.yaml`.

   ```
   ls -al workspace/mnist-dev

   mnist.py
   mnist_handler.py
   mnist_cnn.pt
   model-config.yaml

   # configure the model
   cat workspace/mnist-dev/model-config.yaml
   minWorkers: 1
   maxWorkers: 1
   batchSize: 4
   maxBatchDelay: 200
   responseTimeout: 300
   ```

1. Build the model artifacts by using [torch-model-archiver](https://github.com/pytorch/serve/tree/master/model-archiver#torch-model-archiver-for-torchserve).

   ```
   torch-model-archiver --model-name mnist --version 1.0 --model-file workspace/mnist-dev/mnist.py --serialized-file workspace/mnist-dev/mnist_cnn.pt --handler workspace/mnist-dev/mnist_handler.py --config-file workspace/mnist-dev/model-config.yaml --archive-format tgz
   ```

   If you want to preinstall a package, you must include the `code` directory in the `tar.gz` file.

   ```
   cd workspace
   torch-model-archiver --model-name mnist --version 1.0 --model-file mnist-dev/mnist.py --serialized-file mnist-dev/mnist_cnn.pt --handler mnist-dev/mnist_handler.py --config-file mnist-dev/model-config.yaml --archive-format no-archive

   cd mnist
   mv ../code .
   tar cvzf mnist.tar.gz .
   ```

1. Upload `mnist.tar.gz` to Amazon S3.

   ```
   # upload mnist.tar.gz to S3
   output_path="s3://${bucket_name}/${prefix}/models"
   aws s3 cp mnist.tar.gz ${output_path}/mnist.tar.gz
   ```

## Using single model endpoints to deploy with TorchServe
<a name="deploy-models-frameworks-torchserve-single-model"></a>

The following example shows you how to create a [single model real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html), deploy the model to the endpoint, and test the endpoint by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).

```
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# create the single model endpoint and deploy it on SageMaker AI
model = Model(model_data = f'{output_path}/mnist.tar.gz',
              image_uri = baseimage,
              role = role,
              predictor_cls = Predictor,
              name = "mnist",
              sagemaker_session = smsess)

endpoint_name = 'torchserve-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
predictor = model.deploy(instance_type='ml.g4dn.xlarge',
                         initial_instance_count=1,
                         endpoint_name = endpoint_name,
                         serializer=JSONSerializer(),
                         deserializer=JSONDeserializer())

# test the endpoint
import numpy as np

dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
res = predictor.predict(dummy_data)
```

## Using multi-model endpoints to deploy with TorchServe
<a name="deploy-models-frameworks-torchserve-multi-model"></a>

[Multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) are a scalable and cost-effective solution to hosting large numbers of models behind one endpoint. They improve endpoint utilization by sharing the same fleet of resources and serving container to host all of your models. They also reduce deployment overhead because SageMaker AI manages dynamically loading and unloading models, as well as scaling resources based on traffic patterns. Multi-model endpoints are particularly useful for deep learning and generative AI models that require accelerated compute power.

By using TorchServe on SageMaker AI multi-model endpoints, you can speed up your development by using a serving stack that you are familiar with while leveraging the resource sharing and simplified model management that SageMaker AI multi-model endpoints provide.

The following example shows you how to create a multi-model endpoint, deploy the model to the endpoint, and test the endpoint by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). Additional details can be found in this [notebook example](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/torchserve_multi_model_endpoint.ipynb).

```
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# create the multi-model endpoint and deploy it on SageMaker AI
model = Model(model_data = f'{output_path}/mnist.tar.gz',
              image_uri = baseimage,
              role = role,
              sagemaker_session = smsess)

endpoint_name = 'torchserve-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
mme = MultiDataModel(
    name = endpoint_name,
    model_data_prefix = output_path,
    model = model,
    sagemaker_session = smsess)

mme.deploy(
    initial_instance_count = 1,
    instance_type = "ml.g4dn.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer())

# list models
list(mme.list_models())

# create mnist v2 model artifacts, for example by copying the archive in a shell:
# cp mnist.tar.gz mnistv2.tar.gz

# add mnistv2
mme.add_model("mnistv2.tar.gz")

# list models
list(mme.list_models())

predictor = Predictor(endpoint_name=mme.endpoint_name, sagemaker_session=smsess)

# test the endpoint
import numpy as np

dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
res = predictor.predict(data=dummy_data, target_model="mnist.tar.gz")
```

## Metrics
<a name="deploy-models-frameworks-torchserve-metrics"></a>

TorchServe supports both system level and model level metrics. You can enable metrics in either log format mode or Prometheus mode through the environment variable `TS_METRICS_MODE`. You can use the TorchServe central metrics config file `metrics.yaml` to specify the types of metrics to be tracked, such as request counts, latency, memory usage, GPU utilization, and more. By referring to this file, you can gain insights into the performance and health of the deployed models and effectively monitor the TorchServe server's behavior in real-time. For more detailed information, see the [TorchServe metrics documentation](https://github.com/pytorch/serve/blob/master/docs/metrics.md#torchserve-metrics).

You can access TorchServe metrics logs that are similar to the StatsD format through the Amazon CloudWatch log filter. The following is an example of a TorchServe metrics log:

```
CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
DiskAvailable.Gigabytes:318.0416717529297|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
```
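Because each line follows the same `Name.Unit:value|#dimensions` pattern, you can parse these logs with a short script, for example when post-processing exported CloudWatch logs. This parser is a sketch, not part of TorchServe:

```python
import re

METRIC_PATTERN = re.compile(
    r"(?P<name>\w+)\.(?P<unit>\w+):(?P<value>[\d.]+)\|#(?P<dims>.*)"
)

def parse_torchserve_metric(line):
    """Parse a StatsD-like TorchServe metrics line into a dict."""
    match = METRIC_PATTERN.match(line.strip())
    if not match:
        return None
    return {
        "name": match.group("name"),
        "unit": match.group("unit"),
        "value": float(match.group("value")),
        "dimensions": match.group("dims"),
    }

line = "CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185"
metric = parse_torchserve_metric(line)
```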

# Deploy models with DJL Serving
<a name="deploy-models-frameworks-djl-serving"></a>

DJL Serving is a high-performance, universal, stand-alone model serving solution. It takes a deep learning model, several models, or workflows and makes them available through an HTTP endpoint.

You can use one of the DJL Serving [Deep Learning Containers (DLCs)](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html) to serve your models on AWS. To learn about the supported model types and frameworks, see the [DJL Serving GitHub repository](https://github.com/deepjavalibrary/djl-serving).

DJL Serving offers many features that help you to deploy your models with high performance:
+ Ease of use – DJL Serving can serve most models without any modifications. You bring your model artifacts, and DJL Serving can host them.
+ Multiple device and accelerator support – DJL Serving supports deploying models on CPUs, GPUs, and AWS Inferentia.
+ Performance – DJL Serving runs multithreaded inference in a single Java virtual machine (JVM) to boost throughput.
+ Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
+ Auto scaling – DJL Serving automatically scales workers up or down based on the traffic load.
+ Multi-engine support – DJL Serving can simultaneously host models using different frameworks (for example, PyTorch and TensorFlow).
+ Ensemble and workflow models – DJL Serving supports deploying complex workflows composed of multiple models and can execute parts of the workflow on CPUs and other parts on GPUs. Models within a workflow can leverage different frameworks.

The following sections describe how to set up an endpoint with DJL Serving on SageMaker AI.

## Getting started
<a name="deploy-models-frameworks-djl-prereqs"></a>

To get started, ensure that you have the following prerequisites:

1. Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

1. Ensure that you have the [docker](https://docs.docker.com/get-docker/) client set up on your system.

1. Log in to Amazon Elastic Container Registry and set the following environment variables:

   ```
   export ACCOUNT_ID=<your_account_id>
   export REGION=<your_region>
   aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
   ```

1. Pull the docker image.

   ```
   docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
   ```

   For all of the available DJL Serving container images, see the [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and the [DJL Serving CPU inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#djl-cpu-full-inference-containers). When choosing an image from the tables in the preceding links, replace the AWS region in the example URL column with the region you are in. The DLCs are available in the regions listed in the table at the top of the [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) page.

## Customize your container
<a name="deploy-models-frameworks-djl-byoc"></a>

You can add packages to the base DLC images to customize your container. Suppose you want to add a package to the `763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118` docker image. You must create a dockerfile with your desired image as the base image, add the required packages, and push the image to Amazon ECR.

To add a package, complete the following steps:

1. Specify instructions for running your desired libraries or packages in the base image's dockerfile.

   ```
   FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
                           
   ## add custom packages/libraries
   RUN git clone https://github.com/awslabs/amazon-sagemaker-examples
   ```

1. Build the Docker image from the dockerfile. Specify your Amazon ECR repository, the name of the base image, and a tag for the image. If you don't have an Amazon ECR repository, see [ Using Amazon ECR with the AWS CLI](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html) in the *Amazon ECR User Guide* for instructions on how to create one.

   ```
   docker build -f Dockerfile -t <registry>/<image_name>:<image_tag> .
   ```

1. Push the Docker image to your Amazon ECR repository.

   ```
   docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/<image_name>:<image_tag>
   ```

You should now have a customized container image that you can use for model serving. For more examples of customizing your container, see [Building AWS Deep Learning Containers Custom Images](https://github.com/aws/deep-learning-containers/blob/master/custom_images.md).

## Prepare your model artifacts
<a name="deploy-models-frameworks-djl-artifacts"></a>

Before deploying your model on SageMaker AI, you must package your model artifacts in a `.tar.gz` file. DJL Serving accepts the following artifacts in your archive:
+ Model checkpoint: Files that store your model weights.
+ `serving.properties`: A configuration file that you can add for each model. Place `serving.properties` in the same directory as your model file.
+ `model.py`: The inference handler code. This is only applicable when using Python mode. If you don't specify `model.py`, djl-serving uses one of the default handlers.

The following is an example of a `model.tar.gz` structure:

```
 - model_root_dir # root directory
    - serving.properties            
    - model.py # your custom handler file for Python, if you choose not to use the default handlers provided by DJL Serving
    - model binary files # used for Java mode, or if you don't want to use option.model_id and option.s3_url for Python mode
```

DJL Serving supports Java engines powered by DJL or Python engines. Not all of the preceding artifacts are required; the required artifacts vary based on the mode you choose. For example, in Python mode, you only need to specify `option.model_id` in the `serving.properties` file; you don't need to specify the model checkpoint inside LMI containers. In Java mode, you are required to package the model checkpoint. For more details on how to configure `serving.properties` and operate with different engines, see [DJL Serving Operation Modes](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md).
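For illustration, a minimal `serving.properties` for Python mode might look like the following sketch. The bucket path is a placeholder, and the exact set of supported keys depends on the container version:

```
engine=Python
option.model_id=s3://amzn-s3-demo-bucket/my-model/
option.tensor_parallel_degree=1
```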

## Use single model endpoints to deploy with DJL Serving
<a name="deploy-models-frameworks-djl-single-model"></a>

After preparing your model artifacts, you can deploy your model to a SageMaker AI endpoint. This section describes how to deploy a single model to an endpoint with DJL Serving. If you're deploying multiple models, skip this section and go to [Use multi-model endpoints to deploy with DJL Serving](#deploy-models-frameworks-djl-mme).

The following example shows you a method to create a model object using the Amazon SageMaker Python SDK. You'll need to specify the following fields:
+ `image_uri`: You can either retrieve one of the base DJL Serving images as shown in this example, or you can specify a custom Docker image from your Amazon ECR repository, if you followed the instructions in [Customize your container](#deploy-models-frameworks-djl-byoc).
+ `model_s3_url`: This should be an Amazon S3 URI that points to your `.tar.gz` file.
+ `model_name`: Specify a name for the model object.

```
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker import image_uris, get_execution_role

aws_region = "aws-region"
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=aws_region))
role = get_execution_role()

def create_model(model_name, model_s3_url):
    # Get the DJL DeepSpeed image uri
    image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sagemaker_session.boto_session.region_name,
        version="0.20.0"
    )
    model = Model(
        image_uri=image_uri,
        model_data=model_s3_url,
        role=role,
        name=model_name,
        sagemaker_session=sagemaker_session,
    )
    return model
```

## Use multi-model endpoints to deploy with DJL Serving
<a name="deploy-models-frameworks-djl-mme"></a>

If you want to deploy multiple models to an endpoint, SageMaker AI offers multi-model endpoints, which are a scalable and cost-effective solution to deploying large numbers of models. DJL Serving also supports loading multiple models simultaneously and running inference on each of the models concurrently. DJL Serving containers adhere to the SageMaker AI multi-model endpoints contracts and can be used to deploy multi-model endpoints.

Each individual model artifact needs to be packaged in the same way as described in the previous section [Prepare your model artifacts](#deploy-models-frameworks-djl-artifacts). You can set model-specific configurations in the `serving.properties` file and model-specific inference handler code in `model.py`. For a multi-model endpoint, models need to be arranged in the following way:

```
 root_dir
        |-- model_1.tar.gz
        |-- model_2.tar.gz
        |-- model_3.tar.gz
            .
            .
            .
```

The Amazon SageMaker Python SDK uses the [MultiDataModel](https://sagemaker.readthedocs.io/en/stable/api/inference/multi_data_model.html) object to instantiate a multi-model endpoint. The Amazon S3 URI for the root directory should be passed as the `model_data_prefix` argument to the `MultiDataModel` constructor.

DJL Serving also provides several configuration parameters to manage model memory requirements, such as `required_memory_mb` and `reserved_memory_mb`, that can be configured for each model in the [serving.properties](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md#servingproperties) file. These parameters are useful to handle out of memory errors more gracefully. For all of the configurable parameters, see [OutofMemory handling in djl-serving](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/out_of_memory_management.md).

The auto scaling feature of DJL Serving makes it easy to ensure that the models are scaled appropriately for incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (such as CPU cores or GPU devices). You can set lower and upper bounds for each model to ensure that a minimum traffic level can always be served, and that a single model does not consume all available resources. You can set the following properties in the [serving.properties](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md#servingproperties) file:
+ `gpu.minWorkers`: Minimum number of workers for GPUs.
+ `gpu.maxWorkers`: Maximum number of workers for GPUs.
+ `cpu.minWorkers`: Minimum number of workers for CPUs.
+ `cpu.maxWorkers`: Maximum number of workers for CPUs.
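For example, the following `serving.properties` entries (values are illustrative) keep between one and four workers per GPU and between one and two workers per CPU:

```
gpu.minWorkers=1
gpu.maxWorkers=4
cpu.minWorkers=1
cpu.maxWorkers=2
```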

For an end-to-end example of how to deploy a multi-model endpoint on SageMaker AI using a DJL Serving container, see the example notebook [Multi-Model-Inference-Demo.ipynb](https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/Multi-Model-Inference-Demo.ipynb).

# Model deployment with Triton Inference Server
<a name="deploy-models-frameworks-triton"></a>

[Triton Inference Server](https://github.com/triton-inference-server/server) is an open source inference serving software that streamlines AI inference. With Triton, you can deploy any model built with multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more.

The SageMaker AI Triton containers help you deploy Triton Inference Server on the SageMaker AI Hosting platform to serve trained models in production. It supports the different modes in which SageMaker AI operates. For a list of available Triton Inference Server containers available on SageMaker AI, see [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). 

For end-to-end notebook examples, we recommend taking a look at the [amazon-sagemaker-examples repository](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-triton).

## Hosting modes
<a name="deploy-models-frameworks-triton-modes"></a>

The following SageMaker AI Hosting modes are supported by Triton containers:
+ Single model endpoints
  + This is SageMaker AI’s default mode of operation. In this mode, the Triton container can load a single model, or a single ensemble model.
  + The name of the model must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker AI API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
+ Single model endpoints with ensemble
  + Triton Inference Server supports *ensemble*, which is a pipeline, or a DAG (directed acyclic graph), of models. While an ensemble technically comprises multiple models, in the default single model endpoint mode, SageMaker AI can treat the *ensemble proper* (the meta-model that represents the pipeline) as the main model to load, and can subsequently load the associated models.
  + The ensemble proper’s model name must be used to load the model. It must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
+ Multi-model endpoints
  + In this mode, SageMaker AI can serve multiple models on a single endpoint. You can use this mode by specifying the environment variable `‘MultiModel’: true` as a property of the container environment, which is part of the `CreateModel` SageMaker API call.
  + By default, no model is loaded when the instance starts. To run an inference request against a particular model, specify the corresponding model's `*.tar.gz` file as an argument to the `TargetModel` property of the `InvokeEndpoint` SageMaker API call.
+ Multi-model endpoints with ensemble
  + In this mode, SageMaker AI functions as described for multi-model endpoints. However, the SageMaker AI Triton container can load multiple ensemble models, meaning that multiple model pipelines can run on the same instance. SageMaker AI treats every ensemble as one model, and the ensemble proper of each model can be invoked by specifying the corresponding `*.tar.gz` archive as the `TargetModel`.
  + For better memory management during dynamic memory `LOAD` and `UNLOAD`, we recommend that you keep the ensemble size small.
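To route a request to a particular model on a multi-model endpoint, the client passes the archive name as `TargetModel`. The following sketch assembles the arguments for a boto3 `invoke_endpoint` call; the endpoint and model names are placeholders, and only the argument construction runs locally:

```python
import json

def build_invoke_args(endpoint_name, target_model, payload):
    """Assemble keyword arguments for a sagemaker-runtime invoke_endpoint
    call, routing the request to one model archive on a multi-model endpoint."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # e.g. "model_1.tar.gz"
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }

args = build_invoke_args("triton-mme-endpoint", "model_1.tar.gz",
                         {"inputs": [[1.0, 2.0]]})
# To send the request (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**args)
```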

## Inference payload types
<a name="deploy-models-frameworks-triton-payloads"></a>

Triton supports two methods of sending an inference payload over the network: `json` and `binary+json` (binary-encoded JSON). In both cases, the JSON payload includes the data type, the shape, and the actual inference request tensor. The request tensor must be a binary tensor.

With the `binary+json` format, you must specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. In the SageMaker AI Triton container, this is done using a custom `Content-Type` header: `application/vnd.sagemaker-triton.binary+json;json-header-size={}`. This is different from using the `Inference-Header-Content-Length` header on a stand-alone Triton Inference Server because custom headers are not allowed in SageMaker AI.
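As a sketch of what a client might do, the following builds the custom `Content-Type` header by serializing the request metadata, measuring its byte length, and appending the raw tensor bytes. The metadata fields and payload are illustrative:

```python
import json

def build_binary_json_payload(metadata, tensor_bytes):
    """Build a SageMaker AI Triton binary+json request body and its
    Content-Type header, which encodes the JSON metadata length."""
    header = json.dumps(metadata).encode("utf-8")
    content_type = (
        "application/vnd.sagemaker-triton.binary+json;"
        f"json-header-size={len(header)}"
    )
    # The binary tensor data immediately follows the JSON metadata.
    return content_type, header + tensor_bytes

metadata = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 4], "datatype": "FP32",
         "parameters": {"binary_data_size": 16}}
    ]
}
content_type, body = build_binary_json_payload(metadata, b"\x00" * 16)
```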

## Using config.pbtxt to set the model config
<a name="deploy-models-frameworks-triton-config"></a>

For Triton Inference Servers on SageMaker AI, each model must include a `config.pbtxt` file that specifies, at a minimum, the following configurations for the model:
+ `name`: While this is optional for models running outside of SageMaker AI, we recommend that you always provide a name for the models to be run in Triton on SageMaker AI.
+ [`platform` and/or `backend`](https://github.com/triton-inference-server/backend/blob/main/README.md#backends): Setting a backend is essential to specify the type of the model. Some backends have further classification, such as `tensorflow_savedmodel` or `tensorflow_graphdef`. Such options can be specified as part of the `platform` key in addition to the `backend` key. The most common backends are `tensorrt`, `onnxruntime`, `tensorflow`, `pytorch`, `python`, `dali`, `fil`, and `openvino`.
+ `input`: Specify three attributes for the input: `name`, `data_type` and `dims` (the shape).
+ `output`: Specify three attributes for the output: `name`, `data_type` and `dims` (the shape).
+ `max_batch_size`: Set the batch size to a value greater than or equal to 1 that indicates the maximum batch size that Triton should use with the model.
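Taken together, a minimal `config.pbtxt` for a hypothetical ONNX image-classification model might look like the following (the model and tensor names are example values):

```
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input_0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "output_0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
```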

For more details on configuring `config.pbtxt`, see Triton’s GitHub [repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md). Triton provides several configurations for tweaking model behavior. Some of the most common and important configuration options are:
+ [Instance groups](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups): Instance groups specify the number and placement of instances for a given model. They have the attributes `count`, `kind`, and `gpus` (used when `kind` is `KIND_GPU`). The `count` attribute is equivalent to the number of workers. For regular model serving, each worker has its own copy of the model. Similarly, in Triton, `count` specifies the number of model copies per device. For example, if the `instance_group` kind is `KIND_CPU`, then the CPU has `count` model copies.
**Note**  
On a GPU instance, the `instance_group` configuration applies per GPU device. For example, `count` number of model copies are placed on each GPU device unless you explicitly specify which GPU devices should load the model.
+ [Dynamic batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) and [sequence batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#stateful-models): Dynamic batching is used for stateless models, and sequence batching is used for stateful models (where you want to route a request to the same model instance every time). Batching schedulers enable a per-model queue, which helps increase throughput, depending on the batching configuration.
+ [Ensemble models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models): An ensemble model represents a *pipeline* of one or more models and the connection of input and output tensors between those models. It can be configured by specifying `platform` as `ensemble`. The ensemble configuration is just a representation of the model pipeline. On SageMaker AI, all the models under an ensemble are treated as dependents of the ensemble model and are counted as a single model for SageMaker AI metrics, such as `LoadedModelCount`.
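As a sketch, an ensemble that chains a hypothetical preprocessing model into a classifier could be configured as follows (all model and tensor names are example values):

```
name: "ensemble_classifier"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [-1] }
]
output [
  { name: "SCORES", data_type: TYPE_FP32, dims: [1000] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT", value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "preprocessed" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "input_0", value: "preprocessed" }
      output_map { key: "output_0", value: "SCORES" }
    }
  ]
}
```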

## Publishing default Triton metrics to Amazon CloudWatch
<a name="deploy-models-frameworks-triton-metrics"></a>

The NVIDIA Triton Inference Container exposes metrics at port 8002 (configurable) for the different models and GPUs that are utilized in the Triton Inference Server. For full details of the default metrics that are available, see the GitHub page for the [Triton Inference Server metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md). These metrics are in Prometheus format and can be scraped using a Prometheus scraper configuration.

Starting with version v23.07, the SageMaker AI Triton container supports publishing these metrics to Amazon CloudWatch when you specify a few environment variables. To scrape the Prometheus metrics, the SageMaker AI Triton container uses the Amazon CloudWatch agent.

The required environment variables that you must specify to collect metrics are as follows:


| Environment variable | Description | Example value | 
| --- | --- | --- | 
|  `SAGEMAKER_TRITON_ALLOW_METRICS`  |  Specify this option to allow Triton to publish metrics to its Prometheus endpoint.  | "true" | 
|  `SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH`  |  Specify this option to start the pre-checks necessary to publish metrics to Amazon CloudWatch.  | "true" | 
|  `SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP`  |  Specify this option to point to the log group to which metrics are written.  | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesTest" | 
|  `SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE`  |  Specify this option to point to the metric namespace where you want to see and plot the metrics.  | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesPublicTest" | 
|  `SAGEMAKER_TRITON_METRICS_PORT`  |  Specify this as 8002, or any other port. If SageMaker AI has not blocked the specified port, it is used. Otherwise, another non-blocked port is chosen automatically.  | "8002" | 
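These variables are supplied through the container's `Environment` map when creating the SageMaker AI model. The following is a sketch; the log group, namespace, image URI, and role ARN are placeholder values:

```python
# Environment map for the Triton container; names are example values.
triton_metrics_env = {
    "SAGEMAKER_TRITON_ALLOW_METRICS": "true",
    "SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH": "true",
    "SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP": "/aws/SageMaker/Endpoints/TritonMetrics/MyEndpoint",
    "SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE": "TritonMetrics/MyEndpoint",
    "SAGEMAKER_TRITON_METRICS_PORT": "8002",
}

# With valid AWS credentials, the map would be attached to the model:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="triton-model",
#     ExecutionRoleArn=role_arn,
#     PrimaryContainer={"Image": triton_image_uri,
#                       "Environment": triton_metrics_env},
# )
```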

When publishing metrics with Triton on SageMaker AI, keep in mind the following limitations:
+ While you can generate custom metrics through the C-API and Python backend (v23.05 onwards), these are currently not supported for publishing to Amazon CloudWatch.
+ In SageMaker AI multi-model endpoints (MME) mode, Triton runs in an environment that requires model namespacing to be enabled, because each model (except ensemble models) is treated as if it were in its own model repository. Currently, this creates a limitation for metrics: when model namespacing is enabled, Triton does not distinguish the metrics between two models with the same name belonging to different ensembles. As a workaround, make sure that every model you deploy has a unique name. This also makes it easier to look up your metrics in CloudWatch.

## Environment variables
<a name="deploy-models-frameworks-triton-variables"></a>

The following table lists the supported environment variables for Triton on SageMaker AI.


| Environment variable | Description | Type | Possible values | 
| --- | --- | --- | --- | 
| `SAGEMAKER_MULTI_MODEL` | Allows Triton to operate in SageMaker AI multi-model endpoints mode. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` | Specify the model to be loaded in the SageMaker AI single model (default) mode. For ensemble mode, specify the name of the ensemble proper. | String | *<model_name>* as specified in `config.pbtxt` | 
| `SAGEMAKER_TRITON_PING_MODE` | `'ready'` is the default mode in SageMaker AI's single model mode, and `'live'` is the default in SageMaker AI's multi-model endpoints mode. | String | `ready`, `live` | 
| `SAGEMAKER_TRITON_DISABLE_MODEL_NAMESPACING` | In the SageMaker AI Triton container, this is set to `true` by default. | Boolean | `true`, `false` | 
| `SAGEMAKER_BIND_TO_PORT` | While on SageMaker AI, the default port is 8080. You can customize to a different port in multi-container scenarios. | String | *<port_number>* | 
| `SAGEMAKER_SAFE_PORT_RANGE` | This is set by the SageMaker AI platform when using multi-container mode. | String | *<port_1>*–*<port_2>* | 
| `SAGEMAKER_TRITON_ALLOW_GRPC` | While SageMaker AI doesn't support GRPC currently, if you're using Triton in front of a custom reverse proxy, you may choose to enable GRPC. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_GRPC_PORT` | The default port for GRPC is 8001, but you can change it. | String | *<port_number>* | 
| `SAGEMAKER_TRITON_THREAD_COUNT` | You can set the number of default HTTP request handler threads. | String | *<number>* | 
| `SAGEMAKER_TRITON_LOG_VERBOSE` | `true` by default on SageMaker AI, but you can selectively turn this option off. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_INFO` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_WARNING` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_ERROR` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE` | Specify the shm size for the Python backend, in bytes. The default value is 16 MB but can be increased. | String | *<number>* | 
| `SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE` | Specify the shm growth size for the Python backend, in bytes. The default value is 1 MB but can be increased to allow greater increments. | String | *<number>* | 
| `SAGEMAKER_TRITON_TENSORFLOW_VERSION` | The default value is `2`. Triton no longer supports TensorFlow 1 starting with Triton v23.04. You can configure this variable for previous versions. | String | *<number>* | 
| `SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT` | Restrict the maximum GPU memory percentage which is used for model loading, allowing the remainder to be used for the inference requests. | String | *<number>* | 
| `SAGEMAKER_TRITON_ALLOW_METRICS` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_METRICS_PORT` | The default port is 8002. | String | *<number>* | 
| `SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH` | `false` by default on SageMaker AI. Set this variable to `true` to allow pushing Triton default metrics to Amazon CloudWatch. If this option is enabled, you are responsible for CloudWatch costs when metrics are published to your account. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP` | Required if you've enabled metrics publishing to CloudWatch. | String | *<cloudwatch_log_group_name>* | 
| `SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE` | Required if you've enabled metrics publishing to CloudWatch. | String | *<cloudwatch_metric_namespace>* | 
| `SAGEMAKER_TRITON_ADDITIONAL_ARGS` | Appends any additional arguments when starting the Triton Server. | String | *<additional_args>* | 

# Model deployment at the edge with SageMaker Edge Manager
<a name="edge"></a>

**Warning**  
 SageMaker Edge Manager is being discontinued on April 26th, 2024. For more information about continuing to deploy your models to edge devices, see [SageMaker Edge Manager end of life](edge-eol.md). 

Amazon SageMaker Edge Manager provides model management for edge devices so you can optimize, secure, monitor, and maintain machine learning models on fleets of edge devices such as smart cameras, robots, personal computers, and mobile devices.

## Why Use Edge Manager?
<a name="edge-what-it-is"></a>

Many machine learning (ML) use cases require running ML models on a fleet of edge devices, which allows you to get predictions in real-time, preserves the privacy of the end users, and lowers the cost of network connectivity. With the increasing availability of low-power edge hardware designed for ML, it is now possible to run multiple complex neural network models on edge devices. 

However, operating ML models on edge devices is challenging, because devices, unlike cloud instances, have limited compute, memory, and connectivity. After the model is deployed, you need to continuously monitor it, because model drift can cause the quality of the model to decay over time. Monitoring models across your device fleets is difficult because you need to write custom code to collect data samples from your devices and recognize skew in predictions. In addition, models are often hard-coded into the application. To update the model, you must rebuild and update the entire application or device firmware, which can disrupt your operations.

With SageMaker Edge Manager, you can optimize, run, monitor, and update machine learning models across fleets of devices at the edge.

## How Does it Work?
<a name="edge-how-it-works"></a>

At a high level, there are five main components in the SageMaker Edge Manager workflow: compiling models with SageMaker Neo, packaging Neo-compiled models, deploying models to your devices, running models on the SageMaker AI inference engine (Edge Manager agent), and maintaining models on the devices.

![\[The five main components in the SageMaker Edge Manager workflow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/smith_overview.png)


SageMaker Edge Manager uses SageMaker Neo to optimize your models for the target hardware with one click, and then cryptographically signs your models before deployment. Using SageMaker Edge Manager, you can sample model input and output data from edge devices, send it to the cloud for monitoring and analysis, and view a dashboard in the SageMaker AI console that tracks and visually reports on the operation of the deployed models.

SageMaker Edge Manager extends capabilities that were previously only available in the cloud to the edge, so developers can continuously improve model quality by using Amazon SageMaker Model Monitor for drift detection, then relabel the data with SageMaker Ground Truth and retrain the models in SageMaker AI.

## How Do I Use SageMaker Edge Manager?
<a name="edge-how-to-use"></a>

If you are a first time user of SageMaker Edge Manager, we recommend that you do the following:

1. **Read the [Getting Started](https://docs.aws.amazon.com/sagemaker/latest/dg/edge-manager-getting-started.html) section** - This section walks you through setting up your first edge packaging job and creating your first fleet.

1. **Explore Edge Manager Jupyter notebook examples** - Example notebooks are stored in the [amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) GitHub repository in the [sagemaker_edge_manager](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_edge_manager) folder.

# First Steps with Amazon SageMaker Edge Manager
<a name="edge-manager-getting-started"></a>

This guide demonstrates how to complete the necessary steps to register, deploy, and manage a fleet of devices, and how to satisfy Amazon SageMaker Edge Manager prerequisites. 

**Topics**
+ [Setting Up](edge-getting-started-step1.md)
+ [Prepare Your Model for Deployment](edge-getting-started-step2.md)
+ [Register and Authenticate Your Device Fleet](edge-getting-started-step3.md)
+ [Download and Set Up Edge Manager](edge-getting-started-step4.md)
+ [Run Agent](edge-getting-started-step5.md)

# Setting Up
<a name="edge-getting-started-step1"></a>

Before you begin using SageMaker Edge Manager to manage models on your device fleets, you must first create IAM roles for both SageMaker AI and AWS IoT. You should also create at least one Amazon S3 bucket to store your pre-trained model, the output of your SageMaker Neo compilation job, and input data from your edge devices.

## Sign up for an AWS account
<a name="sign-up-for-aws"></a>

If you do not have an AWS account, complete the following steps to create one.

**To sign up for an AWS account**

1. Open [https://portal.aws.amazon.com/billing/signup](https://portal.aws.amazon.com/billing/signup).

1. Follow the online instructions.

   Part of the sign-up procedure involves receiving a phone call or text message and entering a verification code on the phone keypad.

   When you sign up for an AWS account, an *AWS account root user* is created. The root user has access to all AWS services and resources in the account. As a security best practice, assign administrative access to a user, and use only the root user to perform [tasks that require root user access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_root-user.html#root-user-tasks).

AWS sends you a confirmation email after the sign-up process is complete. At any time, you can view your current account activity and manage your account by going to [https://aws.amazon.com/](https://aws.amazon.com/) and choosing **My Account**.

## Create a user with administrative access
<a name="create-an-admin"></a>

After you sign up for an AWS account, secure your AWS account root user, enable AWS IAM Identity Center, and create an administrative user so that you don't use the root user for everyday tasks.

**Secure your AWS account root user**

1.  Sign in to the [AWS Management Console](https://console.aws.amazon.com/) as the account owner by choosing **Root user** and entering your AWS account email address. On the next page, enter your password.

   For help signing in by using root user, see [Signing in as the root user](https://docs.aws.amazon.com/signin/latest/userguide/console-sign-in-tutorials.html#introduction-to-root-user-sign-in-tutorial) in the *AWS Sign-In User Guide*.

1. Turn on multi-factor authentication (MFA) for your root user.

   For instructions, see [Enable a virtual MFA device for your AWS account root user (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/enable-virt-mfa-for-root.html) in the *IAM User Guide*.

**Create a user with administrative access**

1. Enable IAM Identity Center.

   For instructions, see [Enabling AWS IAM Identity Center](https://docs.aws.amazon.com//singlesignon/latest/userguide/get-set-up-for-idc.html) in the *AWS IAM Identity Center User Guide*.

1. In IAM Identity Center, grant administrative access to a user.

   For a tutorial about using the IAM Identity Center directory as your identity source, see [ Configure user access with the default IAM Identity Center directory](https://docs.aws.amazon.com//singlesignon/latest/userguide/quick-start-default-idc.html) in the *AWS IAM Identity Center User Guide*.

**Sign in as the user with administrative access**
+ To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email address when you created the IAM Identity Center user.

  For help signing in using an IAM Identity Center user, see [Signing in to the AWS access portal](https://docs.aws.amazon.com/signin/latest/userguide/iam-id-center-sign-in-tutorial.html) in the *AWS Sign-In User Guide*.

**Assign access to additional users**

1. In IAM Identity Center, create a permission set that follows the best practice of applying least-privilege permissions.

   For instructions, see [ Create a permission set](https://docs.aws.amazon.com//singlesignon/latest/userguide/get-started-create-a-permission-set.html) in the *AWS IAM Identity Center User Guide*.

1. Assign users to a group, and then assign single sign-on access to the group.

   For instructions, see [ Add groups](https://docs.aws.amazon.com//singlesignon/latest/userguide/addgroups.html) in the *AWS IAM Identity Center User Guide*.

## Create roles and storage
<a name="edge-getting-started-step1-create-role"></a>

SageMaker Edge Manager needs access to your Amazon S3 bucket URI. To facilitate this, create an IAM role that can run SageMaker AI and has permission to access Amazon S3. Using this role, SageMaker AI can run under your account and access your Amazon S3 bucket.

You can create an IAM role by using the IAM console, AWS SDK for Python (Boto3), or AWS CLI. The following is an example of how to create an IAM role, attach the necessary policies with the IAM console, and create an Amazon S3 bucket.

1. **Create an IAM role for Amazon SageMaker AI.**

   1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. In the navigation pane of the IAM console, choose **Roles**, and then choose **Create role**.

   1. For **Select type of trusted entity**, choose **AWS service**.

   1. Choose the service that you want to allow to assume this role. In this case, choose **SageMaker AI**. Then choose **Next: Permissions**.
      + This automatically creates an IAM policy that grants access to related services such as Amazon S3, Amazon ECR, and CloudWatch Logs.

   1. Choose **Next: Tags**.

   1. (Optional) Add metadata to the role by attaching tags as key–value pairs. For more information about using tags in IAM, see [Tagging IAM resources](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_tags.html).

   1. Choose **Next: Review**.

   1. Type in a **Role name**. Role names must be unique within your AWS account and are not distinguished by case; for example, you cannot create roles named both `PRODROLE` and `prodrole`. Because other AWS resources might reference the role, you cannot change the name of the role after it has been created.

   1. (Optional) For **Role description**, type a description for the new role.

   1. Review the role and then choose **Create role**.

      Note the SageMaker AI Role ARN, which you use to create a compilation job with SageMaker Neo and a packaging job with Edge Manager. To find out the role ARN using the console, do the following:

      1. Go to the IAM console: [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)

      1. Select **Roles**.

      1. Search for the role you just created by typing in the name of the role in the search field.

      1. Select the role.

      1. The role ARN is at the top of the **Summary** page.

1. **Create an IAM role for AWS IoT.**

   The AWS IoT IAM role you create is used to authorize your thing objects. You also use the IAM role ARN to create and register device fleets with a SageMaker AI client object.

   Configure an IAM role in your AWS account for the credentials provider to assume on behalf of the devices in your device fleet. Then, attach a policy to authorize your devices to interact with AWS IoT services.

   Create a role for AWS IoT either programmatically or with the IAM console, similar to what you did when you created a role for SageMaker AI.

   1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. In the navigation pane of the IAM console, choose **Roles**, and then choose **Create role**.

   1. For **Select type of trusted entity**, choose **AWS service**.

   1. Choose the service that you want to allow to assume this role. In this case, choose **IoT**. Select **IoT** as the **Use Case**.

   1. Choose **Next: Permissions**.

   1. Choose **Next: Tags**.

   1. (Optional) Add metadata to the role by attaching tags as key–value pairs. For more information about using tags in IAM, see [Tagging IAM resources](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_tags.html).

   1. Choose **Next: Review**.

   1. Type in a **Role name**. The role name must start with `SageMaker`.

   1. (Optional) For **Role description**, type a description for the new role.

   1. Review the role and then choose **Create role**.

   1. Once the role is created, choose **Roles** in the IAM console. Search for the role you created by typing the role name in the **Search** field.

   1. Choose your role.

   1. Next, choose **Attach Policies**.

   1. Search for `AmazonSageMakerEdgeDeviceFleetPolicy` in the **Search** field. Select `AmazonSageMakerEdgeDeviceFleetPolicy`.

   1. Choose **Attach policy**.

   1. Add the following policy statement to the trust relationship:

      ```
      {
        "Version":"2012-10-17",		 	 	 
        "Statement": [
            {
              "Effect": "Allow",
              "Principal": {"Service": "credentials.iot.amazonaws.com"},
              "Action": "sts:AssumeRole"
            },
            {
              "Effect": "Allow",
              "Principal": {"Service": "sagemaker.amazonaws.com"},
              "Action": "sts:AssumeRole"
            }
        ]
      }
      ```


      A trust policy is a [JSON policy document](https://docs.aws.amazon.com//IAM/latest/UserGuide/reference_policies_grammar) in which you define the principals that you trust to assume the role. For more information about trust policies, see [Roles terms and concepts](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html).

   1. Note the AWS IoT role ARN. You use the AWS IoT Role ARN to create and register the device fleet. To find the IAM role ARN with the console:

      1. Go to the IAM console: [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)

      1. Choose **Roles**.

      1. Search for the role you created by typing in the name of the role in the **Search** field.

      1. Select the role.

      1. The role ARN is on the Summary page.

1. **Create an Amazon S3 bucket.**

   SageMaker Neo and Edge Manager access your pre-compiled model and compiled model from an Amazon S3 bucket. Edge Manager also stores sample data from your device fleet in Amazon S3.

   1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

   1. Choose **Create bucket**.

   1. In **Bucket name**, enter a name for your bucket.

   1. In **Region**, choose the AWS Region where you want the bucket to reside.

   1. In **Bucket settings for Block Public Access**, choose the settings that you want to apply to the bucket.

   1. Choose **Create bucket**.

   For more information about creating Amazon S3 buckets, see [Getting started with Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html).

# Prepare Your Model for Deployment
<a name="edge-getting-started-step2"></a>

In this section you will create SageMaker AI and AWS IoT client objects, download a pre-trained machine learning model, upload your model to your Amazon S3 bucket, compile your model for your target device with SageMaker Neo, and package your model so that it can be deployed with the Edge Manager agent.

1. **Import libraries and create client objects.**

   This tutorial uses the AWS SDK for Python (Boto3) to create clients to interact with SageMaker AI, Amazon S3, and AWS IoT.

   Import Boto3, specify your Region, and initialize the client objects you need as shown in the following example:

   ```
   import boto3
   import json
   import time
   
   AWS_REGION = 'us-west-2'  # Specify your Region
   bucket = 'bucket-name'    # Specify the name of your S3 bucket
   
   sagemaker_client = boto3.client('sagemaker', region_name=AWS_REGION)
   iot_client = boto3.client('iot', region_name=AWS_REGION)
   ```

   Define variables and assign them the role ARN you created for SageMaker AI and AWS IoT as strings:

   ```
   # Replace with the role ARN you created for SageMaker
   sagemaker_role_arn = "arn:aws:iam::<account>:role/*"
   
   # Replace with the role ARN you created for AWS IoT. 
   # Note: The name must start with 'SageMaker'
   iot_role_arn = "arn:aws:iam::<account>:role/SageMaker*"
   ```

1. **Train a machine learning model.**

   See [Train a Model with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) for more information on how to train a machine learning model using SageMaker AI. You can optionally upload a locally trained model directly to an Amazon S3 bucket.

   If you do not have a model yet, you can use a pre-trained model for the next steps in this tutorial. For example, you can save the MobileNet V2 model from the TensorFlow framework. MobileNet V2 is an image classification model optimized for mobile applications. For more information about MobileNet V2, see the [MobileNet GitHub README](https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet).

   Type the following into your Jupyter Notebook to save the pre-trained MobileNet V2 model:

   ```
   # Save the MobileNet V2 model to local storage
   import tensorflow as tf

   model = tf.keras.applications.MobileNetV2()
   model.save("mobilenet_v2.h5")
   ```
**Note**  
If you do not have TensorFlow installed, you can install it by running `pip install tensorflow==2.4`
Use TensorFlow version 2.4 or lower for this tutorial.

   The model is saved to the `mobilenet_v2.h5` file. Before packaging the model, you must first compile it using SageMaker Neo. See [Supported Frameworks, Devices, Systems, and Architectures](neo-supported-devices-edge.md) to check whether your version of TensorFlow (or other framework of choice) is currently supported by SageMaker Neo.

   SageMaker Neo requires models to be stored as a compressed TAR file. Repackage the model as a compressed TAR file (`*.tar.gz`):

   ```
   # Package the MobileNet V2 model into a TAR file
   import tarfile

   tarfile_name = 'mobilenet-v2.tar.gz'

   with tarfile.open(tarfile_name, mode='w:gz') as archive:
       archive.add('mobilenet_v2.h5')
   ```

1. **Upload your model to Amazon S3.**

   Once you have a machine learning model, store it in an Amazon S3 bucket. The following example uses an AWS CLI command to upload the model to the Amazon S3 bucket you created earlier, in a directory called *models*. Type the following into your Jupyter Notebook:

   ```
   !aws s3 cp mobilenet-v2.tar.gz s3://{bucket}/models/
   ```

1. **Compile your model with SageMaker Neo.**

   Compile your machine learning model with SageMaker Neo for an edge device. You need to know your Amazon S3 bucket URI where you stored the trained model, the machine learning framework you used to train your model, the shape of your model’s input, and your target device.

   For the MobileNet V2 model, use the following:

   ```
   framework = 'tensorflow'
   target_device = 'jetson_nano'
   data_shape = '{"data":[1,3,224,224]}'
   ```

   SageMaker Neo requires a specific model input shape and model format based on the deep learning framework you use. For more information about how to save your model, see [What input data shapes does SageMaker Neo expect?](neo-compilation-preparing-model.md#neo-job-compilation-expected-inputs). For more information about devices and frameworks supported by Neo, see [Supported Frameworks, Devices, Systems, and Architectures](neo-supported-devices-edge.md).

   Use the `CreateCompilationJob` API to create a compilation job with SageMaker Neo. Provide a name to the compilation job, the SageMaker AI Role ARN, the Amazon S3 URI where your model is stored, the input shape of the model, the name of the framework, the Amazon S3 URI where you want SageMaker AI to store your compiled model, and your edge device target.

   ```
   # Specify the path where your model is stored
   model_directory = 'models'
   s3_model_uri = 's3://{}/{}/{}'.format(bucket, model_directory, tarfile_name)
   
   # Store compiled model in S3 within the 'compiled-models' directory
   compilation_output_dir = 'compiled-models'
   s3_output_location = 's3://{}/{}/'.format(bucket, compilation_output_dir)
   
   # Give your compilation job a name
   compilation_job_name = 'getting-started-demo'
   
   sagemaker_client.create_compilation_job(CompilationJobName=compilation_job_name,
                                           RoleArn=sagemaker_role_arn,
                                           InputConfig={
                                               'S3Uri': s3_model_uri,
                                               'DataInputConfig': data_shape,
                                               'Framework' : framework.upper()},
                                           OutputConfig={
                                               'S3OutputLocation': s3_output_location,
                                               'TargetDevice': target_device},
                                           StoppingCondition={'MaxRuntimeInSeconds': 900})
   ```

1. **Package your compiled model.**

   Packaging jobs take SageMaker Neo–compiled models and make any changes necessary to deploy the model with the inference engine, the Edge Manager agent. To package your model, create an edge packaging job with the `create_edge_packaging_job` API or the SageMaker AI console.

   You need to provide the name that you used for your Neo compilation job, a name for the packaging job, a role ARN (see the [Setting Up](edge-getting-started-step1.md) section), a name for the model, a model version, and the Amazon S3 bucket URI for the output of the packaging job. Note that Edge Manager packaging job names are case-sensitive. The following is an example of how to create a packaging job using the API.

   ```
   edge_packaging_name='edge-packaging-demo'
   model_name="sample-model"
   model_version="1.1"
   ```

   Define the Amazon S3 URI where you want to store the packaged model.

   ```
   # Output directory where you want to store the output of the packaging job
   packaging_output_dir = 'packaged_models'
   packaging_s3_output = 's3://{}/{}'.format(bucket, packaging_output_dir)
   ```

   Use `CreateEdgePackagingJob` to package your Neo-compiled model. Provide a name for your edge packaging job and the name you provided for your compilation job (in this example, it was stored in the variable `compilation_job_name`). Also provide a name for your model, a version for your model (this is used to help you keep track of what model version you are using), and the S3 URI where you want SageMaker AI to store the packaged model.

   ```
   sagemaker_client.create_edge_packaging_job(
                       EdgePackagingJobName=edge_packaging_name,
                       CompilationJobName=compilation_job_name,
                       RoleArn=sagemaker_role_arn,
                       ModelName=model_name,
                       ModelVersion=model_version,
                       OutputConfig={
                           "S3OutputLocation": packaging_s3_output
                           }
                       )
   ```
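Both the compilation and edge packaging jobs run asynchronously. The following sketch (assuming the `sagemaker_client`, `compilation_job_name`, and `edge_packaging_name` variables defined above) shows one way to poll the corresponding `Describe` APIs until each job finishes; the helper names are illustrative, not part of the SDK:

```python
import time

def wait_for_compilation(sagemaker_client, job_name, poll_seconds=30):
    # Poll DescribeCompilationJob until the job leaves its in-progress states.
    while True:
        response = sagemaker_client.describe_compilation_job(
            CompilationJobName=job_name)
        status = response["CompilationJobStatus"]
        if status == "COMPLETED":
            return response
        if status in ("FAILED", "STOPPED"):
            raise RuntimeError("Compilation job ended with status {}: {}".format(
                status, response.get("FailureReason", "")))
        time.sleep(poll_seconds)

def packaging_job_succeeded(sagemaker_client, job_name):
    # DescribeEdgePackagingJob reports the packaging job status; once
    # COMPLETED, the response also includes the Amazon S3 URI of the
    # packaged model artifact.
    response = sagemaker_client.describe_edge_packaging_job(
        EdgePackagingJobName=job_name)
    return response["EdgePackagingJobStatus"] == "COMPLETED"
```

For example, `wait_for_compilation(sagemaker_client, compilation_job_name)` blocks until SageMaker Neo reports `COMPLETED`, and `packaging_job_succeeded(sagemaker_client, edge_packaging_name)` confirms that the packaging job finished.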

# Register and Authenticate Your Device Fleet
<a name="edge-getting-started-step3"></a>

In this section, you create an AWS IoT thing object, create a device fleet, register your device so it can interact with the cloud, create X.509 certificates to authenticate your devices to AWS IoT Core, associate with AWS IoT the role alias that was generated when you created your fleet, get your AWS account-specific endpoint for the credentials provider, download the official Amazon root CA file, and upload it to Amazon S3.

1. **Create AWS IoT things.**

   SageMaker Edge Manager takes advantage of the AWS IoT Core services to facilitate the connection between the edge devices and endpoints in the AWS cloud. You can take advantage of existing AWS IoT functionality after you set up your devices to work with Edge Manager.

   To connect your device to AWS IoT, you need to create AWS IoT *thing objects*, create and register a client certificate with AWS IoT, and create and configure the IAM role for your devices.

   First, create AWS IoT thing objects with the AWS IoT client (`iot_client`) you created earlier with Boto3. The following example shows how to create two thing objects:

   ```
   iot_thing_name = 'sample-device'
   iot_thing_type = 'getting-started-demo'
   
   iot_client.create_thing_type(
       thingTypeName=iot_thing_type
   )
   
   # Create an AWS IoT thing object
   iot_client.create_thing(
       thingName=iot_thing_name,
       thingTypeName=iot_thing_type
   )
   ```

1. **Create your device fleet.**

   Create a device fleet with the SageMaker AI client object defined in a previous step. You can also use the SageMaker AI console to create a device fleet.

   ```
   import time
   device_fleet_name="demo-device-fleet" + str(time.time()).split('.')[0]
   device_name="sagemaker-edge-demo-device" + str(time.time()).split('.')[0]
   ```

   Specify your IoT role ARN. This lets AWS IoT grant temporary credentials to devices.

   ```
   device_model_directory='device_output'
   s3_device_fleet_output = 's3://{}/{}'.format(bucket, device_model_directory)
   
   sagemaker_client.create_device_fleet(
       DeviceFleetName=device_fleet_name,
       RoleArn=iot_role_arn, # IoT Role ARN specified in previous step
       OutputConfig={
           'S3OutputLocation': s3_device_fleet_output
       }
   )
   ```

   An AWS IoT role alias is created when you create a device fleet. This role alias is associated with AWS IoT using the `iot_client` object in a later step.

1. **Register your device fleet.**

   To interact with the cloud, you need to register your device with SageMaker Edge Manager. In this example, you register a single device with the fleet you created. To register the device, you need to provide a device name and the AWS IoT thing name as shown in the following example:

   ```
   # Device name should be 36 characters
   device_name = "sagemaker-edge-demo-device" + str(time.time()).split('.')[0]
   
   sagemaker_client.register_devices(
       DeviceFleetName=device_fleet_name,
       Devices=[
           {
               "DeviceName": device_name,
               "IotThingName": iot_thing_name
           }
       ]
   )
   ```

1. **Create X.509 certificates.**

   After creating the AWS IoT thing object, you must create an X.509 device certificate for your thing object. This certificate authenticates your device to AWS IoT Core.

   Use the following to create a private key, a public key, and an X.509 certificate file using the AWS IoT client (`iot_client`) defined earlier.

   ```
   # Create a 2048-bit RSA key pair and issue an X.509
   # certificate using the issued public key.
   create_cert = iot_client.create_keys_and_certificate(
       setAsActive=True 
   )
   
   # Get the certificate from the response and save it in its own file
   with open('./device.pem.crt', 'w') as f:
       for line in create_cert['certificatePem'].split('\n'):
           f.write(line)
           f.write('\n')
   # Get the private key from the response and save it in its own file
   with open('./private.pem.key', 'w') as f:
       for line in create_cert['keyPair']['PrivateKey'].split('\n'):
           f.write(line)
           f.write('\n')
   # Get the public key from the response and save it in its own file
   with open('./public.pem.key', 'w') as f:
       for line in create_cert['keyPair']['PublicKey'].split('\n'):
           f.write(line)
           f.write('\n')
   ```

1. **Associate the role alias with AWS IoT.**

   When you create a device fleet with SageMaker AI (`sagemaker_client.create_device_fleet()`), a role alias is generated for you. An AWS IoT role alias provides a mechanism for connected devices to authenticate to AWS IoT using X.509 certificates, and then obtain short-lived AWS credentials from an IAM role that is associated with an AWS IoT role alias. The role alias allows you to change the role of the device without having to update the device. Use `DescribeDeviceFleet` to get the role alias name and ARN.

   ```
   # Print Amazon Resource Name (ARN) and alias that has access 
   # to AWS Internet of Things (IoT).
   sagemaker_client.describe_device_fleet(DeviceFleetName=device_fleet_name)
   
   # Store iot role alias string in a variable
   # Grab the role alias ARN
   full_role_alias_name = sagemaker_client.describe_device_fleet(DeviceFleetName=device_fleet_name)['IotRoleAlias']
   start_index = full_role_alias_name.find('SageMaker') # Find beginning of role name  
   role_alias_name = full_role_alias_name[start_index:]
   ```

   Use the `iot_client` to look up the role alias that was generated when you created the device fleet:

   ```
   role_alias = iot_client.describe_role_alias(
                       roleAlias=role_alias_name)
   ```

   For more information about IAM role aliases, see [Role alias allows access to unused services](https://docs.aws.amazon.com/iot/latest/developerguide/audit-chk-role-alias-unused-svcs.html).

   You created and registered a certificate with AWS IoT earlier for successful authentication of your device. Now, you need to create and attach a policy to the certificate to authorize the request for the security token.

   ```
   alias_policy = {
     "Version": "2012-10-17",		 	 	 
     "Statement": {
       "Effect": "Allow",
       "Action": "iot:AssumeRoleWithCertificate",
       "Resource": role_alias['roleAliasDescription']['roleAliasArn']
     }
   }
   
   policy_name = 'aliaspolicy-'+ str(time.time()).split('.')[0]
   aliaspolicy = iot_client.create_policy(policyName=policy_name,
                                          policyDocument=json.dumps(alias_policy))
   
   # Attach policy
   iot_client.attach_policy(policyName=policy_name,
                               target=create_cert['certificateArn'])
   ```

1. **Get your AWS account-specific endpoint for the credentials provider.**

   Edge devices need an endpoint in order to assume credentials. Obtain your AWS account-specific endpoint for the credentials provider.

   ```
   # Get the unique endpoint specific to your AWS account that is making the call.
   iot_endpoint = iot_client.describe_endpoint(
       endpointType='iot:CredentialProvider'
   )
   
   endpoint="https://{}/role-aliases/{}/credentials".format(iot_endpoint['endpointAddress'],role_alias_name)
   ```

1. **Get the official Amazon root CA file and upload it to the Amazon S3 bucket.**

   Use the following in your Jupyter Notebook or AWS CLI (if you use your terminal, remove the `!` prefix):

   ```
   !wget https://www.amazontrust.com/repository/AmazonRootCA1.pem
   ```

   Use the endpoint to make an HTTPS request to the credentials provider to return a security token. The following example command uses `curl`, but you can use any HTTP client.

   ```
   !curl --cert device.pem.crt --key private.pem.key --cacert AmazonRootCA1.pem $endpoint
   ```

   If the certificate is verified, upload the keys and certificate to your Amazon S3 bucket URI:

   ```
   !aws s3 cp private.pem.key s3://{bucket}/authorization-files/
   !aws s3 cp device.pem.crt s3://{bucket}/authorization-files/
   !aws s3 cp AmazonRootCA1.pem s3://{bucket}/authorization-files/
   ```

   Clean your working directory by moving your keys and certificate to a different directory:

   ```
   # Optional - Clean up working directory
   !mkdir authorization-files
   !mv private.pem.key device.pem.crt AmazonRootCA1.pem authorization-files/
   ```
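The same credentials request shown earlier with `curl` can also be made from Python using only the standard library. The following is a sketch under the assumption that the certificate, key, and root CA files created above are in the working directory; the function names are illustrative:

```python
import http.client
import json
import ssl

def credentials_path(role_alias_name):
    # The credentials provider path has a fixed shape:
    # /role-aliases/<role-alias-name>/credentials
    return "/role-aliases/{}/credentials".format(role_alias_name)

def fetch_edge_credentials(endpoint_address, role_alias_name,
                           cert_file="device.pem.crt",
                           key_file="private.pem.key",
                           ca_file="AmazonRootCA1.pem"):
    # Mutual TLS: the device certificate and private key authenticate the
    # device, and the Amazon root CA verifies the server.
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    connection = http.client.HTTPSConnection(endpoint_address, context=context)
    connection.request("GET", credentials_path(role_alias_name))
    body = connection.getresponse().read()
    connection.close()
    return json.loads(body)
```

For example, `fetch_edge_credentials(iot_endpoint['endpointAddress'], role_alias_name)` returns the temporary security token as a dictionary.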

# Download and Set Up Edge Manager
<a name="edge-getting-started-step4"></a>

The Edge Manager agent is an inference engine for your edge devices. Use the agent to make predictions with models loaded onto your edge devices. The agent also collects model metrics and captures data at specific intervals.



In this section you will set up your device with the agent. To do so, first copy a release artifact and signing root certificate from the release bucket locally to your machine. After you unzip the release artifact, upload it to Amazon S3. Next, define and save a configuration file for the agent. A template is provided for you to copy and paste. Finally, copy the release artifacts, configuration file, and credentials to your device.

1. **Download the SageMaker Edge Manager agent.**

   The agent is released in binary format for supported operating systems. This example runs inference on a Jetson Nano, which uses a Linux operating system and has an ARM64 architecture. For more information about the operating systems and architectures that supported devices use, see [Supported Devices, Chip Architectures, and Systems](neo-supported-devices-edge-devices.md).

   Fetch the latest version of binaries from the SageMaker Edge Manager release bucket from the us-west-2 Region.

   ```
   !aws s3 ls s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Releases/ | sort -r
   ```

   This returns release artifacts sorted by their version.

   ```
                              PRE 1.20210512.96da6cc/
                              PRE 1.20210305.a4bc999/
                              PRE 1.20201218.81f481f/
                              PRE 1.20201207.02d0e97/
   ```

   The version has the following format: `<MAJOR_VERSION>.<YYYY-MM-DD>.<SHA-7>`. It consists of three components:
   + `<MAJOR_VERSION>`: The release version. The release version is currently set to `1`.
   + `<YYYY-MM-DD>`: The time stamp of the artifact release.
   + `<SHA-7>`: The repository commit ID from which the release is built.

   Copy the zipped TAR file locally or to your device directly. The following example shows how to copy the latest release artifact at the time this document was released.

   ```
   !aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Releases/1.20201218.81f481f/1.20201218.81f481f.tgz ./
   ```

   Once you have the artifact, extract the compressed TAR file. The following extracts the TAR file and stores it in a directory called `agent_demo`:

   ```
   !mkdir agent_demo
   !tar -xvzf 1.20201218.81f481f.tgz -C ./agent_demo
   ```

   Upload the agent release artifacts to your Amazon S3 bucket. The following code example copies the content within `agent_demo` and uploads it to a directory within your Amazon S3 bucket called `agent_demo`:

   ```
   !aws s3 cp --recursive ./agent_demo s3://{bucket}/agent_demo
   ```

   You also need the signing root certificates from the release bucket:

   ```
   !aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Certificates/us-west-2/us-west-2.pem ./
   ```

   Upload the signing root certificate to your Amazon S3 bucket:

   ```
   !aws s3 cp us-west-2.pem s3://{bucket}/authorization-files/
   ```

1. **Define a SageMaker Edge Manager agent configuration file.**

   First, define the agent configuration file as follows:

   ```
   sagemaker_edge_config = {
       "sagemaker_edge_core_device_name": "device_name",
       "sagemaker_edge_core_device_fleet_name": "device_fleet_name",
       "sagemaker_edge_core_capture_data_buffer_size": 30,
       "sagemaker_edge_core_capture_data_push_period_seconds": 4,
       "sagemaker_edge_core_folder_prefix": "demo_capture",
       "sagemaker_edge_core_region": "us-west-2",
       "sagemaker_edge_core_root_certs_path": "/agent_demo/certificates",
       "sagemaker_edge_provider_aws_ca_cert_file": "/agent_demo/iot-credentials/AmazonRootCA1.pem",
       "sagemaker_edge_provider_aws_cert_file": "/agent_demo/iot-credentials/device.pem.crt",
       "sagemaker_edge_provider_aws_cert_pk_file": "/agent_demo/iot-credentials/private.pem.key",
       "sagemaker_edge_provider_aws_iot_cred_endpoint": "endpoint",
       "sagemaker_edge_provider_provider": "Aws",
       "sagemaker_edge_provider_s3_bucket_name": bucket,
       "sagemaker_edge_core_capture_data_destination": "Cloud"
   }
   ```

   Replace the following:
   + `"device_name"` with the name of your device (this string was stored in an earlier step in a variable named `device_name`).
   + `"device_fleet_name`" with the name of your device fleet (this string was stored an earlier step in a variable named `device_fleet_name`)
   + `"endpoint"` with your AWS account-specific endpoint for the credentials provider (this string was stored in an earlier step in a variable named `endpoint`).

   Next, save it as a JSON file:

   ```
   edge_config_file = open("sagemaker_edge_config.json", "w")
   json.dump(sagemaker_edge_config, edge_config_file, indent = 6)
   edge_config_file.close()
   ```

   Upload the configuration file to your Amazon S3 bucket:

   ```
   !aws s3 cp sagemaker_edge_config.json s3://{bucket}/
   ```

1. **Copy the release artifacts, configuration file, and credentials to your device.**

   The following instructions are performed on the edge device itself.
**Note**  
You must first install Python, the AWS SDK for Python (Boto3), and the AWS CLI on your edge device. 

   Open a terminal on your device. Create a folder to store the release artifacts, your credentials, and the configuration file.

   ```
   mkdir agent_demo
   cd agent_demo
   ```

   Copy the contents of the release artifacts that you stored in your Amazon S3 bucket to your device:

   ```
   # Copy release artifacts 
   aws s3 cp s3://<bucket-name>/agent_demo/ ./ --recursive
   ```

   (The contents of the release artifact were stored in a directory called `agent_demo` in a previous step.) Replace `<bucket-name>` and `agent_demo` with the name of your Amazon S3 bucket and the file path to your release artifacts, respectively.

   Go to the `/bin` directory and make the binary files executable:

   ```
   cd bin
   
   chmod +x sagemaker_edge_agent_binary
   chmod +x sagemaker_edge_agent_client_example
   
   cd ../
   ```

   Make a directory to store your AWS IoT credentials and copy your credentials from your Amazon S3 bucket to your edge device (use the same bucket name you defined in the variable `bucket`):

   ```
   mkdir iot-credentials
   cd iot-credentials
   
   aws s3 cp s3://<bucket-name>/authorization-files/AmazonRootCA1.pem ./
   aws s3 cp s3://<bucket-name>/authorization-files/device.pem.crt ./
   aws s3 cp s3://<bucket-name>/authorization-files/private.pem.key ./
   
   cd ../
   ```

   Make a directory to store your model signing root certificates:

   ```
   mkdir certificates
   
   cd certificates
   
   aws s3 cp s3://<bucket-name>/authorization-files/us-west-2.pem ./
   
   cd ../
   ```

   Copy your configuration file to your device:

   ```
   # Download config file from S3
   aws s3 cp s3://<bucket-name>/sagemaker_edge_config.json ./
   ```

   Your `agent_demo` directory on your edge device should look similar to the following:

   ```
   ├──agent_demo
   |    ├── bin
   |        ├── sagemaker_edge_agent_binary
   |        └── sagemaker_edge_agent_client_example
   |    ├── sagemaker_edge_config.json
   |    ├── certificates
   |        └──us-west-2.pem
   |    ├── iot-credentials
   |        ├── AmazonRootCA1.pem
   |        ├── device.pem.crt
   |        └── private.pem.key
   |    ├── docs
   |        ├── api
   |        └── examples
   |    ├── ATTRIBUTIONS.txt
   |    ├── LICENSE.txt  
   |    └── RELEASE_NOTES.md
   ```
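Before launching the agent, it can help to confirm that the certificate paths referenced in `sagemaker_edge_config.json` exist on the device. The following optional sketch uses the key names from the configuration template above; the helper name is illustrative:

```python
import json
import os

def missing_config_files(config_path):
    # Return the configuration keys whose file paths do not exist on disk.
    with open(config_path) as f:
        config = json.load(f)
    file_keys = [
        "sagemaker_edge_provider_aws_ca_cert_file",
        "sagemaker_edge_provider_aws_cert_file",
        "sagemaker_edge_provider_aws_cert_pk_file",
    ]
    return [key for key in file_keys if not os.path.isfile(config[key])]
```

An empty list from `missing_config_files("sagemaker_edge_config.json")` means all three credential files are where the configuration says they are.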

# Run Agent
<a name="edge-getting-started-step5"></a>

In this section you will run the agent as a binary using gRPC, and check that both your device and fleet are working and collecting sample data.

1. **Launch the agent.**

   The SageMaker Edge Manager agent can run as a standalone process in the form of an Executable and Linkable Format (ELF) executable binary, or it can be linked against as a dynamic shared object (`.so`). Running as a standalone executable binary is the preferred mode and is supported on Linux.

   This example uses gRPC to run the agent. gRPC is an open source high-performance Remote Procedure Call (RPC) framework that can run in any environment. For more information about gRPC, see the [gRPC documentation](https://grpc.io/docs/).

   To use gRPC, perform the following steps: 

   1. Define a service in a .proto file.

   1. Generate server and client code using the protocol buffer compiler.

   1. Use the Python (or other languages supported by gRPC) gRPC API to write the server for your service.

   1. Use the Python (or other languages supported by gRPC) gRPC API to write a client for your service. 

   The release artifact you downloaded contains a gRPC application ready for you to run the agent. The example is located within the `/bin` directory of your release artifact. The `sagemaker_edge_agent_binary` binary executable is in this directory.

   To run the agent with this example, provide the path to your socket file (.sock) and JSON .config file:

   ```
   ./bin/sagemaker_edge_agent_binary -a /tmp/sagemaker_edge_agent_example.sock -c sagemaker_edge_config.json
   ```

1. **Check your device.**

   Check that your device is connected and sampling data. Making periodic checks, manually or automatically, helps you confirm that your device or fleet is working properly.

   Provide the name of the fleet to which the device belongs and the unique device identifier. From your local machine, run the following:

   ```
   sagemaker_client.describe_device(
       DeviceName=device_name,
       DeviceFleetName=device_fleet_name
   )
   ```

   For the given model, you can see the name, model version, latest sample time, and when the last inference was made.

   ```
   { 
     "DeviceName": "sample-device",
     "DeviceFleetName": "demo-device-fleet",
     "IoTThingName": "sample-thing-name-1",
     "RegistrationTime": 1600977370,
     "LatestHeartbeat": 1600977370,
     "Models":[
       {
           "ModelName": "mobilenet_v2.tar.gz", 
           "ModelVersion": "1.1",
           "LatestSampleTime": 1600977370,
           "LatestInference": 1600977370 
       }
     ]
   }
   ```

   The timestamp provided by `LatestHeartbeat` indicates the last signal that was received from the device. `LatestSampleTime` and `LatestInference` describe the time stamps of the last data sample and inference, respectively.

1. **Check your fleet.**

   Check that your fleet is working with `GetDeviceFleetReport`. Provide the name of the fleet the device belongs to.

   ```
   sagemaker_client.get_device_fleet_report(
       DeviceFleetName=device_fleet_name
   )
   ```

   For a given model, you can see the name, model version, latest sample time, and when the last inference was made, along with the Amazon S3 bucket URI where the data samples are stored.

   ```
   # Sample output
   {
     "DeviceFleetName": "sample-device-fleet",
     "DeviceFleetArn": "arn:aws:sagemaker:us-west-2:9999999999:device-fleet/sample-fleet-name",
     "OutputConfig": {
       "S3OutputLocation": "s3://fleet-bucket/package_output"
     },
     "AgentVersions": [{"Version": "1.1", "AgentCount": 2}],
     "DeviceStats": {"Connected": 2, "Registered": 2},
     "Models": [{
       "ModelName": "sample-model",
       "ModelVersion": "1.1",
       "OfflineDeviceCount": 0,
       "ConnectedDeviceCount": 2,
       "ActiveDeviceCount": 2,
       "SamplingDeviceCount": 100
     }]
   }
   ```

# Setup for Devices and Fleets in SageMaker Edge Manager
<a name="edge-device-fleet"></a>

Fleets are collections of logically grouped devices you can use to collect and analyze data. You can use SageMaker Edge Manager to operate machine learning models on a fleet of smart cameras, smart speakers, robots, and other edge devices.

Create a fleet and register your devices either programmatically with the AWS SDK for Python (Boto3) or through the SageMaker AI console.

**Topics**
+ [Create a Fleet](edge-device-fleet-create.md)
+ [Register a Device](edge-device-fleet-register.md)
+ [Check Status](edge-device-fleet-check-status.md)

# Create a Fleet
<a name="edge-device-fleet-create"></a>

You can create a fleet programmatically with the AWS SDK for Python (Boto3) or through the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker/).

## Create a Fleet (Boto3)
<a name="edge-device-fleet-create-boto3"></a>

Use the `CreateDeviceFleet` API to create a fleet. Specify a name for the fleet, your AWS IoT Role ARN for the `RoleArn` field, as well as an Amazon S3 URI where you want the device to store sampled data.

You can optionally include a description of the fleet, tags, and an AWS KMS Key ID.

```
import boto3

# Create a SageMaker client so you can interact with and manage SageMaker resources
sagemaker_client = boto3.client("sagemaker", region_name="aws-region")

sagemaker_client.create_device_fleet(
    DeviceFleetName="sample-fleet-name",
    RoleArn="arn:aws:iam::999999999:role/rolename", # IoT Role ARN
    Description="fleet description",
    OutputConfig={
        "S3OutputLocation": "s3://bucket/",
        "KMSKeyId": "1234abcd-12ab-34cd-56ef-1234567890ab",
    },
    Tags=[
        {
            "Key": "string",
            "Value": "string"
        }
    ],
)
```

An AWS IoT Role Alias is created for you when you create a device fleet. The AWS IoT role alias provides a mechanism for connected devices to authenticate to AWS IoT using X.509 certificates and then obtain short-lived AWS credentials from an IAM role that is associated with the AWS IoT role alias.

Use `DescribeDeviceFleet` to get the role alias name and ARN.

```
# Print Amazon Resource Name (ARN) and alias that has access 
# to AWS Internet of Things (IoT).
sagemaker_client.describe_device_fleet(DeviceFleetName=device_fleet_name)['IotRoleAlias']
```

Use the `DescribeDeviceFleet` API to get a description of the fleets you created.

```
sagemaker_client.describe_device_fleet(
    DeviceFleetName="sample-fleet-name"
)
```

By default, it returns the name of the fleet, the device fleet ARN, the Amazon S3 bucket URI, the IAM role, the role alias created in AWS IoT, a timestamp of when the fleet was created, and a timestamp of when the fleet was last modified.

```
{
  "DeviceFleetName": "sample-fleet-name",
  "DeviceFleetArn": "arn:aws:sagemaker:us-west-2:9999999999:device-fleet/sample-fleet-name",
  "IAMRole": "arn:aws:iam::999999999:role/rolename",
  "Description": "this is a sample fleet",
  "IoTRoleAlias": "arn:aws:iot:us-west-2:9999999999:rolealias/SagemakerEdge-sample-fleet-name",
  "OutputConfig": {
    "S3OutputLocation": "s3://bucket/folder",
    "KMSKeyId": "1234abcd-12ab-34cd-56ef-1234567890ab"
  },
  "CreationTime": "1600977370",
  "LastModifiedTime": "1600977370"
}
```

## Create a Fleet (Console)
<a name="edge-device-fleet-create-console"></a>

You can create a device fleet using the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker/).

1. In the SageMaker AI console, choose **Edge Manager** and then choose **Edge device fleets**.

1. Choose **Create device fleet**.  
![\[The location of the Create device fleet button in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-device-button-edited.png)

1. Enter a name for the device fleet in the **Device fleet name** field. Choose **Next**.  
![\[The location of the Next button in the Device fleet properties section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-device-fleet-filled.png)

1. On the **Output configuration** page, specify the Amazon S3 bucket URI where you want to store sample data from your device fleet. You can optionally add an encryption key as well by selecting an existing AWS KMS key from the dropdown list or by entering a key’s ARN. Choose **Submit**.  
![\[Example Output configuration page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-device-fleet-output-filled.png)

1. Choose the name of your device fleet to be redirected to the device fleet details. This page displays the name of the device fleet, ARN, description (if you provided one), date the fleet was created, last time the fleet was modified, Amazon S3 bucket URI, AWS KMS key ID (if provided), AWS IoT alias (if provided), and IAM role. If you added tags, they appear in the **Device fleet tags** section.

# Register a Device
<a name="edge-device-fleet-register"></a>

**Important**  
Device registration is required to use any part of SageMaker Edge Manager.

You can register a device programmatically with the AWS SDK for Python (Boto3) or through the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker/).

## Register a Device (Boto3)
<a name="edge-device-fleet-register-boto3"></a>

To register your device, first create and register an AWS IoT thing object and configure an IAM role. SageMaker Edge Manager takes advantage of the AWS IoT Core services to facilitate the connection between the edge devices and the cloud. You can take advantage of existing AWS IoT functionality after you set up your devices to work with Edge Manager.

To connect your device to AWS IoT, you need to create AWS IoT thing objects, create and register a client certificate with AWS IoT, and create and configure an IAM role for your devices.

See the [Getting Started Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/edge-manager-getting-started.html) for an in-depth example, or the hands-on tutorial [Explore AWS IoT Core services](https://docs.aws.amazon.com/iot/latest/developerguide/iot-gs-first-thing.html).

Use the `RegisterDevices` API to register your device. Provide the name of the fleet that you want the devices to be part of, as well as a name for each device. You can optionally add a description of the device, tags, and the AWS IoT thing name associated with the device.

```
sagemaker_client.register_devices(
    DeviceFleetName="sample-fleet-name",
    Devices=[
        {          
            "DeviceName": "sample-device-1",
            "IotThingName": "sample-thing-name-1",
            "Description": "Device #1"
        }
     ],
     Tags=[
        {
            "Key": "string", 
            "Value" : "string"
         }
     ],
)
```

## Register a Device (Console)
<a name="edge-device-fleet-register-console"></a>

You can register your device using the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker/).

1. In the SageMaker AI console, choose **Edge Inference** and then choose **Edge devices**.

1. Choose **Register devices**.  
![\[Location of Register devices in the Edge Devices section of the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/register-device-button.png)

1. In the **Device properties** section, enter the name of the fleet the device belongs to under the **Device fleet name** field. Choose **Next**.  
![\[The Device properties section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/register-devices-empty.png)

1. In the **Device source** section, add your devices one by one. You must include a **Device Name** for each device in your fleet. You can optionally provide a description (in the **Description** field) and an Internet of Things (IoT) object name (in the **IoT name** field). Choose **Submit** once you have added all your devices.  
![\[The Device source section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/register-devices-device-source.png)

   The **Devices** page displays the name of the device you have added, the fleet to which it belongs, when it was registered, the last heartbeat, and the description and AWS IoT name, if you provided one.

   Choose a device to view the device’s details, including the device name, fleet, ARN, description, IoT Thing name, when the device was registered, and the last heartbeat.

# Check Status
<a name="edge-device-fleet-check-status"></a>

Check that your device or fleet is connected and sampling data. Making periodic checks, manually or automatically, helps you confirm that your device or fleet is working properly.

Use the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker/) to interactively choose a fleet for a status check. You can also use the AWS SDK for Python (Boto3). The following describes the Boto3 APIs you can use to check the status of your device or fleet. Use the API that best fits your use case.
+ **Check an individual device.**

  To check the status of an individual device, use the `DescribeDevice` API. The response includes a list of one or more models if any models have been deployed to the device.

  ```
  sagemaker_client.describe_device(
      DeviceName="sample-device-1",
      DeviceFleetName="sample-fleet-name"
  )
  ```

  Running `DescribeDevice` returns:

  ```
  { "DeviceName": "sample-device",
    "Description": "this is a sample device",
    "DeviceFleetName": "sample-device-fleet",
    "IoTThingName": "SampleThing",
    "RegistrationTime": 1600977370,
    "LatestHeartbeat": 1600977370,
    "Models":[
          {
           "ModelName": "sample-model", 
           "ModelVersion": "1.1",
           "LatestSampleTime": 1600977370,
           "LatestInference": 1600977370 
          }
     ]
  }
  ```
+ **Check a fleet of devices.**

  To check the status of the fleet, use the `GetDeviceFleetReport` API. Provide the name of the device fleet to get a summary of the fleet.

  ```
  sagemaker_client.get_device_fleet_report(
      DeviceFleetName="sample-fleet-name"
  )
  ```
+ **Check for a heartbeat.**

  Each device within a fleet periodically generates a signal, or *heartbeat*. You can use the heartbeat to check that the device is communicating with Edge Manager. If the timestamp of the last heartbeat stops updating, the device may be failing.

  Check the last heartbeat made by a device with the `DescribeDevice` API. Specify the name of the device and the fleet to which it belongs.

  ```
  sagemaker_client.describe_device(
      DeviceName="sample-device-1",
      DeviceFleetName="sample-fleet-name"
  )
  ```
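
Using the epoch-second `LatestHeartbeat` value shown in the earlier `DescribeDevice` output, you can flag devices whose heartbeat has gone stale. This is an illustrative sketch: the 600-second threshold is an arbitrary choice, and depending on your SDK version the timestamp may be returned as a `datetime` object rather than an integer.

```python
import time

def heartbeat_age_seconds(latest_heartbeat_epoch, now=None):
    """Seconds elapsed since the device's last reported heartbeat."""
    now = time.time() if now is None else now
    return now - latest_heartbeat_epoch

def is_stale(latest_heartbeat_epoch, threshold_seconds=600, now=None):
    """True if the last heartbeat is older than the threshold."""
    return heartbeat_age_seconds(latest_heartbeat_epoch, now) > threshold_seconds

# Heartbeat from the sample response above, checked 1000 seconds later
print(is_stale(1600977370, threshold_seconds=600, now=1600978370))  # True
```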

# How to Package Model
<a name="edge-packaging-job"></a>

SageMaker Edge Manager packaging jobs take Amazon SageMaker Neo–compiled models and make any changes necessary to deploy the model with the inference engine, the Edge Manager agent.

**Topics**
+ [Complete prerequisites](edge-packaging-job-prerequisites.md)
+ [Package a Model (Amazon SageMaker AI Console)](edge-packaging-job-console.md)
+ [Package a Model (Boto3)](edge-packaging-job-boto3.md)

# Complete prerequisites
<a name="edge-packaging-job-prerequisites"></a>

To package a model, you must do the following:

1. **Compile your machine learning model with SageMaker AI Neo.**

   If you have not already done so, compile your model with SageMaker Neo. For more information on how to compile your model, see [Compile and Deploy Models with Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html). If you are a first-time user of SageMaker Neo, go through [Getting Started with Neo Edge Devices](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-getting-started-edge.html).

1. **Get the name of your compilation job.**

   Provide the name of the compilation job you used when you compiled your model with SageMaker Neo. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and choose **Compilation jobs** to find a list of compilation jobs that have been submitted to your AWS account. The names of submitted compilation jobs are in the **Name** column.

1. **Get your IAM ARN.**

   You need an Amazon Resource Name (ARN) of an IAM role that you can use to download and upload the model and contact SageMaker Neo.

   Use one of the following methods to get your IAM ARN:
   + **Programmatically with the SageMaker AI Python SDK**

     ```
     import sagemaker
     
     # Initialize SageMaker Session object so you can interact with AWS resources
     sess = sagemaker.Session()
     
     # Get the role ARN 
     role = sagemaker.get_execution_role()
     
     print(role)
     >> arn:aws:iam::<your-aws-account-id>:role/<your-role-name>
     ```

     For more information about using the SageMaker Python SDK, see the [SageMaker AI Python SDK API](https://sagemaker.readthedocs.io/en/stable/index.html).
   + **Using the AWS Identity and Access Management (IAM) console**

     Navigate to the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/). In the IAM **Resources** section, choose **Roles** to view a list of roles in your AWS account. Select or create a role that has `AmazonSageMakerFullAccess`, `AWSIoTFullAccess`, and `AmazonS3FullAccess`.

     For more information on IAM, see [What is IAM?](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html)

1. **Have an S3 bucket URI.**

   You need to have at least one Amazon Simple Storage Service (Amazon S3) bucket URI to store your Neo-compiled model, the output of the Edge Manager packaging job, and sample data from your device fleet.

   Use one of the following methods to create an Amazon S3 bucket:
   + **Programmatically with the SageMaker AI Python SDK**

     You can use the default Amazon S3 bucket during a session. A default bucket is created based on the following format: `sagemaker-{region}-{aws-account-id}`. To create a default bucket with the SageMaker Python SDK, use the following:

     ```
     import sagemaker
     
     session = sagemaker.Session()
     
     bucket=session.default_bucket()
     ```
   + **Using the Amazon S3 console**

     Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/) and see [How do I create an S3 Bucket?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html) for step-by-step instructions.

# Package a Model (Amazon SageMaker AI Console)
<a name="edge-packaging-job-console"></a>

You can create a SageMaker Edge Manager packaging job using the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). Before continuing, make sure you have satisfied the [Complete prerequisites](edge-packaging-job-prerequisites.md).

1. In the SageMaker AI console, choose **Edge Inference** and then choose **Create edge packaging jobs**, as shown in the following image.  
![\[Location of Create edge packaging jobs in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/pre-edge-packaging-button-edited.png)

1. On the **Job properties** page, enter a name for your packaging job under **Edge packaging job name**. Note that Edge Manager packaging job names are case-sensitive. Name your model and give it a version: enter this under **Model name** and **Model version**, respectively.

1. Next, select an **IAM role**. You can choose an existing role or let AWS create one for you. Optionally, specify a **resource key ARN** and **job tags**.

1. Choose **Next**.   
![\[Example of the Job properties section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-edge-packaging-job-filled.png)

1. Specify the name of the compilation job you used when compiling your model with SageMaker Neo in the **Compilation job name** field. Choose **Next**.  
![\[Example of the Model source section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-edge-packaging-job-model-source-filled.png)

1. On the **Output configuration** page, enter the Amazon S3 bucket URI in which you want to store the output of the packaging job.  
![\[Example Output configuration page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smith/create-device-fleet-output-filled.png)

   The **Status** column on the **Edge packaging jobs** page should read **IN PROGRESS**. Once the packaging job is complete, the status updates to **COMPLETED**.

   Selecting a packaging job directs you to that job's settings. The **Job settings** section displays the job name, ARN, status, creation time, last modified time, duration of the packaging job, and role ARN.

   The **Input configuration** section displays the location of the model artifacts, the data input configuration, and the machine learning framework of the model.

   The **Output configuration** section displays the output location of the packaging job, the target device for which the model was compiled, and any tags you created.

1. Choose the name of your device fleet to be redirected to the device fleet details. This page displays the name of the device fleet, ARN, description (if you provided one), date the fleet was created, last time the fleet was modified, Amazon S3 bucket URI, AWS KMS key ID (if provided), AWS IoT alias (if provided), and IAM role. If you added tags, they appear in the **Device fleet tags** section.

# Package a Model (Boto3)
<a name="edge-packaging-job-boto3"></a>

You can create a SageMaker Edge Manager packaging job with the AWS SDK for Python (Boto3). Before continuing, make sure you have satisfied the [Complete prerequisites](edge-packaging-job-prerequisites.md).

To request an edge packaging job, use `CreateEdgePackagingJob`. You need to provide a name for your edge packaging job, the name of your SageMaker Neo compilation job, your IAM role Amazon Resource Name (ARN), a name for your model, a version for your model, and the Amazon S3 bucket URI where you want to store the output of your packaging job. Note that Edge Manager packaging job names and SageMaker Neo compilation job names are case-sensitive.

```
# Import AWS SDK for Python (Boto3)
import boto3

# Create Edge client so you can submit a packaging job
sagemaker_client = boto3.client("sagemaker", region_name='aws-region')

sagemaker_client.create_edge_packaging_job(
    EdgePackagingJobName="edge-packaging-name",
    CompilationJobName="neo-compilation-name",
    RoleArn="arn:aws:iam::99999999999:role/rolename",
    ModelName="sample-model-name",
    ModelVersion="model-version",
    OutputConfig={
        "S3OutputLocation": "s3://your-bucket/",
    }
)
```

You can check the status of an edge packaging job using `DescribeEdgePackagingJob` and providing the case-sensitive edge packaging job name:

```
response = sagemaker_client.describe_edge_packaging_job(
                                    EdgePackagingJobName="edge-packaging-name")
```

This returns a dictionary that can be used to poll the status of the packaging job:

```
# Optional - Poll every 30 sec to check completion status
import time

while True:
    response = sagemaker_client.describe_edge_packaging_job(
                                         EdgePackagingJobName="edge-packaging-name")
    
    if response['EdgePackagingJobStatus'] == 'Completed':
        break
    elif response['EdgePackagingJobStatus'] == 'Failed':
        raise RuntimeError('Packaging job failed')
    print('Packaging model...')
    time.sleep(30)
print('Done!')
```

For a list of packaging jobs, use the `ListEdgePackagingJobs` API. You can use this API to search for a specific packaging job: provide a partial job name for `NameContains`, or a partial model name for `ModelNameContains`, to filter for jobs whose job or model names contain those strings. Specify the column to sort by with `SortBy`, and the sort direction with `SortOrder` (either `Ascending` or `Descending`).

```
sagemaker_client.list_edge_packaging_jobs(
    NameContains="sample",
    ModelNameContains="sample",
    SortBy="column-name",
    SortOrder="Descending"
)
```
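
`ListEdgePackagingJobs` returns results in pages, with a `NextToken` in the response when more jobs remain. The following pagination helper is an illustrative sketch (the function itself is not part of any SDK; `EdgePackagingJobSummaries` and `NextToken` are the documented response keys):

```python
def all_packaging_jobs(list_fn, **kwargs):
    """Collect job summaries across every page of ListEdgePackagingJobs results."""
    jobs, token = [], None
    while True:
        params = dict(kwargs)
        if token:
            params["NextToken"] = token
        page = list_fn(**params)
        jobs.extend(page.get("EdgePackagingJobSummaries", []))
        token = page.get("NextToken")
        if not token:
            return jobs

# Usage sketch:
# jobs = all_packaging_jobs(sagemaker_client.list_edge_packaging_jobs, NameContains="sample")
```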

To stop a packaging job, use `StopEdgePackagingJob` and provide the name of your edge packaging job.

```
sagemaker_client.stop_edge_packaging_job(
        EdgePackagingJobName="edge-packaging-name"
)
```

For a full list of Edge Manager APIs, see the [Boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

# The Edge Manager Agent
<a name="edge-device-fleet-about"></a>

The Edge Manager agent is an inference engine for your edge devices. Use the agent to make predictions with models loaded onto your edge devices. The agent also collects model metrics and captures data at specific intervals. Sample data is stored in your Amazon S3 bucket.

There are two methods of installing and deploying the Edge Manager agent onto your edge devices:

1. Download the agent as a binary from the Amazon S3 release bucket. For more information, see [Download and Set Up the Edge Manager Agent Manually](edge-device-fleet-manual.md).

1. Use the AWS IoT Greengrass V2 console or the AWS CLI to deploy `aws.greengrass.SageMakerEdgeManager`. See [Create the AWS IoT Greengrass V2 Components](edge-greengrass-custom-component.md).

# Download and Set Up the Edge Manager Agent Manually
<a name="edge-device-fleet-manual"></a>

Download the Edge Manager agent based on your operating system, architecture, and AWS Region. The agent is periodically updated, so you have the option to choose your agent based on release dates and versions. Once you have the agent, create a JSON configuration file. Specify the device IoT thing name, fleet name, device credentials, and other key-value pairs. See [Running the Edge Manager agent](#edge-device-fleet-running-agent) for a full list of keys you must specify in the configuration file. You can run the agent as an executable binary or link against it as a dynamic shared object (DSO).

## How the agent works
<a name="edge-device-fleet-how-agent-works"></a>

The agent runs on the CPU of your devices. The agent runs inference on the framework and hardware of the target device you specified during the compilation job. For example, if you compiled your model for the Jetson Nano, the agent supports the GPU in the provided [Deep Learning Runtime](https://github.com/neo-ai/neo-ai-dlr) (DLR).

The agent is released in binary format for supported operating systems. Check that your operating system is supported and meets the minimum OS requirement in the following table:

------
#### [ Linux ]

**Version:** Ubuntu 18.04

**Supported Binary Formats:** x86-64 bit (ELF binary) and ARMv8 64 bit (ELF binary)

------
#### [ Windows ]

**Version:** Windows 10 version 1909

**Supported Binary Formats:** x86-32 bit (DLL) and x86-64 bit (DLL)

------

## Installing the Edge Manager agent
<a name="edge-device-fleet-installation"></a>

To use the Edge Manager agent, you first must obtain the release artifacts and a root certificate. The release artifacts are stored in an Amazon S3 bucket in the `us-west-2` Region. To download the artifacts, specify your operating system (`<OS>`) and the `<VERSION>`.

Based on your operating system, replace `<OS>` with one of the following:


| Windows 32-bit | Windows 64-bit | Linux x86-64 | Linux ARMv8 | 
| --- | --- | --- | --- | 
| windows-x86 | windows-x64 | linux-x64 | linux-armv8 | 

The `VERSION` is broken into three components: `<MAJOR_VERSION>.<YYYY-MM-DD>.<SHA-7>`, where:
+ `<MAJOR_VERSION>`: The release version. The release version is currently set to `1`.
+ `<YYYY-MM-DD>`: The time stamp of the artifacts release.
+ `<SHA-7>`: The repository commit ID from which the release is built.

You must provide the `<MAJOR_VERSION>` and the time stamp in `YYYY-MM-DD` format. We suggest you use the latest artifact release time stamp.

Run the following in your command line to get the latest time stamp. Replace `<OS>` with your operating system:

```
aws s3 ls s3://sagemaker-edge-release-store-us-west-2-<OS>/Releases/ | sort -r
```

For example, if you have a Windows 32-bit OS, run:

```
aws s3 ls s3://sagemaker-edge-release-store-us-west-2-windows-x86/Releases/ | sort -r
```

This returns:

```
2020-12-01 23:33:36 0 

                    PRE 1.20201218.81f481f/
                    PRE 1.20201207.02d0e97/
```

The return output in this example shows two release artifacts. The first release artifact file notes that the release version has a major release version of `1`, a time stamp of `20201218` (in YYYY-MM-DD format), and a `81f481f` SHA-7 commit ID.
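
Because the release prefixes follow the version format described above, you can also pick the newest release programmatically. The following is a minimal sketch, assuming the prefixes come from a listing like the one shown (the helper names are illustrative):

```python
def parse_release(prefix):
    """Split a release prefix like '1.20201218.81f481f/' into (major, date, sha)."""
    major, date, sha = prefix.strip().rstrip("/").split(".")
    return int(major), date, sha

def latest_release(prefixes):
    """Choose the release with the most recent YYYYMMDD time stamp."""
    return max(prefixes, key=lambda p: parse_release(p)[1])

releases = ["1.20201218.81f481f/", "1.20201207.02d0e97/"]
print(latest_release(releases))  # 1.20201218.81f481f/
```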

**Note**  
The preceding command assumes you have configured the AWS Command Line Interface. For more information about how to configure the settings that the AWS CLI uses to interact with AWS, see [Configuring the AWS CLI](https://docs.aws.amazon.com//cli/latest/userguide/cli-chap-configure.html).

Based on your operating system, use the following commands to install the artifacts:

------
#### [ Windows 32-bit ]

```
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-windows-x86/Releases/<VERSION>/<VERSION>.zip .
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-windows-x86/Releases/<VERSION>/sha256_hex.shasum .
```

------
#### [ Windows 64-bit ]

```
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-windows-x64/Releases/<VERSION>/<VERSION>.zip .
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-windows-x64/Releases/<VERSION>/sha256_hex.shasum .
```

------
#### [ Linux x86-64 ]

```
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-x64/Releases/<VERSION>/<VERSION>.tgz .
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-x64/Releases/<VERSION>/sha256_hex.shasum .
```

------
#### [ Linux ARMv8 ]

```
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Releases/<VERSION>/<VERSION>.tgz .
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Releases/<VERSION>/sha256_hex.shasum .
```

------

You also must download a root certificate. This certificate validates model artifacts signed by AWS before loading them onto your edge devices.

Replace `<OS>` with the value corresponding to your platform from the list of supported operating systems, and replace `<REGION>` with your AWS Region.

```
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-<OS>/Certificates/<REGION>/<REGION>.pem .
```

## Running the Edge Manager agent
<a name="edge-device-fleet-running-agent"></a>

You can run the SageMaker AI Edge Manager agent as a standalone process in the form of an Executable and Linkable Format (ELF) executable binary, or you can link against it as a dynamic shared object (.dll). Running the agent as a standalone executable binary is the preferred mode and is supported on Linux. Windows supports running it as a shared object (.dll).

On Linux, we recommend that you run the binary via a service that’s a part of your initialization (`init`) system. If you want to run the binary directly, you can do so in a terminal as shown in the following example. If you have a modern OS, there are no other installations necessary prior to running the agent, since all the requirements are statically built into the executable. This gives you flexibility to run the agent on the terminal, as a service, or within a container.
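
If you choose to run the binary under `systemd`, a minimal service unit could look like the following sketch. All paths and names here are illustrative assumptions; adjust them to wherever you placed the binary and configuration file on your device.

```
[Unit]
Description=SageMaker Edge Manager agent
After=network-online.target

[Service]
ExecStart=/opt/edge_manager/bin/sagemaker_edge_agent_binary -a /tmp/sagemaker_edge_agent.sock -c /opt/edge_manager/sagemaker_edge_config.json
Restart=on-failure

[Install]
WantedBy=multi-user.target
```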

To run the agent, first create a JSON configuration file. Specify the following key-value pairs:
+ `sagemaker_edge_core_device_name`: The name of the device. This device name needs to be registered along with the device fleet in the SageMaker Edge Manager console.
+ `sagemaker_edge_core_device_fleet_name`: The name of the fleet to which the device belongs.
+ `sagemaker_edge_core_region`: The AWS Region associated with the device, the fleet, and the Amazon S3 buckets. This corresponds to the Region where the device is registered and where the Amazon S3 bucket is created (they are expected to be the same). The models themselves can be compiled with SageMaker Neo in a different Region; this configuration is not related to the model compilation Region.
+ `sagemaker_edge_core_root_certs_path`: The absolute folder path to root certificates. This is used to validate the device with the relevant AWS account.
+ `sagemaker_edge_provider_aws_ca_cert_file`: The absolute path to Amazon Root CA certificate (AmazonRootCA1.pem). This is used to validate the device with the relevant AWS account. `AmazonCA` is a certificate owned by AWS.
+ `sagemaker_edge_provider_aws_cert_file`: The absolute path to AWS IoT signing root certificate (`*.pem.crt`).
+ `sagemaker_edge_provider_aws_cert_pk_file`: The absolute path to the AWS IoT private key (`*.pem.key`).
+ `sagemaker_edge_provider_aws_iot_cred_endpoint`: The AWS IoT credentials endpoint (*identifier*.iot.*region*.amazonaws.com). This endpoint is used for credential validation. See [Connecting devices to AWS IoT](https://docs.aws.amazon.com/iot/latest/developerguide/iot-connect-devices.html) for more information.
+ `sagemaker_edge_provider_provider`: This indicates the implementation of the provider interface being used. The provider interface communicates with the end network services for uploads, heartbeats and registration validation. By default this is set to `"Aws"`. We allow custom implementations of the provider interface. It can be set to `None` for no provider or `Custom` for custom implementation with the relevant shared object path provided.
+ `sagemaker_edge_provider_provider_path`: The absolute path to the provider implementation shared object (.so or .dll file). The `"Aws"` provider .dll or .so file is provided with the agent release. This field is mandatory.
+ `sagemaker_edge_provider_s3_bucket_name`: The name of your Amazon S3 bucket (not the Amazon S3 bucket URI). The bucket must have a `sagemaker` string within its name.
+ `sagemaker_edge_log_verbose` (Boolean): Optional. Sets the debug log. Specify either `True` or `False`.
+ `sagemaker_edge_telemetry_libsystemd_path`: For Linux only, `systemd` implements the agent crash counter metric. Set the absolute path of libsystemd to turn on the crash counter metric. You can find the default libsystemd path by running `whereis libsystemd` in the device terminal.
+ `sagemaker_edge_core_capture_data_destination`: The destination for uploading capture data. Choose either `"Cloud"` or `"Disk"`. The default is `"Disk"`. Setting it to `"Disk"` writes the input and output tensor(s) and auxiliary data to the local file system, at a location you choose. When writing to `"Cloud"`, the agent uses the Amazon S3 bucket name provided in the `sagemaker_edge_provider_s3_bucket_name` configuration.
+ `sagemaker_edge_core_capture_data_disk_path`: Set the absolute path in the local file system, into which capture data files are written when `"Disk"` is the destination. This field is not used when `"Cloud"` is specified as the destination.
+ `sagemaker_edge_core_folder_prefix`: The parent prefix in Amazon S3 under which captured data is stored when you specify `"Cloud"` as the capture data destination. When `"Disk"` is the destination, captured data is stored in a subfolder under `sagemaker_edge_core_capture_data_disk_path`.
+ `sagemaker_edge_core_capture_data_buffer_size` (Integer value) : The capture data circular buffer size. It indicates the maximum number of requests stored in the buffer.
+ `sagemaker_edge_core_capture_data_batch_size` (Integer value): The capture data batch size. It indicates the size of a batch of requests that are handled from the buffer. This value must be less than `sagemaker_edge_core_capture_data_buffer_size`. A batch size of at most half the buffer size is recommended.
+ `sagemaker_edge_core_capture_data_push_period_seconds` (Integer value): The capture data push period in seconds. A batch of requests in the buffer is handled when there are batch size requests in the buffer, or when this time period has completed (whichever comes first). This configuration sets that time period.
+ `sagemaker_edge_core_capture_data_base64_embed_limit` (Integer value): The limit, in bytes, for uploading capture data.

Your configuration file should look similar to the following example (with your specific values). This example uses the default AWS provider (`"Aws"`) and does not specify a periodic upload.

```
{
    "sagemaker_edge_core_device_name": "device-name",
    "sagemaker_edge_core_device_fleet_name": "fleet-name",
    "sagemaker_edge_core_region": "region",
    "sagemaker_edge_core_root_certs_path": "<Absolute path to root certificates>",
    "sagemaker_edge_provider_provider": "Aws",
    "sagemaker_edge_provider_provider_path" : "/path/to/libprovider_aws.so",
    "sagemaker_edge_provider_aws_ca_cert_file": "<Absolute path to Amazon Root CA certificate>/AmazonRootCA1.pem",
    "sagemaker_edge_provider_aws_cert_file": "<Absolute path to AWS IoT signing root certificate>/device.pem.crt",
    "sagemaker_edge_provider_aws_cert_pk_file": "<Absolute path to AWS IoT private key.>/private.pem.key",
    "sagemaker_edge_provider_aws_iot_cred_endpoint": "https://<AWS IoT Endpoint Address>",
    "sagemaker_edge_core_capture_data_destination": "Cloud",
    "sagemaker_edge_provider_s3_bucket_name": "sagemaker-bucket-name",
    "sagemaker_edge_core_folder_prefix": "Amazon S3 folder prefix",
    "sagemaker_edge_core_capture_data_buffer_size": 30,
    "sagemaker_edge_core_capture_data_batch_size": 10,
    "sagemaker_edge_core_capture_data_push_period_seconds": 4000,
    "sagemaker_edge_core_capture_data_base64_embed_limit": 2,
    "sagemaker_edge_log_verbose": false
}
```
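
Because the agent reads this file at startup, a quick sanity check of the JSON before deploying it to devices can catch typos early. The following is a minimal sketch; the required-key set is an illustrative subset of the keys described above, not an official schema.

```python
import json

# Illustrative subset of keys the agent configuration needs (see the list above)
REQUIRED_KEYS = {
    "sagemaker_edge_core_device_name",
    "sagemaker_edge_core_device_fleet_name",
    "sagemaker_edge_core_region",
    "sagemaker_edge_core_root_certs_path",
    "sagemaker_edge_provider_provider",
    "sagemaker_edge_provider_provider_path",
}

def missing_keys(config_text):
    """Return required keys absent from the JSON configuration document."""
    config = json.loads(config_text)
    return sorted(REQUIRED_KEYS - config.keys())

print(missing_keys('{"sagemaker_edge_core_device_name": "device-name"}'))
```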

The release artifact includes a binary executable called `sagemaker_edge_agent_binary` in the `/bin` directory. To run the binary, use the `-a` flag to create a socket file descriptor (.sock) in a directory of your choosing and specify the path of the agent JSON config file you created with the `-c` flag.

```
./sagemaker_edge_agent_binary -a <ADDRESS_TO_SOCKET> -c <PATH_TO_CONFIG_FILE>
```

The following example shows the code snippet with a directory and file path specified:

```
./sagemaker_edge_agent_binary -a /tmp/sagemaker_edge_agent_example.sock -c sagemaker_edge_config.json
```

In this example, a socket file descriptor named `sagemaker_edge_agent_example.sock` is created in the `/tmp` directory and points to a configuration file that is in the same working directory as the agent called `sagemaker_edge_config.json`.

# Model Package and Edge Manager Agent Deployment with AWS IoT Greengrass
<a name="edge-greengrass"></a>

SageMaker Edge Manager integrates with AWS IoT Greengrass version 2 to simplify accessing, maintaining, and deploying the Edge Manager agent and model to your devices. Without AWS IoT Greengrass V2, setting up your devices and fleets to use SageMaker Edge Manager requires you to manually copy the Edge Manager agent from an Amazon S3 release bucket. You use the agent to make predictions with models loaded onto your edge devices. With AWS IoT Greengrass V2 and SageMaker Edge Manager integration, you can use AWS IoT Greengrass V2 components. Components are pre-built software modules that can connect your edge devices to AWS services or third-party services via AWS IoT Greengrass.

You must install the AWS IoT Greengrass Core software onto your device(s) if you want to use AWS IoT Greengrass V2 to deploy the Edge Manager agent and your model. For more information about device requirements and how to set up your devices, see [Setting up AWS IoT Greengrass core devices](https://docs.aws.amazon.com/greengrass/v2/developerguide/setting-up.html) in the AWS IoT Greengrass documentation.

You use the following three components to deploy the Edge Manager agent:
+ *A pre-built public component*: SageMaker AI maintains the public Edge Manager component.
+ *An autogenerated private component*: The private component is autogenerated when you package your machine learning model with the [CreateEdgePackagingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEdgePackagingJob.html) API and specify `GreengrassV2Component` for the Edge Manager API field `PresetDeploymentType`.
+ *A custom component*: This is the inference application that is responsible for preprocessing and making inferences on your device. You must create this component. See either [Create a Hello World custom component](edge-greengrass-custom-component.md#edge-greengrass-create-custom-component-how) in the SageMaker Edge Manager documentation or [Create custom AWS IoT Greengrass components](https://docs.aws.amazon.com/greengrass/v2/developerguide/create-components.html) in the AWS IoT Greengrass documentation for more information on how to create custom components.

# Complete prerequisites to deploy the Edge Manager agent
<a name="edge-greengrass-prerequisites"></a>

SageMaker Edge Manager uses AWS IoT Greengrass V2 to simplify the deployment of the Edge Manager agent, your machine learning models, and your inference application to your devices with the use of components. To make it easier to maintain your AWS IAM roles, Edge Manager allows you to reuse your existing AWS IoT role alias. If you do not have one yet, Edge Manager generates a role alias as part of the Edge Manager packaging job. You no longer need to associate a role alias generated from the SageMaker Edge Manager packaging job with your AWS IoT role. 

Before you start, you must complete the following prerequisites:

1. Install the AWS IoT Greengrass Core software. For detailed information, see [Install the AWS IoT Greengrass Core software](https://docs.aws.amazon.com/greengrass/v2/developerguide/getting-started.html#install-greengrass-v2).

1. Set up AWS IoT Greengrass V2. For more information, see [Install AWS IoT Greengrass Core software with manual resource provisioning](https://docs.aws.amazon.com/greengrass/v2/developerguide/manual-installation.html).
**Note**  
Make sure the AWS IoT thing name is all lowercase and contains no characters other than (optionally) dashes (`-`).
The IAM role name must start with `SageMaker`.

1. Attach the following permission and inline policy to the IAM role created during AWS IoT Greengrass V2 setup.
   + Navigate to the IAM console [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).
   + Search for the role you created by typing the role name in the **Search** field.
   + Choose your role.
   + Next, choose **Attach policies**.
   + Search for **AmazonSageMakerEdgeDeviceFleetPolicy**.
   + Select **AmazonSageMakerFullAccess**. (This is an optional step that makes it easier for you to reuse this IAM role for model compilation and packaging.)
   + Add the required permissions to the role's permissions policy; don't attach inline policies to IAM users.

------
#### [ JSON ]

****  

     ```
     {
         "Version":"2012-10-17",
         "Statement":[
           {
             "Sid":"GreengrassComponentAccess",
             "Effect":"Allow",
             "Action":[
                 "greengrass:CreateComponentVersion",
                 "greengrass:DescribeComponent"
             ],
             "Resource":"*"
            }
         ]
     }
     ```

------
   + Choose **Attach policy**.
   + Choose **Trust relationship**.
   + Choose **Edit trust relationship**.
   + Replace the content with the following.

------
#### [ JSON ]

****  

     ```
     {
       "Version":"2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "Service": "credentials.iot.amazonaws.com"
           },
           "Action": "sts:AssumeRole"
         },
         {
           "Effect": "Allow",
           "Principal": {
             "Service": "sagemaker.amazonaws.com"
           },
           "Action": "sts:AssumeRole"
         }
       ]
     }
     ```

------

1. Create an Edge Manager device fleet. For information on how to create a fleet, see [Setup for Devices and Fleets in SageMaker Edge Manager](edge-device-fleet.md).

1. Register your device with the same name as your AWS IoT thing name created during the AWS IoT Greengrass V2 setup.

1. Create at least one custom private AWS IoT Greengrass component. This component is the application that runs inference on the device. For more information, see [Create a Hello World custom component](edge-greengrass-custom-component.md#edge-greengrass-create-custom-component-how).

**Note**  
The SageMaker Edge Manager and AWS IoT Greengrass integration works only with AWS IoT Greengrass v2.
Your AWS IoT thing name and your Edge Manager device name must be the same.
SageMaker Edge Manager does not load local AWS IoT certificates or call the AWS IoT credential provider endpoint directly. Instead, it uses the AWS IoT Greengrass v2 TokenExchangeService to fetch a temporary credential from a TES endpoint.

# Create the AWS IoT Greengrass V2 Components
<a name="edge-greengrass-custom-component"></a>

AWS IoT Greengrass uses *components*, software modules that are deployed to and run on an AWS IoT Greengrass core device. You need at least three components:

1. *A public Edge Manager agent AWS IoT Greengrass component*, which deploys the Edge Manager agent binary.

1. *A model component* that is autogenerated when you package your machine learning model with either the AWS SDK for Python (Boto3) API or with the SageMaker AI console. For information, see [Create an autogenerated component](#edge-greengrass-autogenerate-component-how).

1. *A private, custom component* that implements the Edge Manager agent client application and performs any preprocessing and postprocessing of the inference results. For more information about how to create a custom component, see [Create a Hello World custom component](#edge-greengrass-create-custom-component-how) or [Create custom AWS IoT Greengrass components](https://docs.aws.amazon.com/greengrass/v2/developerguide/create-components.html).

## Create an autogenerated component
<a name="edge-greengrass-autogenerate-component-how"></a>

Generate the model component with the [CreateEdgePackagingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEdgePackagingJob.html) API by specifying `GreengrassV2Component` for the packaging job's `PresetDeploymentType` field. When you call the `CreateEdgePackagingJob` API, Edge Manager takes your SageMaker AI Neo–compiled model in Amazon S3 and creates a model component. The model component is automatically stored in your account. You can view any of your components by navigating to the [AWS IoT console](https://console.aws.amazon.com/greengrass/). Select **Greengrass** and then select **Core devices**. The page lists the AWS IoT Greengrass core devices associated with your account. If a component name is not specified in `PresetDeploymentConfig`, the default name consists of `"SagemakerEdgeManager"` and the name of your Edge Manager packaging job. The following example demonstrates how to have Edge Manager create an AWS IoT Greengrass V2 component with the `CreateEdgePackagingJob` API.

```
import json

import boto3

# Create a SageMaker AI client object to interact with SageMaker AI.
sagemaker_client = boto3.client('sagemaker', region_name='<YOUR_REGION>')

# Replace with your IAM role ARN.
sagemaker_role_arn = "arn:aws:iam::<account>:role/*"

# Replace the string with the name of your already created S3 bucket.
bucket = 'amzn-s3-demo-bucket-edge-manager'

# Specify a name for your edge packaging job.
edge_packaging_name = "edge_packaging_job_demo"

# Replace the following string with the name you used for the SageMaker Neo compilation job.
compilation_job_name = "getting-started-demo"

# The name of the model and the model version.
model_name = "sample-model"
model_version = "1.1"

# Output directory in S3 where you want to store the packaged model.
packaging_output_dir = 'packaged_models'
packaging_s3_output = 's3://{}/{}'.format(bucket, packaging_output_dir)

# The name you want your Greengrass component to have.
component_name = "SagemakerEdgeManager" + edge_packaging_name

sagemaker_client.create_edge_packaging_job(
    EdgePackagingJobName=edge_packaging_name,
    CompilationJobName=compilation_job_name,
    RoleArn=sagemaker_role_arn,
    ModelName=model_name,
    ModelVersion=model_version,
    OutputConfig={
        "S3OutputLocation": packaging_s3_output,
        "PresetDeploymentType": "GreengrassV2Component",
        "PresetDeploymentConfig": json.dumps(
            {"ComponentName": component_name, "ComponentVersion": "1.0.2"}
        )
    }
)
```

You can also create the autogenerated component with the SageMaker AI console. Follow steps 1-6 in [Package a Model (Amazon SageMaker AI Console)](edge-packaging-job-console.md).

Enter the Amazon S3 bucket URI where you want to store the output of the packaging job and, optionally, an encryption key.

Complete the following to create the model component:

1. Choose **Preset deployment**.

1. Specify the name of the component for the **Component name** field.

1. Optionally, provide a description of the component, a component version, the platform OS, or the platform architecture for the **Component description**, **Component version**, **Platform OS**, and **Platform architecture**, respectively.

1. Choose **Submit**.

## Create a Hello World custom component
<a name="edge-greengrass-create-custom-component-how"></a>

The custom application component is used to perform inference on the edge device. The component is responsible for loading models to SageMaker Edge Manager, invoking the Edge Manager agent for inference, and unloading the model when the component shuts down. Before you create your component, ensure the agent and application can communicate with Edge Manager. To do this, configure [gRPC](https://grpc.io/). The Edge Manager agent uses methods defined in Protocol Buffers and a gRPC server to establish communication between the client application on the edge device and the cloud.

To use gRPC, you must:

1. Create a gRPC stub using the .proto file provided when you download the Edge Manager agent from the Amazon S3 release bucket.

1. Write client code with the language you prefer.

You do not need to define the service in a .proto file. The service .proto files are included in the compressed TAR file when you download the Edge Manager agent release binary from the Amazon S3 release bucket.

Install gRPC and other necessary tools on your host machine and create the gRPC stubs `agent_pb2_grpc.py` and `agent_pb2.py` in Python. Make sure you have `agent.proto` in your local directory.

```
%%bash
pip install grpcio
pip install grpcio-tools
python3 -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. agent.proto
```

The preceding code generates the gRPC client and server interfaces from your .proto service definition. In other words, it creates the gRPC model in Python. The API directory contains the Protobuf specification for communicating with the agent.

Next, use the gRPC API to write a client and server for your service. The following example script, `edge_manager_python_example.py`, uses Python to load, list, and unload a `yolov3` model on the edge device.

```
import grpc
from PIL import Image
import agent_pb2
import agent_pb2_grpc


model_path = '<PATH-TO-SagemakerEdgeManager-COMPONENT>' 
                    
agent_socket = 'unix:///tmp/aws.greengrass.SageMakerEdgeManager.sock'

agent_channel = grpc.insecure_channel(agent_socket, options=(('grpc.enable_http_proxy', 0),))

agent_client = agent_pb2_grpc.AgentStub(agent_channel)


def list_models():
    return agent_client.ListModels(agent_pb2.ListModelsRequest())


def list_model_tensors(models):
    # Map each loaded model to its input and output tensor metadata.
    return {
        model.name: {
            'inputs': model.input_tensor_metadatas,
            'outputs': model.output_tensor_metadatas
        }
        for model in models.models
    }


def load_model(model_name, model_path):
    load_request = agent_pb2.LoadModelRequest()
    load_request.url = model_path
    load_request.name = model_name
    return agent_client.LoadModel(load_request)


def unload_model(name):
    unload_request = agent_pb2.UnLoadModelRequest()
    unload_request.name = name
    return agent_client.UnLoadModel(unload_request)


def predict_image(model_name, image_path):
    image_tensor = agent_pb2.Tensor()
    image_tensor.byte_data = Image.open(image_path).tobytes()
    image_tensor_metadata = list_model_tensors(list_models())[model_name]['inputs'][0]
    image_tensor.tensor_metadata.name = image_tensor_metadata.name
    image_tensor.tensor_metadata.data_type = image_tensor_metadata.data_type
    for shape in image_tensor_metadata.shape:
        image_tensor.tensor_metadata.shape.append(shape)
    predict_request = agent_pb2.PredictRequest()
    predict_request.name = model_name
    predict_request.tensors.append(image_tensor)
    predict_response = agent_client.Predict(predict_request)
    return predict_response

def main():
    try:
        unload_model('your-model')
    except:
        pass
  
    print('LoadModel...', end='')
    try:
        load_model('your-model', model_path)
        print('done.')
    except Exception as e:
        print()
        print(e)
        print('Model already loaded!')
        
    print('ListModels...', end='')
    try:
        print(list_models())
        print('done.')
        
    except Exception as e:
        print()
        print(e)
        print('List model failed!')
       
    print('Unload model...', end='')
    try:
        unload_model('your-model')
        print('done.')
    except Exception as e:
        print()
        print(e)
        print('unload model failed!')

if __name__ == '__main__':
    main()
```

If you use this client code example, ensure that `model_path` points to the name of the AWS IoT Greengrass component that contains the model.

You can create your AWS IoT Greengrass V2 Hello World component once you have generated your gRPC stubs and you have your Hello World code ready. To do so:
+ Upload your `edge_manager_python_example.py`, `agent_pb2_grpc.py`, and `agent_pb2.py` to your Amazon S3 bucket and note down their Amazon S3 path.
+ Create a private component in the AWS IoT Greengrass V2 console and define the recipe for your component. Specify the Amazon S3 URI to your Hello World application and gRPC stub in the following recipe.

  ```
  ---
  RecipeFormatVersion: 2020-01-25
  ComponentName: com.sagemaker.edgePythonExample
  ComponentVersion: 1.0.0
  ComponentDescription: Sagemaker Edge Manager Python example
  ComponentPublisher: Amazon Web Services, Inc.
  ComponentDependencies:
    aws.greengrass.SageMakerEdgeManager:
      VersionRequirement: '>=1.0.0'
      DependencyType: HARD
  Manifests:
    - Platform:
        os: linux
        architecture: "/amd64|x86/"
      Lifecycle:
        install: |-
          apt-get install python3-pip
          pip3 install grpcio
          pip3 install grpcio-tools
          pip3 install protobuf
          pip3 install Pillow
        run:
          script: |- 
            python3 {artifacts:path}/edge_manager_python_example.py
      Artifacts:
        - URI: <code-s3-path>
        - URI: <pb2-s3-path>
        - URI: <pb2-grpc-s3-path>
  ```

For detailed information about creating a Hello World recipe, see [Create your first component](https://docs.aws.amazon.com/greengrass/v2/developerguide/getting-started.html#create-first-component) in the AWS IoT Greengrass documentation.

# Deploy the components to your device
<a name="edge-greengrass-deploy-components"></a>

Deploy your components with the AWS IoT console or with the AWS CLI.

## To deploy your components (console)
<a name="collapsible-section-gg-deploy-console"></a>

Deploy your AWS IoT Greengrass components with the AWS IoT console.

1. In the [AWS IoT Greengrass console](https://console.aws.amazon.com/greengrass/), choose **Components** from the navigation menu.

1. On the **Components** page, on the **Public components** tab, choose `aws.greengrass.SageMakerEdgeManager`.

1. On the `aws.greengrass.SageMakerEdgeManager` page, choose **Deploy**.

1. From **Add to deployment**, choose one of the following:

   1. To merge this component to an existing deployment on your target device, choose **Add to existing deployment**, and then select the deployment that you want to revise.

   1. To create a new deployment on your target device, choose **Create new deployment**. If you have an existing deployment on your device, choosing this option replaces the existing deployment.

1. On the **Specify target** page, do the following:

   1. Under **Deployment information**, enter or modify the friendly name for your deployment.

   1. Under **Deployment targets**, select a target for your deployment, and choose **Next**. You cannot change the deployment target if you are revising an existing deployment.

1. On the **Select components** page, under **My components**, choose:
   + com.*<CUSTOM-COMPONENT-NAME>*
   + `aws.greengrass.SageMakerEdgeManager`
   + SagemakerEdgeManager.*<YOUR-PACKAGING-JOB>*

1. On the **Configure components** page, choose **aws.greengrass.SageMakerEdgeManager**, and do the following.

   1. Choose **Configure component**.

   1. Under **Configuration update**, in **Configuration to merge**, enter the following configuration.

      ```
      {
          "DeviceFleetName": "device-fleet-name",
          "BucketName": "bucket-name"
      }
      ```

      Replace *`device-fleet-name`* with the name of the edge device fleet that you created, and replace *`bucket-name`* with the name of the Amazon S3 bucket that is associated with your device fleet.

   1. Choose **Confirm**, and then choose **Next**.

1. On the **Configure advanced settings** page, keep the default configuration settings, and choose **Next**.

1. On the **Review** page, choose **Deploy**.

## To deploy your components (AWS CLI)
<a name="collapsible-section-gg-deploy-cli"></a>

1. Create a `deployment.json` file to define the deployment configuration for your SageMaker Edge Manager components. This file should look like the following example.

   ```
   {
     "targetArn":"targetArn",
     "components": {
       "aws.greengrass.SageMakerEdgeManager": {
         "componentVersion": "1.0.0",
         "configurationUpdate": {
           "merge": "{\"DeviceFleetName\":\"device-fleet-name\",\"BucketName\":\"bucket-name\"}"
         }
       },
       "com.greengrass.SageMakerEdgeManager.ImageClassification": {
         "componentVersion": "1.0.0",
         "configurationUpdate": {}
       },
       "com.greengrass.SageMakerEdgeManager.ImageClassification.Model": {
         "componentVersion": "1.0.0",
         "configurationUpdate": {}
       }
     }
   }
   ```
   + In the `targetArn` field, replace *`targetArn`* with the Amazon Resource Name (ARN) of the thing or thing group to target for the deployment, in the following format:
     + Thing: `arn:aws:iot:region:account-id:thing/thingName`
     + Thing group: `arn:aws:iot:region:account-id:thinggroup/thingGroupName`
   + In the `merge` field, replace *`device-fleet-name`* with the name of the edge device fleet that you created, and replace *`bucket-name`* with the name of the Amazon S3 bucket that is associated with your device fleet.
   + Replace the component versions for each component with the latest available version.

1. Run the following command to deploy the components on the device:

   ```
   aws greengrassv2 create-deployment \
       --cli-input-json file://path/to/deployment.json
   ```

The deployment can take several minutes to complete. In the next step, check the component log to verify that the deployment completed successfully and to view the inference results.

For more information about deploying components to individual devices or groups of devices, see [Deploy AWS IoT Greengrass components to devices](https://docs.aws.amazon.com/greengrass/v2/developerguide/manage-deployments.html).

# Deploy the Model Package Directly with SageMaker Edge Manager Deployment API
<a name="edge-deployment-plan-api"></a>

SageMaker Edge Manager provides a deployment API that you can use to deploy models to device targets without AWS IoT Greengrass. It is useful in situations where you want to update models independently of firmware updates or application deployment mechanisms. You can use the API to integrate your edge deployments into a CI/CD workflow to automatically deploy models once you have validated your model for accuracy. The API also has convenient rollback and staged rollout options for you to ensure models work well in a particular environment before wider rollout.

To use the Edge Manager deployment API, first compile and package your model. For information about how to compile and package your model, see [Prepare Your Model for Deployment](edge-getting-started-step2.md). The following sections show how to create edge deployments using the SageMaker API after you have compiled and packaged your models.

**Topics**
+ [Create an edge deployment plan](#create-edge-deployment-plan)
+ [Start the edge deployment](#start-edge-deployment-stage)
+ [Check the status of the deployment](#describe-edge-deployment-status)

## Create an edge deployment plan
<a name="create-edge-deployment-plan"></a>

You can create an edge deployment plan with the [CreateEdgeDeploymentPlan](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEdgeDeploymentPlan.html) API. The deployment plan can have multiple stages. You can configure each stage to roll out the deployment to a subset of edge devices (by percentage or by device name). You can also configure how rollout failures are handled at each stage.

The following code snippet shows how to create an edge deployment plan with one stage that deploys a compiled and packaged model to two specific edge devices:

```
import boto3

client = boto3.client("sagemaker")

client.create_edge_deployment_plan(
    EdgeDeploymentPlanName="edge-deployment-plan-name",
    DeviceFleetName="device-fleet-name",
    ModelConfigs=[
        {
            "EdgePackagingJobName": "edge-packaging-job-name",
            "ModelHandle": "model-handle"
        }
    ],
    Stages=[
        {
            "StageName": "stage-name",
            "DeviceSelectionConfig": {
                "DeviceSubsetType": "SELECTION",
                "DeviceNames": ["device-name-1", "device-name-2"]
            },
            "DeploymentConfig": {
                "FailureHandlingPolicy": "ROLLBACK_ON_FAILURE"
            }
        }
    ]
)
```

If you want to deploy the model to a percentage of devices in your fleet instead of specific devices, set the value of `DeviceSubsetType` to `"PERCENTAGE"` and replace `"DeviceNames": ["device-name-1", "device-name-2"]` with `"Percentage": desired-percentage` in the preceding example.

You can add stages after the deployment plan has been created with the [CreateEdgeDeploymentStage](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEdgeDeploymentStage.html) API, in case you want to roll out new stages after validating that your test rollout succeeded. For more information about deployment stages, see [DeploymentStage](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeploymentStage.html).
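For instance, the following is a minimal sketch of adding a percentage-based canary stage to an existing plan. The function name, plan name, stage name, and percentage are hypothetical placeholders, and `sagemaker_client` is a Boto3 SageMaker client such as the one created earlier:

```python
# Hypothetical sketch: add a percentage-based canary stage to an existing
# edge deployment plan. All names and values here are placeholders.
def add_canary_stage(sagemaker_client, plan_name, stage_name, percentage=10):
    stage = {
        "StageName": stage_name,
        "DeviceSelectionConfig": {
            "DeviceSubsetType": "PERCENTAGE",
            "Percentage": percentage,
        },
        "DeploymentConfig": {"FailureHandlingPolicy": "ROLLBACK_ON_FAILURE"},
    }
    # Append the new stage to the plan with the CreateEdgeDeploymentStage API.
    return sagemaker_client.create_edge_deployment_stage(
        EdgeDeploymentPlanName=plan_name,
        Stages=[stage],
    )
```

For example, `add_canary_stage(client, "edge-deployment-plan-name", "canary-stage")` would roll the model out to 10 percent of the fleet before a wider rollout.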

## Start the edge deployment
<a name="start-edge-deployment-stage"></a>

After creating the deployment plan and the deployment stages, you can start the deployment with the [StartEdgeDeploymentStage](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartEdgeDeploymentStage.html) API.

```
client.start_edge_deployment_stage(
    EdgeDeploymentPlanName="edge-deployment-plan-name",
    StageName="stage-name"
)
```

## Check the status of the deployment
<a name="describe-edge-deployment-status"></a>

You can check the status of the edge deployment with the [DescribeEdgeDeploymentPlan](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEdgeDeploymentPlan.html) API.

```
client.describe_edge_deployment_plan(
    EdgeDeploymentPlanName="edge-deployment-plan-name"
)
```
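Because the rollout is asynchronous, you may want to poll until every stage settles. The following is a minimal, hypothetical sketch; it deliberately takes a `get_stage_statuses` callable (for example, one that extracts per-stage status strings from the `describe_edge_deployment_plan` response) rather than assuming the exact response shape, and the status names shown are illustrative:

```python
import time

# Hypothetical polling helper for an asynchronous edge deployment.
# get_stage_statuses(plan_name) returns the current list of stage status
# strings; the "in progress" names checked below are illustrative.
def wait_for_deployment(get_stage_statuses, plan_name, timeout_s=600, poll_s=15):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        statuses = get_stage_statuses(plan_name)
        # Done once no stage is still pending or rolling out.
        if all(s not in ("INPROGRESS", "READYTODEPLOY") for s in statuses):
            return statuses
        time.sleep(poll_s)
    raise TimeoutError("deployment plan %s did not finish in %ss" % (plan_name, timeout_s))
```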

# Manage Model
<a name="edge-manage-model"></a>

The Edge Manager agent can load multiple models at a time and perform inference with loaded models on edge devices. The number of models the agent can load is determined by the available memory on the device. The agent validates the model signature and loads into memory all of the artifacts produced by the edge packaging job. This step requires all of the required certificates described in previous steps to be installed, along with the rest of the binary installation. If the model's signature cannot be validated, loading the model fails with an appropriate return code and reason.

The SageMaker Edge Manager agent provides a list of model management APIs that implement control plane and data plane APIs on edge devices. Along with this documentation, we recommend going through the sample client implementation, which shows canonical usage of the APIs described below.

The `proto` file is available as part of the release artifacts (inside the release tarball). In this documentation, we list and describe the usage of the APIs listed in this `proto` file.

**Note**  
There is a one-to-one mapping for these APIs in the Windows release, and sample code for an application implemented in C# is shared with the release artifacts for Windows. The instructions below are for running the agent as a standalone process, and apply to the release artifacts for Linux.

Extract the archive based on your OS. `VERSION` is broken into three components: `<MAJOR_VERSION>.<YYYY-MM-DD>-<SHA-7>`. See [Installing the Edge Manager agent](edge-device-fleet-manual.md#edge-device-fleet-installation) for information on how to obtain the release version (`<MAJOR_VERSION>`), the time stamp of the release artifact (`<YYYY-MM-DD>`), and the repository commit ID (`<SHA-7>`).

------
#### [ Linux ]

The TAR archive can be extracted with the following command:

```
tar -xvzf <VERSION>.tgz
```

------
#### [ Windows ]

The zip archive can be extracted with the UI or command:

```
unzip <VERSION>.zip
```

------

The release artifact hierarchy (after extracting the `tar/zip` archive) is shown below. The agent `proto` file is available under `api/`.

```
0.20201205.7ee4b0b
├── bin
│   ├── sagemaker_edge_agent_binary
│   └── sagemaker_edge_agent_client_example
└── docs
    ├── api
    │   └── agent.proto
    ├── attributions
    │   ├── agent.txt
    │   └── core.txt
    └── examples
        └── ipc_example
            ├── CMakeLists.txt
            ├── sagemaker_edge_client.cc
            ├── sagemaker_edge_client_example.cc
            ├── sagemaker_edge_client.hh
            ├── sagemaker_edge.proto
            ├── README.md
            ├── shm.cc
            ├── shm.hh
            └── street_small.bmp
```

**Topics**
+ [Load Model](#edge-manage-model-loadmodel)
+ [Unload Model](#edge-manage-model-unloadmodel)
+ [List Models](#edge-manage-model-listmodels)
+ [Describe Model](#edge-manage-model-describemodel)
+ [Capture Data](#edge-manage-model-capturedata)
+ [Get Capture Status](#edge-manage-model-getcapturedata)
+ [Predict](#edge-manage-model-predict)

## Load Model
<a name="edge-manage-model-loadmodel"></a>

The Edge Manager agent supports loading multiple models. This API validates the model signature and loads into memory all of the artifacts produced by the `EdgePackagingJob` operation. This step requires all of the required certificates to be installed, along with the rest of the agent binary installation. If the model's signature cannot be validated, this step fails with an appropriate return code and error messages in the log.

```
// perform load for a model
// Note:
// 1. currently only local filesystem paths are supported for loading models.
// 2. multiple models can be loaded at the same time, as limited by available device memory
// 3. users are required to unload any loaded model to load another model.
// Status Codes:
// 1. OK - load is successful
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 4. NOT_FOUND - model doesn't exist at the url
// 5. ALREADY_EXISTS - model with the same name is already loaded
// 6. RESOURCE_EXHAUSTED - memory is not available to load the model
// 7. FAILED_PRECONDITION - model is not compiled for the machine.
//
rpc LoadModel(LoadModelRequest) returns (LoadModelResponse);
```

------
#### [ Input ]

```
//
// request for LoadModel rpc call
//
message LoadModelRequest {
  string url = 1;
  string name = 2;  // Model name needs to match regex "^[a-zA-Z0-9](-*[a-zA-Z0-9])*$"
}
```

------
#### [ Output ]

```
//
//
// response for LoadModel rpc call
//
message LoadModelResponse {
  Model model = 1;
}

//
// Model represents the metadata of a model
//  url - url representing the path of the model
//  name - name of model
//  input_tensor_metadatas - TensorMetadata array for the input tensors
//  output_tensor_metadatas - TensorMetadata array for the output tensors
//
// Note:
//  1. input and output tensor metadata could be empty for dynamic models.
//
message Model {
  string url = 1;
  string name = 2;
  repeated TensorMetadata input_tensor_metadatas = 3;
  repeated TensorMetadata output_tensor_metadatas = 4;
}
```

------
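As the comment in `LoadModelRequest` above notes, model names must match the regex `^[a-zA-Z0-9](-*[a-zA-Z0-9])*$`. A minimal client-side sketch of validating a name before calling `LoadModel` (the helper name is hypothetical):

```python
import re

# Regex copied from the LoadModelRequest comment in agent.proto.
_MODEL_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

def is_valid_model_name(name):
    # Alphanumeric segments optionally separated by hyphens; a name
    # cannot start or end with a hyphen.
    return _MODEL_NAME_RE.fullmatch(name) is not None

print(is_valid_model_name("yolov3-demo"))  # True
print(is_valid_model_name("-bad-name"))    # False
```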

## Unload Model
<a name="edge-manage-model-unloadmodel"></a>

Unloads a previously loaded model. The model is identified by the alias provided during `LoadModel`. If the alias is not found or the model is not loaded, an error is returned.

```
//
// perform unload for a model
// Status Codes:
// 1. OK - unload is successful
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 4. NOT_FOUND - model doesn't exist
//
rpc UnLoadModel(UnLoadModelRequest) returns (UnLoadModelResponse);
```

------
#### [ Input ]

```
//
// request for UnLoadModel rpc call
//
message UnLoadModelRequest {
 string name = 1; // Model name needs to match regex "^[a-zA-Z0-9](-*[a-zA-Z0-9])*$"
}
```

------
#### [ Output ]

```
//
// response for UnLoadModel rpc call
//
message UnLoadModelResponse {}
```

------

## List Models
<a name="edge-manage-model-listmodels"></a>

Lists all the loaded models and their aliases.

```
//
// lists the loaded models
// Status Codes:
// 1. OK - list is successful
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
//
rpc ListModels(ListModelsRequest) returns (ListModelsResponse);
```

------
#### [ Input ]

```
//
// request for ListModels rpc call
//
message ListModelsRequest {}
```

------
#### [ Output ]

```
//
// response for ListModels rpc call
//
message ListModelsResponse {
 repeated Model models = 1;
}
```

------

## Describe Model
<a name="edge-manage-model-describemodel"></a>

Describes a model that is loaded on the agent.

```
//
// Status Codes:
// 1. OK - load is successful
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 4. NOT_FOUND - model doesn't exist at the url
//
rpc DescribeModel(DescribeModelRequest) returns (DescribeModelResponse);
```

------
#### [ Input ]

```
//
// request for DescribeModel rpc call
//
message DescribeModelRequest {
  string name = 1;
}
```

------
#### [ Output ]

```
//
// response for DescribeModel rpc call
//
message DescribeModelResponse {
  Model model = 1;
}
```

------

## Capture Data
<a name="edge-manage-model-capturedata"></a>

Allows the client application to capture input and output tensors in an Amazon S3 bucket, and optionally the auxiliary data. The client application is expected to pass a unique capture ID along with each call to this API. This ID can later be used to query the status of the capture.

```
//
// allows users to capture input and output tensors along with auxiliary data.
// Status Codes:
// 1. OK - data capture successfully initiated
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 5. ALREADY_EXISTS - capture initiated for the given capture_id
// 6. RESOURCE_EXHAUSTED - buffer is full cannot accept any more requests.
// 7. OUT_OF_RANGE - timestamp is in the future.
// 8. INVALID_ARGUMENT - capture_id is not of expected format.
//
rpc CaptureData(CaptureDataRequest) returns (CaptureDataResponse);
```

------
#### [ Input ]

```
enum Encoding {
 CSV = 0;
 JSON = 1;
 NONE = 2;
 BASE64 = 3;
}

//
// AuxilaryData represents a payload of extra data to be captured along with inputs and outputs of inference
// encoding - supports the encoding of the data
// data - represents the data of shared memory, this could be passed in two ways:
// a. send across the raw bytes of the multi-dimensional tensor array
// b. send a SharedMemoryHandle which contains the posix shared memory segment id and
// offset in bytes to location of multi-dimensional tensor array.
//
message AuxilaryData {
 string name = 1;
 Encoding encoding = 2;
 oneof data {
 bytes byte_data = 3;
 SharedMemoryHandle shared_memory_handle = 4;
 }
}

//
// Tensor represents a tensor, encoded as contiguous multi-dimensional array.
// tensor_metadata - represents metadata of the shared memory segment
// data_or_handle - represents the data of shared memory, this could be passed in two ways:
// a. send across the raw bytes of the multi-dimensional tensor array
// b. send a SharedMemoryHandle which contains the posix shared memory segment
// id and offset in bytes to location of multi-dimensional tensor array.
//
message Tensor {
 TensorMetadata tensor_metadata = 1; //optional in the predict request
 oneof data {
 bytes byte_data = 4;
 // will only be used for input tensors
 SharedMemoryHandle shared_memory_handle = 5;
 }
}

//
// request for CaptureData rpc call
//
message CaptureDataRequest {
 string model_name = 1;
 string capture_id = 2; //uuid string
 Timestamp inference_timestamp = 3;
 repeated Tensor input_tensors = 4;
 repeated Tensor output_tensors = 5;
 repeated AuxilaryData inputs = 6;
 repeated AuxilaryData outputs = 7;
}
```

------
#### [ Output ]

```
//
// response for CaptureData rpc call
//
message CaptureDataResponse {}
```

------
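As noted above, each `CaptureDataRequest` carries a client-supplied `capture_id` in UUID string form. A minimal sketch of generating one (the helper name is hypothetical):

```python
import uuid

# Generate a unique, UUID-formatted capture ID for a CaptureData call.
def new_capture_id():
    return str(uuid.uuid4())

capture_id = new_capture_id()
print(capture_id)
```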

## Get Capture Status
<a name="edge-manage-model-getcapturedata"></a>

Depending on the models loaded, the input and output tensors can be large for many edge devices, and capturing them to the cloud can be time consuming. Therefore, `CaptureData()` is implemented as an asynchronous operation. A capture ID is a unique identifier that the client provides during the capture data call; this ID can be used to query the status of the asynchronous call.

```
//
// allows users to query status of capture data operation
// Status Codes:
// 1. OK - data capture successfully initiated
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 4. NOT_FOUND - given capture id doesn't exist.
//
rpc GetCaptureDataStatus(GetCaptureDataStatusRequest) returns (GetCaptureDataStatusResponse);
```

------
#### [ Input ]

```
//
// request for GetCaptureDataStatus rpc call
//
message GetCaptureDataStatusRequest {
  string capture_id = 1;
}
```

------
#### [ Output ]

```
enum CaptureDataStatus {
  FAILURE = 0;
  SUCCESS = 1;
  IN_PROGRESS = 2;
  NOT_FOUND = 3;
}

//
// response for GetCaptureDataStatus rpc call
//
message GetCaptureDataStatusResponse {
  CaptureDataStatus status = 1;
}
```

------
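Because `CaptureData()` is asynchronous, a client typically polls `GetCaptureDataStatus` until the capture settles. The following is a minimal, hypothetical polling sketch; the injected `get_status` callable is an assumption, and with the gRPC stubs generated earlier it could wrap `agent_client.GetCaptureDataStatus`:

```python
import time

def wait_for_capture(get_status, capture_id, timeout_s=60, poll_s=1.0):
    # get_status(capture_id) returns one of the CaptureDataStatus names
    # above: "FAILURE", "SUCCESS", "IN_PROGRESS", or "NOT_FOUND".
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(capture_id)
        if status != "IN_PROGRESS":
            return status
        time.sleep(poll_s)
    raise TimeoutError("capture %s still in progress after %ss" % (capture_id, timeout_s))
```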

## Predict
<a name="edge-manage-model-predict"></a>

The `Predict` API performs inference on a previously loaded model. It accepts a request in the form of a tensor that is fed directly into the neural network, and the output is the output tensor (or scalar) from the model. This is a blocking call.

```
//
// perform inference on a model.
//
// Note:
// 1. users can choose to send the tensor data in the protobuf message or
// through a shared memory segment on a per tensor basis, the Predict
// method will handle the decode transparently.
// 2. serializing large tensors into the protobuf message can be quite expensive,
// based on our measurements it is recommended to use shared memory for
// tensors larger than 256KB.
// 3. SMEdge IPC server will not use shared memory for returning output tensors,
// i.e., the output tensor data will always be sent in byte form encoded
// in the tensors of PredictResponse.
// 4. currently SMEdge IPC server cannot handle concurrent predict calls, all
// these calls will be serialized under the hood. this shall be addressed
// in a later release.
// Status Codes:
// 1. OK - prediction is successful
// 2. UNKNOWN - unknown error has occurred
// 3. INTERNAL - an internal error has occurred
// 4. NOT_FOUND - when model not found
// 5. INVALID_ARGUMENT - when tensor types mismatch
//
rpc Predict(PredictRequest) returns (PredictResponse);
```

------
#### [ Input ]

```
//
// request for Predict rpc call
//
message PredictRequest {
  string name = 1;
  repeated Tensor tensors = 2;
}

//
// Tensor represents a tensor, encoded as contiguous multi-dimensional array.
//    tensor_metadata - represents metadata of the shared memory segment
//    data_or_handle - represents the data of shared memory, this could be passed in two ways:
//                        a. send across the raw bytes of the multi-dimensional tensor array
//                        b. send a SharedMemoryHandle which contains the posix shared memory segment
//                            id and offset in bytes to location of multi-dimensional tensor array.
//
message Tensor {
  TensorMetadata tensor_metadata = 1; //optional in the predict request
  oneof data {
    bytes byte_data = 4;
    // will only be used for input tensors
    SharedMemoryHandle shared_memory_handle = 5;
  }
}

//
// TensorMetadata represents the metadata for a tensor
//    name - name of the tensor
//    data_type  - data type of the tensor
//    shape - array of dimensions of the tensor
//
message TensorMetadata {
  string name = 1;
  DataType data_type = 2;
  repeated int32 shape = 3;
}

//
// SharedMemoryHandle represents a posix shared memory segment
//    offset - offset in bytes from the start of the shared memory segment.
//    segment_id - shared memory segment id corresponding to the posix shared memory segment.
//    size - size in bytes of shared memory segment to use from the offset position.
//
message SharedMemoryHandle {
  uint64 size = 1;
  uint64 offset = 2;
  uint64 segment_id = 3;
}
```

------
#### [ Output ]

**Note**  
The `PredictResponse` only returns `Tensors` and not `SharedMemoryHandle`.

```
//
// response for Predict rpc call
//
message PredictResponse {
  repeated Tensor tensors = 1;
}
```

------
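The notes in the `Predict` definition above recommend a shared memory segment for input tensors larger than 256 KB. A small Python sketch of that decision (a hypothetical helper for illustration, not part of the Edge Agent API):

```python
# Threshold from note 2 in the Predict API comments above: inline protobuf
# serialization gets expensive for tensors larger than about 256 KB.
SHM_THRESHOLD_BYTES = 256 * 1024

def prefer_shared_memory(shape, dtype_size_bytes):
    """Return True if an input tensor of the given shape and element size
    should be passed via a SharedMemoryHandle instead of inline byte_data.
    """
    n_bytes = dtype_size_bytes
    for dim in shape:
        n_bytes *= dim
    return n_bytes > SHM_THRESHOLD_BYTES
```

For example, a float32 tensor of shape `[1, 3, 224, 224]` is about 588 KB and crosses the threshold, while a float32 `[1, 28, 28, 1]` tensor (about 3 KB) can be sent inline.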

# SageMaker Edge Manager end of life
<a name="edge-eol"></a>

 Starting April 26, 2024, you can no longer access Amazon SageMaker Edge Manager through the AWS Management Console, create edge packaging jobs, or manage edge device fleets. 

## FAQs
<a name="edge-eol-faqs"></a>

 Use the following sections to get answers to commonly asked questions about the SageMaker Edge Manager end of life (EOL). 

### Q: What happens to my Amazon SageMaker Edge Manager after the EOL date?
<a name="edge-eol-faqs-1"></a>

 A: After April 26, 2024, all references to edge packaging jobs, devices, and device fleets are deleted from the Edge Manager service. You can no longer discover or access the Edge Manager service from the AWS console, and applications that call the Edge Manager APIs no longer work. 

### Q: Will I be billed for Edge Manager resources remaining in my account after the EOL date?
<a name="edge-eol-faqs-2"></a>

 A: Resources created by Edge Manager, such as edge packages inside Amazon S3 buckets, AWS IoT things, and AWS IAM roles, continue to exist on their respective services after April 26, 2024. To avoid being billed after Edge Manager is no longer supported, delete your resources. For more information on deleting your resources, see [Delete Edge Manager resources](#edge-eol-delete-resources). 

### Q: How do I delete my Amazon SageMaker Edge Manager resources?
<a name="edge-eol-faqs-3"></a>

 A: Delete the resources that Edge Manager created on their respective services, such as edge packages inside Amazon S3 buckets, AWS IoT things, and AWS IAM roles. For step-by-step instructions, see [Delete Edge Manager resources](#edge-eol-delete-resources). 

### Q: How can I continue deploying models on the edge?
<a name="edge-eol-faqs-4"></a>

 A: We suggest you try one of the following machine learning tools. For a cross-platform edge runtime, use [ONNX](https://onnxruntime.ai/). ONNX is a popular, well-maintained open-source solution that translates your models into instructions that many types of hardware can run, and is compatible with the latest ML frameworks. ONNX can be integrated into your SageMaker AI workflows as an automated step for your edge deployments. 

 For edge deployments and monitoring, use AWS IoT Greengrass V2. AWS IoT Greengrass V2 has an extensible packaging and deployment mechanism that can fit models and applications at the edge. You can use the built-in MQTT channels to send model telemetry back for Amazon SageMaker Model Monitor or use the built-in permissions system to send data captured from the model back to Amazon Simple Storage Service (Amazon S3). If you don't or can't use AWS IoT Greengrass V2, we suggest using MQTT and IoT Jobs (C/C++ library) to create a lightweight OTA mechanism to deliver models. 

 We have prepared [sample code available at this GitHub repository](https://github.com/aws-samples/ml-edge-getting-started) to help you transition to these suggested tools. 

## Delete Edge Manager resources
<a name="edge-eol-delete-resources"></a>

 Resources created by Edge Manager continue to exist after April 26, 2024. To avoid billing, delete these resources. 

 To delete AWS IoT Greengrass resources, do the following: 

1.  In the AWS IoT Core console, choose **Greengrass devices** under **Manage**. 

1.  Choose **Components**. 

1.  Under **My components**, components created by Edge Manager are in the format *SageMakerEdge (EdgePackagingJobName)*. Select the component you want to delete. 

1.  Then choose **Delete version**. 

 To delete an AWS IoT role alias, do the following: 

1.  In the AWS IoT Core console, choose **Security** under **Manage**. 

1.  Choose **Role aliases**. 

1.  Role aliases created by Edge Manager are in the format *SageMakerEdge-DeviceFleetName*. Select the role you want to delete. 

1.  Choose **Delete**. 

 To delete packaging jobs in Amazon S3 buckets, do the following: 

1.  In the SageMaker AI console, choose **Edge Inference**. 

1.  Choose **Edge packaging jobs**. 

1.  Select one of the edge packaging jobs. Copy the Amazon S3 URI under **Model artifact** in the **Output configuration** section. 

1.  In the Amazon S3 console, navigate to the corresponding location, and check if you need to delete the model artifact. To delete the model artifact, select the Amazon S3 object and choose **Delete**. 

# Model performance optimization with SageMaker Neo
<a name="neo"></a>

Neo is a capability of Amazon SageMaker AI that enables machine learning models to train once and run anywhere in the cloud and at the edge. 

If you are a first time user of SageMaker Neo, we recommend you check out the [Getting Started with Edge Devices](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-getting-started-edge.html) section to get step-by-step instructions on how to compile and deploy to an edge device. 

## What is SageMaker Neo?
<a name="neo-what-it-is"></a>

Generally, optimizing machine learning models for inference on multiple platforms is difficult because you need to hand-tune models for the specific hardware and software configuration of each platform. If you want to get optimal performance for a given workload, you need to know the hardware architecture, instruction set, memory access patterns, and input data shapes, among other factors. For traditional software development, tools such as compilers and profilers simplify the process. For machine learning, most tools are specific to the framework or to the hardware. This forces you into a manual trial-and-error process that is unreliable and unproductive.

Neo automatically optimizes Gluon, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, and ONNX models for inference on Android, Linux, and Windows machines based on processors from Ambarella, ARM, Intel, Nvidia, NXP, Qualcomm, Texas Instruments, and Xilinx. Neo is tested with computer vision models available in the model zoos across the frameworks. SageMaker Neo supports compilation and deployment for two main platforms: cloud instances (including Inferentia) and edge devices.

For more information about supported frameworks and cloud instance types you can deploy to, see [Supported Instance Types and Frameworks](neo-supported-cloud.md) for cloud instances.

For more information about supported frameworks, edge devices, operating systems, chip architectures, and common machine learning models tested by SageMaker AI Neo for edge devices, see [Supported Frameworks, Devices, Systems, and Architectures](neo-supported-devices-edge.md) for edge devices.

## How it Works
<a name="neo-how-it-works"></a>

Neo consists of a compiler and a runtime. First, the Neo compilation API reads models exported from various frameworks. It converts the framework-specific functions and operations into a framework-agnostic intermediate representation. Next, it performs a series of optimizations. Then it generates binary code for the optimized operations, writes them to a shared object library, and saves the model definition and parameters into separate files. Neo also provides a runtime for each target platform that loads and executes the compiled model.

![\[How Neo works in SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/neo_how_it_works.png)


You can create a Neo compilation job from the SageMaker AI console, the AWS Command Line Interface (AWS CLI), a Python notebook, or the SageMaker AI SDK. For information on how to compile a model, see [Model Compilation with Neo](neo-job-compilation.md). With a few CLI commands, an API invocation, or a few clicks, you can convert a model for your chosen platform. You can then quickly deploy the model to a SageMaker AI endpoint or on an AWS IoT Greengrass device.

Neo can optimize models with parameters either in FP32 or quantized to INT8 or FP16 bit-width.

**Topics**
+ [What is SageMaker Neo?](#neo-what-it-is)
+ [How it Works](#neo-how-it-works)
+ [Model Compilation with Neo](neo-job-compilation.md)
+ [Cloud Instances](neo-cloud-instances.md)
+ [Edge Devices](neo-edge-devices.md)
+ [Troubleshoot Errors](neo-troubleshooting.md)

# Model Compilation with Neo
<a name="neo-job-compilation"></a>

This section shows how to create, describe, stop, and list compilation jobs. The following options are available in Amazon SageMaker Neo for managing the compilation jobs for machine learning models: the AWS Command Line Interface, the Amazon SageMaker AI console, or the Amazon SageMaker SDK. 

**Topics**
+ [Prepare Model for Compilation](neo-compilation-preparing-model.md)
+ [Compile a Model (AWS Command Line Interface)](neo-job-compilation-cli.md)
+ [Compile a Model (Amazon SageMaker AI Console)](neo-job-compilation-console.md)
+ [Compile a Model (Amazon SageMaker AI SDK)](neo-job-compilation-sagemaker-sdk.md)

# Prepare Model for Compilation
<a name="neo-compilation-preparing-model"></a>

SageMaker Neo requires machine learning models to satisfy specific input data shapes. The input shape required for compilation depends on the deep learning framework you use. After your model input shape is correctly formatted, save your model according to the following requirements. After you have a saved model, compress the model artifacts.

**Topics**
+ [What input data shapes does SageMaker Neo expect?](#neo-job-compilation-expected-inputs)
+ [Saving Models for SageMaker Neo](#neo-job-compilation-how-to-save-model)

## What input data shapes does SageMaker Neo expect?
<a name="neo-job-compilation-expected-inputs"></a>

Before you compile your model, make sure it is formatted correctly. Neo expects the name and shape of each expected data input for your trained model, in JSON format or list format. The expected inputs are framework specific. 

Below are the input shapes SageMaker Neo expects:

### Keras
<a name="collapsible-section-1"></a>

Specify the name and shape (NCHW format) of the expected data inputs using a dictionary format for your trained model. Note that while Keras model artifacts should be uploaded in NHWC (channel-last) format, DataInputConfig should be specified in NCHW (channel-first) format. The dictionary formats required are as follows: 
+ For one input: `{'input_1':[1,3,224,224]}`
+ For two inputs: `{'input_1': [1,3,224,224], 'input_2':[1,3,224,224]}`
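As noted above, Keras artifacts are saved channel-last (NHWC) while `DataInputConfig` must be channel-first (NCHW). A minimal sketch of the shape conversion (a hypothetical helper, not part of any SDK):

```python
def nhwc_to_nchw(shape):
    """Convert a 4-D channel-last shape [N, H, W, C] to the channel-first
    [N, C, H, W] form expected by DataInputConfig for Keras models."""
    n, h, w, c = shape
    return [n, c, h, w]
```

For example, `nhwc_to_nchw([1, 224, 224, 3])` yields `[1, 3, 224, 224]`, the form used in the dictionary examples above.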

### MXNet/ONNX
<a name="collapsible-section-2"></a>

Specify the name and shape (NCHW format) of the expected data inputs using a dictionary format for your trained model. The dictionary formats required are as follows:
+ For one input: `{'data':[1,3,1024,1024]}`
+ For two inputs: `{'var1': [1,1,28,28], 'var2':[1,1,28,28]}`

### PyTorch
<a name="collapsible-section-3"></a>

For a PyTorch model, you don't need to provide the name and shape of the expected data inputs if you meet both of the following conditions:
+ You created your model definition file by using PyTorch 2.0 or later. For more information about how to create the definition file, see the [PyTorch](#how-to-save-pytorch) section under *Saving Models for SageMaker Neo*.
+ You are compiling your model for a cloud instance. For more information about the instance types that SageMaker Neo supports, see [Supported Instance Types and Frameworks](neo-supported-cloud.md).

If you meet these conditions, SageMaker Neo gets the input configuration from the model definition file (.pt or .pth) that you create with PyTorch.

Otherwise, you must do the following:

Specify the name and shape (NCHW format) of the expected data inputs using a dictionary format for your trained model. Alternatively, you can specify the shape only using a list format. The dictionary formats required are as follows:
+ For one input in dictionary format: `{'input0':[1,3,224,224]}`
+ For one input in list format: `[[1,3,224,224]]`
+ For two inputs in dictionary format: `{'input0':[1,3,224,224], 'input1':[1,3,224,224]}`
+ For two inputs in list format: `[[1,3,224,224], [1,3,224,224]]`

### TensorFlow
<a name="collapsible-section-4"></a>

Specify the name and shape (NHWC format) of the expected data inputs using a dictionary format for your trained model. The dictionary formats required are as follows:
+ For one input: `{'input':[1,1024,1024,3]}`
+ For two inputs: `{'data1': [1,28,28,1], 'data2':[1,28,28,1]}`

### TFLite
<a name="collapsible-section-5"></a>

Specify the name and shape (NHWC format) of the expected data inputs using a dictionary format for your trained model. The dictionary formats required are as follows:
+ For one input: `{'input':[1,224,224,3]}`

**Note**  
SageMaker Neo only supports TensorFlow Lite for edge device targets. For a list of supported SageMaker Neo edge device targets, see the SageMaker Neo [Devices](neo-supported-devices-edge-devices.md#neo-supported-edge-devices) page. For a list of supported SageMaker Neo cloud instance targets, see the SageMaker Neo [Supported Instance Types and Frameworks](neo-supported-cloud.md) page.

### XGBoost
<a name="collapsible-section-6"></a>

An input data name and shape are not needed.
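When you pass any of the shapes above to a compilation job, the `DataInputConfig` field takes a JSON string. The following sketch (a hypothetical helper, not part of the SageMaker SDK) serializes the per-framework dictionaries shown above; note that JSON uses double quotes where the examples above show Python-style single quotes:

```python
import json

def to_data_input_config(input_shapes):
    """Serialize a {tensor_name: shape} mapping into the JSON string form
    used for the DataInputConfig field of a compilation job."""
    return json.dumps(input_shapes)
```

For example, `to_data_input_config({'input_1': [1, 3, 224, 224]})` produces `'{"input_1": [1, 3, 224, 224]}'`.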

## Saving Models for SageMaker Neo
<a name="neo-job-compilation-how-to-save-model"></a>

The following code examples show how to save your model to make it compatible with Neo. Models must be packaged as compressed tar files (`*.tar.gz`).

### Keras
<a name="how-to-save-tf-keras"></a>

Keras models require one model definition file (`.h5`).

There are two options for saving your Keras model to make it compatible with SageMaker Neo:

1. Export to `.h5` format with `model.save("<model-name>", save_format="h5")`.

1. Freeze the `SavedModel` after exporting.

Below is an example of how to export a `tf.keras` model as a frozen graph (option two):

```
import os
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras import backend

tf.keras.backend.set_learning_phase(0)
model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
model.summary()

# Save as a SavedModel
export_dir = 'saved_model/'
model.save(export_dir, save_format='tf')

# Freeze saved model
input_node_names = [inp.name.split(":")[0] for inp in model.inputs]
output_node_names = [output.name.split(":")[0] for output in model.outputs]
print("Input names: ", input_node_names)
with tf.Session() as sess:
    loaded = tf.saved_model.load(sess, export_dir=export_dir, tags=["serve"]) 
    frozen_graph = tf.graph_util.convert_variables_to_constants(sess,
                                                                sess.graph.as_graph_def(),
                                                                output_node_names)
    tf.io.write_graph(graph_or_graph_def=frozen_graph, logdir=".", name="frozen_graph.pb", as_text=False)

import tarfile
tar = tarfile.open("frozen_graph.tar.gz", "w:gz")
tar.add("frozen_graph.pb")
tar.close()
```

**Warning**  
Do not export your model with the `SavedModel` class using `model.save(<path>, save_format='tf')`. This format is suitable for training, but it is not suitable for inference.

### MXNet
<a name="how-to-save-mxnet"></a>

MXNet models must be saved as a single symbol file (`*-symbol.json`) and a single parameter file (`*.params`).

------
#### [ Gluon Models ]

Define the neural network using the `HybridSequential` class. This runs the code in the style of symbolic programming (as opposed to imperative programming).

```
from mxnet import nd, sym
from mxnet.gluon import nn

def get_net():
    net = nn.HybridSequential()  # Here we use the class HybridSequential.
    net.add(nn.Dense(256, activation='relu'),
            nn.Dense(128, activation='relu'),
            nn.Dense(2))
    net.initialize()
    return net

# Define an input to compute a forward calculation. 
x = nd.random.normal(shape=(1, 512))
net = get_net()

# During the forward calculation, the neural network will automatically infer
# the shape of the weight parameters of all the layers based on the shape of
# the input.
net(x)
                        
# hybridize model
net.hybridize()
net(x)

# export model
net.export('<model_name>') # this will create model-symbol.json and model-0000.params files

import tarfile
tar = tarfile.open("<model_name>.tar.gz", "w:gz")
for name in ["<model_name>-0000.params", "<model_name>-symbol.json"]:
    tar.add(name)
tar.close()
```

For more information about hybridizing models, see the [MXNet hybridize documentation](https://mxnet.apache.org/versions/1.7.0/api/python/docs/tutorials/packages/gluon/blocks/hybridize.html).

------
#### [ Gluon Model Zoo (GluonCV) ]

GluonCV model zoo models come pre-hybridized, so you can simply export them.

```
import numpy as np
import mxnet as mx
import gluoncv as gcv
from gluoncv.utils import export_block
import tarfile

net = gcv.model_zoo.get_model('<model_name>', pretrained=True) # For example, choose <model_name> as resnet18_v1
export_block('<model_name>', net, preprocess=True, layout='HWC')

tar = tarfile.open("<model_name>.tar.gz", "w:gz")

for name in ["<model_name>-0000.params", "<model_name>-symbol.json"]:
    tar.add(name)
tar.close()
```

------
#### [ Non Gluon Models ]

All non-Gluon models are saved to disk as `*-symbol.json` and `*.params` files, so they are already in the correct format for Neo.

```
# Pass the following 3 parameters: sym, args, aux
mx.model.save_checkpoint('<model_name>',0,sym,args,aux) # this will create <model_name>-symbol.json and <model_name>-0000.params files

import tarfile
tar = tarfile.open("<model_name>.tar.gz", "w:gz")

for name in ["<model_name>-0000.params", "<model_name>-symbol.json"]:
    tar.add(name)
tar.close()
```

------

### PyTorch
<a name="how-to-save-pytorch"></a>

PyTorch models must be saved as a definition file (`.pt` or `.pth`) with input datatype of `float32`.

To save your model, use the `torch.jit.trace` method followed by the `torch.save` method. This process saves an object to a disk file and by default uses Python pickle (`pickle_module=pickle`) to save the objects and some metadata. Next, convert the saved model to a compressed tar file.

```
import torchvision
import torch

model = torchvision.models.resnet18(pretrained=True)
model.eval()
inp = torch.rand(1, 3, 224, 224)
model_trace = torch.jit.trace(model, inp)

# Save your model. The following code saves it with the .pth file extension
model_trace.save('model.pth')

# Save as a compressed tar file. The with statement closes the
# archive automatically, so no explicit close() call is needed.
import tarfile
with tarfile.open('model.tar.gz', 'w:gz') as f:
    f.add('model.pth')
```

If you save your model with PyTorch 2.0 or later, SageMaker Neo derives the input configuration for the model (the name and shape for its input) from the definition file. In that case, you don't need to specify the data input configuration to SageMaker AI when you compile the model.

If you want to prevent SageMaker Neo from deriving the input configuration, you can set the `_store_inputs` parameter of `torch.jit.trace` to `False`. If you do this, you must specify the data input configuration to SageMaker AI when you compile the model.

For more information about the `torch.jit.trace` method, see [TORCH.JIT.TRACE](https://pytorch.org/docs/stable/generated/torch.jit.trace.html#torch.jit.trace) in the PyTorch documentation.

### TensorFlow
<a name="how-to-save-tf"></a>

TensorFlow requires one `.pb` or `.pbtxt` file and a directory that contains the model variables. For frozen models, only one `.pb` or `.pbtxt` file is required.

The following code example shows how to use the tar Linux command to compress your model. Run the following in your terminal or in a Jupyter notebook (if you use a Jupyter notebook, insert the `!` magic command at the beginning of the statement):

```
# Download SSD_Mobilenet trained model
!wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz

# unzip the compressed tar file
!tar xvf ssd_mobilenet_v2_coco_2018_03_29.tar.gz

# Compress the tar file and save it in a directory called 'model.tar.gz'
!tar czvf model.tar.gz ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb
```

The command flags used in this example accomplish the following:
+ `c`: Create an archive
+ `z`: Compress the archive with gzip
+ `v`: Display archive progress
+ `f`: Specify the filename of the archive
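If you prefer to stay in Python, the same packaging can be sketched with the standard `tarfile` module, equivalent to the `tar czvf` command above (the helper name is illustrative only):

```python
import tarfile

def package_model(archive_path, *artifact_paths):
    """Create a gzip-compressed tar archive from the given model artifact
    files, mirroring `tar czvf` in the shell example above."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in artifact_paths:
            # arcname keeps only the file name, so the archive layout
            # does not depend on where the artifacts live locally.
            tar.add(path, arcname=path.split("/")[-1])
    return archive_path
```

For example, `package_model("model.tar.gz", "ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb")` produces a `model.tar.gz` containing `frozen_inference_graph.pb`.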

### Built-In Estimators
<a name="how-to-save-built-in"></a>

Built-in estimators are made by either framework-specific containers or algorithm-specific containers. Estimator objects for both built-in algorithms and framework-specific estimators save the model in the correct format for you when you train the model using the built-in `.fit` method.

For example, you can use `sagemaker.tensorflow.TensorFlow` to define a TensorFlow estimator:

```
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='mnist.py',
                        role=role,  #param role can be arn of a sagemaker execution role
                        framework_version='1.15.3',
                        py_version='py3',
                        training_steps=1000, 
                        evaluation_steps=100,
                        instance_count=2,
                        instance_type='ml.c4.xlarge')
```

Then train the model with the built-in `.fit` method:

```
estimator.fit(inputs)
```

Finally, compile the model with the built-in `compile_model` method:

```
# Specify output path of the compiled model
output_path = '/'.join(estimator.output_path.split('/')[:-1])

# Compile model
optimized_estimator = estimator.compile_model(target_instance_family='ml_c5', 
                              input_shape={'data':[1, 784]},  # Batch size 1, 784 input features.
                              output_path=output_path,
                              framework='tensorflow', framework_version='1.15.3')
```

You can also use the `sagemaker.estimator.Estimator` class to initialize an estimator object for training and compiling a built-in algorithm with the `compile_model` method from the SageMaker Python SDK:

```
import sagemaker
from sagemaker.image_uris import retrieve
sagemaker_session = sagemaker.Session()
aws_region = sagemaker_session.boto_region_name

# Specify built-in algorithm training image
training_image = retrieve(framework='image-classification', 
                          region=aws_region, image_scope='training')


# Create estimator object for training
estimator = sagemaker.estimator.Estimator(image_uri=training_image,
                                          role=role,  #param role can be arn of a sagemaker execution role
                                          instance_count=1,
                                          instance_type='ml.p3.8xlarge',
                                          volume_size = 50,
                                          max_run = 360000,
                                          input_mode= 'File',
                                          output_path=s3_training_output_location,
                                          base_job_name='image-classification-training'
                                          )
                                          
# Setup the input data_channels to be used later for training.                                          
train_data = sagemaker.inputs.TrainingInput(s3_training_data_location,
                                            content_type='application/x-recordio',
                                            s3_data_type='S3Prefix')
validation_data = sagemaker.inputs.TrainingInput(s3_validation_data_location,
                                                content_type='application/x-recordio',
                                                s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}


# Train model
estimator.fit(inputs=data_channels, logs=True)

# Compile model with Neo                                                                                  
optimized_estimator = estimator.compile_model(target_instance_family='ml_c5',
                                          input_shape={'data':[1, 3, 224, 224], 'softmax_label':[1]},
                                          output_path=s3_compilation_output_location,
                                          framework='mxnet',
                                          framework_version='1.7')
```

For more information about compiling models with the SageMaker Python SDK, see [Compile a Model (Amazon SageMaker AI SDK)](neo-job-compilation-sagemaker-sdk.md).

# Compile a Model (AWS Command Line Interface)
<a name="neo-job-compilation-cli"></a>

This section shows how to manage Amazon SageMaker Neo compilation jobs for machine learning models using the AWS Command Line Interface (AWS CLI). You can create, describe, stop, and list compilation jobs. 

1. Create a Compilation Job

   With the [CreateCompilationJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCompilationJob.html) API operation, you can specify the data input format, the S3 bucket in which to store your model, the S3 bucket to which to write the compiled model, and the target hardware device or platform.

   The following table demonstrates how to configure `CreateCompilationJob` API based on whether your target is a device or a platform.

------
#### [ Device Example ]

   ```
   {
       "CompilationJobName": "neo-compilation-job-demo",
       "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
       "InputConfig": {
           "S3Uri": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/train",
           "DataInputConfig":  "{'data': [1,3,1024,1024]}",
           "Framework": "MXNET"
       },
       "OutputConfig": {
           "S3OutputLocation": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/compile",
           # A target device specification example for a ml_c5 instance family
           "TargetDevice": "ml_c5"
       },
       "StoppingCondition": {
           "MaxRuntimeInSeconds": 300
       }
   }
   ```

   You can optionally specify the framework version you used with the [FrameworkVersion](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InputConfig.html#sagemaker-Type-InputConfig-FrameworkVersion) field if you used the PyTorch framework to train your model and your target device is an `ml_*` target.

   ```
   {
       "CompilationJobName": "neo-compilation-job-demo",
       "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
       "InputConfig": {
           "S3Uri": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/train",
           "DataInputConfig":  "{'data': [1,3,1024,1024]}",
           "Framework": "PYTORCH",
           "FrameworkVersion": "1.6"
       },
       "OutputConfig": {
           "S3OutputLocation": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/compile",
           # A target device specification example for a ml_c5 instance family
           "TargetDevice": "ml_c5",
           # When compiling for ml_* instances using PyTorch framework, use the "CompilerOptions" field in 
           # OutputConfig to provide the correct data type ("dtype") of the model’s input. Default assumed is "float32"
           "CompilerOptions": "{'dtype': 'long'}"
       },
       "StoppingCondition": {
           "MaxRuntimeInSeconds": 300
       }
   }
   ```

**Note**  
If you saved your model by using PyTorch version 2.0 or later, the `DataInputConfig` field is optional. SageMaker AI Neo gets the input configuration from the model definition file that you create with PyTorch. For more information about how to create the definition file, see the [PyTorch](neo-compilation-preparing-model.md#how-to-save-pytorch) section under *Saving Models for SageMaker AI Neo*.
The `FrameworkVersion` field is only supported for PyTorch.

------
#### [ Platform Example ]

   ```
   {
       "CompilationJobName": "neo-test-compilation-job",
       "RoleArn": "arn:aws:iam::<your-account>:role/service-role/AmazonSageMaker-ExecutionRole-yyyymmddThhmmss",
       "InputConfig": {
           "S3Uri": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/train",
           "DataInputConfig":  "{'data': [1,3,1024,1024]}",
           "Framework": "MXNET"
       },
       "OutputConfig": {
           "S3OutputLocation": "s3://<your-bucket>/sagemaker/neo-compilation-job-demo-data/compile",
           # A target platform configuration example for a p3.2xlarge instance
           "TargetPlatform": {
               "Os": "LINUX",
               "Arch": "X86_64",
               "Accelerator": "NVIDIA"
           },
           "CompilerOptions": "{'cuda-ver': '10.0', 'trt-ver': '6.0.1', 'gpu-code': 'sm_70'}"
       },
       "StoppingCondition": {
           "MaxRuntimeInSeconds": 300
       }
   }
   ```

------
**Note**  
In the `OutputConfig` API object, the `TargetDevice` and `TargetPlatform` fields are mutually exclusive. You must choose one of the two options.

   To find the JSON string examples of `DataInputConfig` depending on frameworks, see [What input data shapes Neo expects](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html#neo-troubleshooting-errors-preventing).

   For more information about setting up the configurations, see the [InputConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InputConfig.html), [OutputConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html), and [TargetPlatform](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TargetPlatform.html) API operations in the SageMaker API reference.

1. After you configure the JSON file, run the following command to create the compilation job:

   ```
   aws sagemaker create-compilation-job \
   --cli-input-json file://job.json \
   --region us-west-2 
   
   # You should get CompilationJobArn
   ```

1. Describe the compilation job by running the following command:

   ```
   aws sagemaker describe-compilation-job \
   --compilation-job-name $JOB_NM \
   --region us-west-2
   ```

1. Stop the compilation job by running the following command:

   ```
   aws sagemaker stop-compilation-job \
   --compilation-job-name $JOB_NM \
   --region us-west-2
   
   # There is no output for the stop-compilation-job operation
   ```

1. List the compilation job by running the following command:

   ```
   aws sagemaker list-compilation-jobs \
   --region us-west-2
   ```

# Compile a Model (Amazon SageMaker AI Console)
<a name="neo-job-compilation-console"></a>

You can create an Amazon SageMaker Neo compilation job in the Amazon SageMaker AI console.

1. In the **Amazon SageMaker AI** console, choose **Compilation jobs**, and then choose **Create compilation job**.  
![\[Create a compilation job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/8-create-compilation-job.png)

1. On the **Create compilation job** page, under **Job name**, enter a name. Then select an **IAM role**.  
![\[Create compilation job page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/9-create-compilation-job-config.png)

1. If you don’t have an IAM role, choose **Create a new role**.  
![\[Create IAM role location.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/10a-create-iam-role.png)

1. On the **Create an IAM role** page, choose **Any S3 bucket**, and choose **Create role**.  
![\[Create IAM role page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/10-create-iam-role.png)

1. 

------
#### [ Non PyTorch Frameworks ]

   Within the **Input configuration** section, enter the full path of the Amazon S3 bucket URI that contains your model artifacts in the **Location of model artifacts** input field. Your model artifacts must be in a compressed tarball file format (`.tar.gz`). 

   For the **Data input configuration** field, enter the JSON string that specifies the shape of the input data.

   For **Machine learning framework**, choose the framework of your choice.

![\[Input configuration page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/neo-create-compilation-job-input-config.png)


   To find the JSON string examples of input data shapes depending on frameworks, see [What input data shapes Neo expects](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting.html#neo-troubleshooting-errors-preventing).

------
#### [ PyTorch Framework ]

   Similar instructions apply for compiling PyTorch models. However, if you trained with PyTorch and are trying to compile the model for an `ml_*` target (except `ml_inf`), you can optionally specify the version of PyTorch you used.

![\[Example Input configuration section showing where to choose the Framework version.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/compile_console_pytorch.png)


   To find the JSON string examples of input data shapes depending on frameworks, see [What input data shapes Neo expects](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting.html#neo-troubleshooting-errors-preventing).

**Notes**  
If you saved your model by using PyTorch version 2.0 or later, the **Data input configuration field** is optional. SageMaker Neo gets the input configuration from the model definition file that you create with PyTorch. For more information about how to create the definition file, see the [PyTorch](neo-compilation-preparing-model.md#how-to-save-pytorch) section under *Saving Models for SageMaker AI Neo*.
When compiling for `ml_*` instances using the PyTorch framework, use the **Compiler options** field in **Output Configuration** to provide the correct data type (`dtype`) of the model’s input. The default is `"float32"`. 

![\[Example Output Configuration section.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/neo_compilation_console_pytorch_compiler_options.png)


**Warning**  
 If you specify an Amazon S3 bucket URI path that leads to a `.pth` file, you will receive the following error after starting compilation: `ClientError: InputConfiguration: Unable to untar input model.Please confirm the model is a tar.gz file` 

------

1.  Go to the **Output configuration** section. Choose where you want to deploy your model. You can deploy your model to a **Target device** or a **Target platform**. Target devices include cloud and edge devices. Target platforms refer to specific OS, architecture, and accelerators you want your model to run on. 

    For **S3 Output location**, enter the path to the S3 bucket where you want to store the model. You can optionally add compiler options in JSON format under the **Compiler options** section.   
![\[Output configuration page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/neo-console-output-config.png)

1. Check the status of the compilation job after it starts. The status appears at the top of the **Compilation Job** page, as shown in the following screenshot. You can also check it in the **Status** column.  
![\[Compilation job status.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/12-run-model-compilation.png)

1. Confirm that the compilation job completed. You can check the final status in the **Status** column, as shown in the following screenshot.  
![\[Compilation job status.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo/12a-completed-model-compilation.png)

# Compile a Model (Amazon SageMaker AI SDK)
<a name="neo-job-compilation-sagemaker-sdk"></a>

 You can use the [compile_model](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html?#sagemaker.estimator.Estimator.compile_model) API in the [Amazon SageMaker AI SDK for Python](https://sagemaker.readthedocs.io/en/stable/) to compile a trained model and optimize it for specific target hardware. Invoke the API on the estimator object that you used during model training. 

**Note**  
You must set the `MMS_DEFAULT_RESPONSE_TIMEOUT` environment variable to `500` when compiling the model with MXNet or PyTorch. The environment variable is not needed for TensorFlow. 

 The following is an example of how you can compile a model using the `trained_model_estimator` object: 

```
# Replace the value of expected_trained_model_input below and
# specify the name & shape of the expected inputs for your trained model
# in json dictionary form
expected_trained_model_input = {'data':[1, 784]}

# Replace the example target_instance_family below to your preferred target_instance_family
compiled_model = trained_model_estimator.compile_model(target_instance_family='ml_c5',
        input_shape=expected_trained_model_input,
        output_path='insert s3 output path',
        env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'})
```

The code compiles the model, saves the optimized model at `output_path`, and creates a SageMaker AI model that can be deployed to an endpoint. 
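To serve predictions, the compiled model can then be deployed like any other SageMaker AI model. The following is a minimal sketch wrapping the SDK's `deploy` call; the default instance type here is an assumption matching the `ml_c5` compilation target in the example above (the endpoint instance family must match the compilation target):

```python
def deploy_compiled(compiled_model, instance_type="ml.c5.xlarge"):
    """Deploy a Neo-compiled SageMaker AI model to a real-time endpoint.

    The instance type must belong to the same family as the compilation
    target (for example, target ml_c5 -> ml.c5.* endpoint instances).
    Returns a predictor whose predict() method invokes the endpoint.
    """
    return compiled_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
    )


# predictor = deploy_compiled(compiled_model)
# result = predictor.predict(payload)
# predictor.delete_endpoint()  # clean up when finished
```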

# Cloud Instances
<a name="neo-cloud-instances"></a>

Amazon SageMaker Neo provides compilation support for popular machine learning frameworks such as TensorFlow, PyTorch, MXNet, and more. You can deploy your compiled model to cloud instances and AWS Inferentia instances. For a full list of supported frameworks and instance types, see [Supported Instance Types and Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-cloud.html). 

You can compile your model in one of three ways: through the AWS CLI, the SageMaker AI console, or the SageMaker AI SDK for Python. See [Use Neo to Compile a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html) for more information. Once compiled, your model artifacts are stored in the Amazon S3 bucket URI you specified during the compilation job. You can deploy your compiled model to cloud instances and AWS Inferentia instances using the SageMaker AI SDK for Python, the AWS SDK for Python (Boto3), the AWS CLI, or the AWS console. 

If you deploy your model using AWS CLI, the console, or Boto3, you must select a Docker image Amazon ECR URI for your primary container. See [Neo Inference Container Images](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html) for a list of Amazon ECR URIs.
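For the Boto3 path, registering the compiled model looks like the following sketch. The image URI, S3 location, and role ARN are placeholders you supply; the function simply wraps the `CreateModel` call with the Neo inference container image as the primary container:

```python
def create_neo_model(sm_client, name, image_uri, model_data_url, role_arn):
    """Register a Neo-compiled model with SageMaker AI.

    sm_client      -- a SageMaker client, e.g. boto3.client("sagemaker")
    image_uri      -- Neo inference container ECR URI for your
                      framework and Region (see the list linked above)
    model_data_url -- S3 URI of the compiled model.tar.gz
    role_arn       -- execution role that SageMaker AI assumes
    """
    return sm_client.create_model(
        ModelName=name,
        PrimaryContainer={
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
        },
        ExecutionRoleArn=role_arn,
    )
```

After the model is created, you create an endpoint configuration and endpoint for it in the usual way (see the deployment topics that follow).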

**Topics**
+ [Supported Instance Types and Frameworks](neo-supported-cloud.md)
+ [Deploy a Model](neo-deployment-hosting-services.md)
+ [Inference Requests With a Deployed Service](neo-requests.md)
+ [Inference Container Images](neo-deployment-hosting-services-container-images.md)

# Supported Instance Types and Frameworks
<a name="neo-supported-cloud"></a>

Amazon SageMaker Neo supports popular deep learning frameworks for both compilation and deployment. You can deploy your model to cloud instances or AWS Inferentia instance types.

The following describes frameworks SageMaker Neo supports and the target cloud instances you can compile and deploy to. For information on how to deploy your compiled model to a cloud or Inferentia instance, see [Deploy a Model with Cloud Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services.html).

## Cloud Instances
<a name="neo-supported-cloud-instances"></a>

SageMaker Neo supports the following deep learning frameworks for CPU and GPU cloud instances: 


| Framework | Framework Version | Model Version | Models | Model Formats (packaged in \*.tar.gz) | Toolkits | 
| --- | --- | --- | --- | --- | --- | 
| MXNet | 1.8.0 | Supports 1.8.0 or earlier | Image Classification, Object Detection, Semantic Segmentation, Pose Estimation, Activity Recognition | One symbol file (.json) and one parameter file (.params) | GluonCV v0.8.0 | 
| ONNX | 1.7.0 | Supports 1.7.0 or earlier | Image Classification, SVM | One model file (.onnx) |  | 
| Keras | 2.2.4 | Supports 2.2.4 or earlier | Image Classification | One model definition file (.h5) |  | 
| PyTorch | 1.4, 1.5, 1.6, 1.7, 1.8, 1.12, 1.13, or 2.0 | Supports 1.4, 1.5, 1.6, 1.7, 1.8, 1.12, 1.13, and 2.0 | Image Classification; versions 1.13 and 2.0 also support Object Detection, Vision Transformer, and HuggingFace | One model definition file (.pt or .pth) with input dtype of float32 |  | 
| TensorFlow | 1.15.3 or 2.9 | Supports 1.15.3 and 2.9 | Image Classification | For saved models, one .pb or .pbtxt file and a variables directory that contains variables; for frozen models, only one .pb or .pbtxt file |  | 
| XGBoost | 1.3.3 | Supports 1.3.3 or earlier | Decision Trees | One XGBoost model file (.model) where the number of nodes in a tree is less than 2^31 |  | 

**Note**  
“Model Version” is the version of the framework used to train and export the model. 

## Instance Types
<a name="neo-supported-cloud-instances-types"></a>

 You can deploy your SageMaker AI compiled model to one of the cloud instances listed below: 


| Instance | Compute Type | 
| --- | --- | 
| `ml_c4` | Standard | 
| `ml_c5` | Standard | 
| `ml_m4` | Standard | 
| `ml_m5` | Standard | 
| `ml_p2` | Accelerated computing | 
| `ml_p3` | Accelerated computing | 
| `ml_g4dn` | Accelerated computing | 

 For information on the available vCPU, memory, and price per hour for each instance type, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 

**Note**  
When compiling for `ml_*` instances using the PyTorch framework, use the **Compiler options** field in **Output Configuration** to provide the correct data type (`dtype`) of the model’s input. The default is `"float32"`.

## AWS Inferentia
<a name="neo-supported-inferentia"></a>

 SageMaker Neo supports the following deep learning frameworks for Inf1: 


| Framework | Framework Version | Model Version | Models | Model Formats (packaged in \*.tar.gz) | Toolkits | 
| --- | --- | --- | --- | --- | --- | 
| MXNet | 1.5 or 1.8  | Supports 1.8, 1.5 and earlier | Image Classification, Object Detection, Semantic Segmentation, Pose Estimation, Activity Recognition | One symbol file (.json) and one parameter file (.params) | GluonCV v0.8.0 | 
| PyTorch | 1.7, 1.8 or 1.9 | Supports 1.9 and earlier | Image Classification | One model definition file (.pt or .pth) with input dtype of float32 |  | 
| TensorFlow | 1.15 or 2.5 | Supports 2.5, 1.15 and earlier | Image Classification | For saved models, one .pb or .pbtxt file and a variables directory that contains variables; for frozen models, only one .pb or .pbtxt file |  | 

**Note**  
“Model Version” is the version of the framework used to train and export the model.

You can deploy your SageMaker Neo-compiled model to AWS Inferentia-based Amazon EC2 Inf1 instances. AWS Inferentia is Amazon's first custom silicon chip designed to accelerate deep learning. Currently, you can use the `ml_inf1` instance to deploy your compiled models.

### AWS Inferentia2 and AWS Trainium
<a name="neo-supported-inferentia-trainium"></a>

Currently, you can deploy your SageMaker Neo-compiled model to AWS Inferentia2-based Amazon EC2 Inf2 instances (in US East (Ohio) Region), and to AWS Trainium-based Amazon EC2 Trn1 instances (in US East (N. Virginia) Region). For more information about supported models on these instances, see [ Model Architecture Fit Guidelines](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html) in the AWS Neuron documentation, and the examples in the [Neuron Github repository](https://github.com/aws-neuron/aws-neuron-sagemaker-samples).

# Deploy a Model
<a name="neo-deployment-hosting-services"></a>

To deploy an Amazon SageMaker Neo-compiled model to an HTTPS endpoint, you must configure and create the endpoint for the model using Amazon SageMaker AI hosting services. Currently, developers can use Amazon SageMaker APIs to deploy models onto ml.c5, ml.c4, ml.m5, ml.m4, ml.p3, ml.p2, and ml.inf1 instances. 

For [Inferentia](https://aws.amazon.com/machine-learning/inferentia/) and [Trainium](https://aws.amazon.com/machine-learning/trainium/) instances, models need to be compiled specifically for those instances. Models compiled for other instance types are not guaranteed to work with Inferentia or Trainium instances.

When you deploy a compiled model, the endpoint must use the same instance type as the compilation target. This creates a SageMaker AI endpoint that you can use to perform inferences. You can deploy a Neo-compiled model using any of the following: the [Amazon SageMaker AI SDK for Python](https://sagemaker.readthedocs.io/en/stable/), the [SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/reference/), or the [SageMaker AI console](https://console.aws.amazon.com/sagemaker).

**Note**  
For deploying a model using AWS CLI, the console, or Boto3, see [Neo Inference Container Images](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html) to select the inference image URI for your primary container. 

**Topics**
+ [Prerequisites](neo-deployment-hosting-services-prerequisites.md)
+ [Deploy a Compiled Model Using SageMaker SDK](neo-deployment-hosting-services-sdk.md)
+ [Deploy a Compiled Model Using Boto3](neo-deployment-hosting-services-boto3.md)
+ [Deploy a Compiled Model Using the AWS CLI](neo-deployment-hosting-services-cli.md)
+ [Deploy a Compiled Model Using the Console](neo-deployment-hosting-services-console.md)

# Prerequisites
<a name="neo-deployment-hosting-services-prerequisites"></a>

**Note**  
Follow the instructions in this section if you compiled your model using AWS SDK for Python (Boto3), AWS CLI, or the SageMaker AI console. 

To create a SageMaker Neo-compiled model, you need the following:

1. A Docker image Amazon ECR URI. You can select one that meets your needs from [this list](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html). 

1. An entry point script file:

   1. **For PyTorch and MXNet models:**

      *If you trained your model using SageMaker AI*, the training script must implement the functions described below. The training script serves as the entry point script during inference. In the example detailed in [ MNIST Training, Compilation and Deployment with MXNet Module and SageMaker Neo](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_neo_compilation_jobs/mxnet_mnist/mxnet_mnist_neo.html), the training script (`mnist.py`) implements the required functions.

      *If you did not train your model using SageMaker AI*, you need to provide an entry point script (`inference.py`) file that can be used at the time of inference. Depending on the framework (MXNet or PyTorch), the inference script location must conform to the SageMaker Python SDK [Model Directory Structure for MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#model-directory-structure) or [Model Directory Structure for PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#model-directory-structure). 

      When using Neo Inference Optimized Container images with **PyTorch** and **MXNet** on CPU and GPU instance types, the inference script must implement the following functions: 
      + `model_fn`: Loads the model. (Optional)
      + `input_fn`: Converts the incoming request payload into a numpy array.
      + `predict_fn`: Performs the prediction.
      + `output_fn`: Converts the prediction output into the response payload.
      + Alternatively, you can define `transform_fn` to combine `input_fn`, `predict_fn`, and `output_fn`.

      The following are examples of `inference.py` script within a directory named `code` (`code/inference.py`) for **PyTorch and MXNet (Gluon and Module).** The examples first load the model and then serve it on image data on a GPU: 

------
#### [ MXNet Module ]

      ```
      import numpy as np
      import json
      import mxnet as mx
      import neomx  # noqa: F401
      from collections import namedtuple
      
      Batch = namedtuple('Batch', ['data'])
      
      # Change the context to mx.cpu() if deploying to a CPU endpoint
      ctx = mx.gpu()
      
      def model_fn(model_dir):
          # The compiled model artifacts are saved with the prefix 'compiled'
          sym, arg_params, aux_params = mx.model.load_checkpoint('compiled', 0)
          mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
          exe = mod.bind(for_training=False,
                         data_shapes=[('data', (1,3,224,224))],
                         label_shapes=mod._label_shapes)
          mod.set_params(arg_params, aux_params, allow_missing=True)
          
          # Run warm-up inference on empty data during model load (required for GPU)
          data = mx.nd.empty((1,3,224,224), ctx=ctx)
          mod.forward(Batch([data]))
          return mod
      
      
      def transform_fn(mod, image, input_content_type, output_content_type):
          # pre-processing
          decoded = mx.image.imdecode(image)
          resized = mx.image.resize_short(decoded, 224)
          cropped, crop_info = mx.image.center_crop(resized, (224, 224))
          normalized = mx.image.color_normalize(cropped.astype(np.float32) / 255,
                                        mean=mx.nd.array([0.485, 0.456, 0.406]),
                                        std=mx.nd.array([0.229, 0.224, 0.225]))
          transposed = normalized.transpose((2, 0, 1))
          batchified = transposed.expand_dims(axis=0)
          casted = batchified.astype(dtype='float32')
          processed_input = casted.as_in_context(ctx)
      
          # prediction/inference
          mod.forward(Batch([processed_input]))
      
          # post-processing
          prob = mod.get_outputs()[0].asnumpy().tolist()
          prob_json = json.dumps(prob)
          return prob_json, output_content_type
      ```

------
#### [ MXNet Gluon ]

      ```
      import numpy as np
      import json
      import mxnet as mx
      import neomx  # noqa: F401
      
      # Change the context to mx.cpu() if deploying to a CPU endpoint
      ctx = mx.gpu()
      
      def model_fn(model_dir):
          # The compiled model artifacts are saved with the prefix 'compiled'
          block = mx.gluon.nn.SymbolBlock.imports('compiled-symbol.json',['data'],'compiled-0000.params', ctx=ctx)
          
          # Hybridize the model & pass required options for Neo: static_alloc=True & static_shape=True
          block.hybridize(static_alloc=True, static_shape=True)
          
          # Run warm-up inference on empty data during model load (required for GPU)
          data = mx.nd.empty((1,3,224,224), ctx=ctx)
          warm_up = block(data)
          return block
      
      
      def input_fn(image, input_content_type):
          # pre-processing
          decoded = mx.image.imdecode(image)
          resized = mx.image.resize_short(decoded, 224)
          cropped, crop_info = mx.image.center_crop(resized, (224, 224))
          normalized = mx.image.color_normalize(cropped.astype(np.float32) / 255,
                                        mean=mx.nd.array([0.485, 0.456, 0.406]),
                                        std=mx.nd.array([0.229, 0.224, 0.225]))
          transposed = normalized.transpose((2, 0, 1))
          batchified = transposed.expand_dims(axis=0)
          casted = batchified.astype(dtype='float32')
          processed_input = casted.as_in_context(ctx)
          return processed_input
      
      
      def predict_fn(processed_input_data, block):
          # prediction/inference
          prediction = block(processed_input_data)
          return prediction
      
      def output_fn(prediction, output_content_type):
          # post-processing
          prob = prediction.asnumpy().tolist()
          prob_json = json.dumps(prob)
          return prob_json, output_content_type
      ```

------
#### [ PyTorch 1.4 and Older ]

      ```
      import os
      import torch
      import torch.nn.parallel
      import torch.optim
      import torch.utils.data
      import torch.utils.data.distributed
      import torchvision.transforms as transforms
      from PIL import Image
      import io
      import json
      import pickle
      
      
      def model_fn(model_dir):
          """Load the model and return it.
          Providing this function is optional.
          There is a default model_fn available which will load the model
          compiled using SageMaker Neo. You can override it here.
      
          Keyword arguments:
          model_dir -- the directory path where the model artifacts are present
          """
      
          # The compiled model is saved as "compiled.pt"
          model_path = os.path.join(model_dir, 'compiled.pt')
          with torch.neo.config(model_dir=model_dir, neo_runtime=True):
              model = torch.jit.load(model_path)
              device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
              model = model.to(device)
      
          # We recommend that you run warm-up inference during model load
          sample_input_path = os.path.join(model_dir, 'sample_input.pkl')
          with open(sample_input_path, 'rb') as input_file:
              model_input = pickle.load(input_file)
          if torch.is_tensor(model_input):
              model_input = model_input.to(device)
              model(model_input)
          elif isinstance(model_input, tuple):
              model_input = (inp.to(device) for inp in model_input if torch.is_tensor(inp))
              model(*model_input)
          else:
              print("Only supports a torch tensor or a tuple of torch tensors")
      
          return model
      
      
      def transform_fn(model, request_body, request_content_type,
                       response_content_type):
          """Run prediction and return the output.
          The function
          1. Pre-processes the input request
          2. Runs prediction
          3. Post-processes the prediction output.
          """
          # preprocess
          decoded = Image.open(io.BytesIO(request_body))
          preprocess = transforms.Compose([
              transforms.Resize(256),
              transforms.CenterCrop(224),
              transforms.ToTensor(),
              transforms.Normalize(
                  mean=[
                      0.485, 0.456, 0.406], std=[
                      0.229, 0.224, 0.225]),
          ])
          normalized = preprocess(decoded)
          batchified = normalized.unsqueeze(0)
          # predict
          device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
          batchified = batchified.to(device)
          output = model.forward(batchified)
      
          return json.dumps(output.cpu().numpy().tolist()), response_content_type
      ```

------
#### [ PyTorch 1.5 and Newer ]

      ```
      import os
      import torch
      import torch.nn.parallel
      import torch.optim
      import torch.utils.data
      import torch.utils.data.distributed
      import torchvision.transforms as transforms
      from PIL import Image
      import io
      import json
      import pickle
      
      
      def model_fn(model_dir):
          """Load the model and return it.
          Providing this function is optional.
          There is a default_model_fn available, which will load the model
          compiled using SageMaker Neo. You can override the default here.
          The model_fn only needs to be defined if your model needs extra
          steps to load, and can otherwise be left undefined.
      
          Keyword arguments:
          model_dir -- the directory path where the model artifacts are present
          """
      
          # The compiled model is saved as "model.pt"
          model_path = os.path.join(model_dir, 'model.pt')
          device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
          model = torch.jit.load(model_path, map_location=device)
          model = model.to(device)
      
          return model
      
      
      def transform_fn(model, request_body, request_content_type,
                          response_content_type):
          """Run prediction and return the output.
          The function
          1. Pre-processes the input request
          2. Runs prediction
          3. Post-processes the prediction output.
          """
          # preprocess
          decoded = Image.open(io.BytesIO(request_body))
          preprocess = transforms.Compose([
                                      transforms.Resize(256),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize(
                                          mean=[
                                              0.485, 0.456, 0.406], std=[
                                              0.229, 0.224, 0.225]),
                                          ])
          normalized = preprocess(decoded)
          batchified = normalized.unsqueeze(0)
          
          # predict
          device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
          batchified = batchified.to(device)
          output = model.forward(batchified)
          return json.dumps(output.cpu().numpy().tolist()), response_content_type
      ```

------

   1.  **For Inf1 instances or ONNX, XGBoost, and Keras container images** 

      For all other Neo inference-optimized container images, or Inferentia instance types, the entry point script must implement the following functions for the Neo Deep Learning Runtime: 
      + `neo_preprocess`: Converts the incoming request payload into a numpy array.
      + `neo_postprocess`: Converts the prediction output from Neo Deep Learning Runtime into the response body.
**Note**  
The preceding two functions do not use any of the functionalities of MXNet, PyTorch, or TensorFlow.

      For examples of how to use these functions, see [Neo Model Compilation Sample Notebooks](https://docs.aws.amazon.com//sagemaker/latest/dg/neo.html#neo-sample-notebooks). 
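As an illustration only, a minimal entry point in this style might look like the following sketch. The function names are those required by the Neo Deep Learning Runtime; the JSON payload format and content types are assumptions for the example, and real pre-processing depends on your model's input:

```python
import json

import numpy as np


def neo_preprocess(payload, content_type):
    """Convert the incoming request payload into a numpy array.

    This example assumes a JSON-encoded nested list payload,
    e.g. b'[[1.0, 2.0, 3.0]]' -- adapt to your model's input.
    """
    if content_type != "application/json":
        raise ValueError("Unsupported content type: {}".format(content_type))
    return np.asarray(json.loads(payload), dtype="float32")


def neo_postprocess(result):
    """Convert the prediction output into a response body and content type."""
    body = json.dumps(np.asarray(result).tolist())
    return body, "application/json"
```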

   1. **For TensorFlow models**

      If your model requires custom pre- and post-processing logic before data is sent to the model, then you must specify an entry point script `inference.py` file that can be used at the time of inference. The script should implement either a pair of `input_handler` and `output_handler` functions or a single `handler` function. 
**Note**  
If a `handler` function is implemented, `input_handler` and `output_handler` are ignored. 

      The following is a code example of `inference.py` script that you can put together with the compile model to perform custom pre- and post-processing on an image classification model. The SageMaker AI client sends the image file as an `application/x-image` content type to the `input_handler` function, where it is converted to JSON. The converted image file is then sent to the [Tensorflow Model Server (TFX)](https://www.tensorflow.org/tfx/serving/api_rest) using the REST API. 

      ```
      import io
      import json
      
      import numpy as np
      from PIL import Image
      
      def input_handler(data, context):
          """ Pre-process request input before it is sent to TensorFlow Serving REST API
          
          Args:
          data (obj): the request data, in format of dict or string
          context (Context): an object containing request and configuration details
          
          Returns:
          (str): a JSON string that contains the request body
          """
          f = data.read()
          f = io.BytesIO(f)
          image = Image.open(f).convert('RGB')
          batch_size = 1
          image = np.asarray(image.resize((512, 512)))
          image = np.concatenate([image[np.newaxis, :, :]] * batch_size)
          body = json.dumps({"signature_name": "serving_default", "instances": image.tolist()})
          return body
      
      def output_handler(data, context):
          """Post-process TensorFlow Serving output before it is returned to the client.
          
          Args:
          data (obj): the TensorFlow serving response
          context (Context): an object containing request and configuration details
          
          Returns:
          (bytes, string): data to return to client, response content type
          """
          if data.status_code != 200:
              raise ValueError(data.content.decode('utf-8'))
      
          response_content_type = context.accept_header
          prediction = data.content
          return prediction, response_content_type
      ```

      If there is no custom pre- or post-processing, the SageMaker AI client converts the image file to JSON in a similar way before sending it over to the SageMaker AI endpoint. 

      For more information, see the [Deploying to TensorFlow Serving Endpoints in the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#providing-python-scripts-for-pre-pos-processing). 

1. The Amazon S3 bucket URI that contains the compiled model artifacts. 
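The `neo_preprocess`/`neo_postprocess` contract described earlier can be sketched as follows. This is a minimal illustration, not the sample-notebook implementation: it assumes the client sends a JSON-encoded nested list, whereas the sample notebooks decode `application/x-image` payloads with PIL.

```
import json

import numpy as np


def neo_preprocess(payload, content_type):
    """Convert the incoming request payload into a numpy array.

    Assumption: the client sends a JSON-encoded nested list. Adapt the
    decoding for your real content type (for example, application/x-image).
    """
    if content_type != 'application/json':
        raise ValueError('Unsupported content type: {}'.format(content_type))
    if isinstance(payload, bytes):
        payload = payload.decode('utf-8')
    return np.asarray(json.loads(payload), dtype=np.float32)


def neo_postprocess(result):
    """Convert the Neo Deep Learning Runtime output into a response body."""
    scores = result[0] if isinstance(result, (list, tuple)) else result
    if hasattr(scores, 'tolist'):  # numpy arrays are not JSON serializable
        scores = scores.tolist()
    return json.dumps(scores), 'application/json'
```

As the note above requires, this sketch uses no MXNet, PyTorch, or TensorFlow functionality.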

# Deploy a Compiled Model Using SageMaker SDK
<a name="neo-deployment-hosting-services-sdk"></a>

You must satisfy the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites) section if the model was compiled using AWS SDK for Python (Boto3), AWS CLI, or the Amazon SageMaker AI console. Based on how you compiled your model, follow one of the following sections to deploy a model compiled with SageMaker Neo.

**Topics**
+ [If you compiled your model using the SageMaker SDK](#neo-deployment-hosting-services-sdk-deploy-sm-sdk)
+ [If you compiled your model using MXNet or PyTorch](#neo-deployment-hosting-services-sdk-deploy-sm-boto3)
+ [If you compiled your model using Boto3, SageMaker console, or the CLI for TensorFlow](#neo-deployment-hosting-services-sdk-deploy-sm-boto3-tensorflow)

## If you compiled your model using the SageMaker SDK
<a name="neo-deployment-hosting-services-sdk-deploy-sm-sdk"></a>

The [sagemaker.Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html?highlight=sagemaker.Model) object handle for the compiled model supplies the [deploy()](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html?highlight=sagemaker.Model#sagemaker.model.Model.deploy) function, which enables you to create an endpoint to serve inference requests. The function lets you set the number and type of instances that are used for the endpoint. You must choose an instance type for which you compiled your model. For example, for the job compiled in the [Compile a Model (Amazon SageMaker SDK)](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation-sagemaker-sdk.html) section, this is `ml_c5`. 

```
predictor = compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.c5.4xlarge')

# Print the name of newly created endpoint
print(predictor.endpoint_name)
```

## If you compiled your model using MXNet or PyTorch
<a name="neo-deployment-hosting-services-sdk-deploy-sm-boto3"></a>

Create the SageMaker AI model and deploy it using the `deploy()` API under the framework-specific Model APIs. For MXNet, it is [MXNetModel](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html?highlight=MXNetModel#mxnet-model) and for PyTorch, it is [PyTorchModel](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html?highlight=PyTorchModel#sagemaker.pytorch.model.PyTorchModel). When you create and deploy a SageMaker AI model, you must set the `MMS_DEFAULT_RESPONSE_TIMEOUT` environment variable to `500`, specify the `entry_point` parameter as the inference script (`inference.py`), and specify the `source_dir` parameter as the directory location (`code`) of the inference script. To prepare the inference script (`inference.py`), follow the Prerequisites step. 

The following example shows how to use these functions to deploy a compiled model using the SageMaker AI SDK for Python: 

------
#### [ MXNet ]

```
from sagemaker.mxnet import MXNetModel

# Create SageMaker model and deploy an endpoint
sm_mxnet_compiled_model = MXNetModel(
    model_data='insert S3 path of compiled MXNet model archive',
    role='AmazonSageMaker-ExecutionRole',
    entry_point='inference.py',
    source_dir='code',
    framework_version='1.8.0',
    py_version='py3',
    image_uri='insert appropriate ECR Image URI for MXNet',
    env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'},
)

# Replace the example instance_type below to your preferred instance_type
predictor = sm_mxnet_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')

# Print the name of newly created endpoint
print(predictor.endpoint_name)
```

------
#### [ PyTorch 1.4 and Older ]

```
from sagemaker.pytorch import PyTorchModel

# Create SageMaker model and deploy an endpoint
sm_pytorch_compiled_model = PyTorchModel(
    model_data='insert S3 path of compiled PyTorch model archive',
    role='AmazonSageMaker-ExecutionRole',
    entry_point='inference.py',
    source_dir='code',
    framework_version='1.4.0',
    py_version='py3',
    image_uri='insert appropriate ECR Image URI for PyTorch',
    env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'},
)

# Replace the example instance_type below to your preferred instance_type
predictor = sm_pytorch_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')

# Print the name of newly created endpoint
print(predictor.endpoint_name)
```

------
#### [ PyTorch 1.5 and Newer ]

```
from sagemaker.pytorch import PyTorchModel

# Create SageMaker model and deploy an endpoint
sm_pytorch_compiled_model = PyTorchModel(
    model_data='insert S3 path of compiled PyTorch model archive',
    role='AmazonSageMaker-ExecutionRole',
    entry_point='inference.py',
    source_dir='code',
    framework_version='1.5',
    py_version='py3',
    image_uri='insert appropriate ECR Image URI for PyTorch',
)

# Replace the example instance_type below to your preferred instance_type
predictor = sm_pytorch_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')

# Print the name of newly created endpoint
print(predictor.endpoint_name)
```

------

**Note**  
The `AmazonSageMakerFullAccess` and `AmazonS3ReadOnlyAccess` policies must be attached to the `AmazonSageMaker-ExecutionRole` IAM role. 

## If you compiled your model using Boto3, SageMaker console, or the CLI for TensorFlow
<a name="neo-deployment-hosting-services-sdk-deploy-sm-boto3-tensorflow"></a>

Construct a `TensorFlowModel` object, then call `deploy()`: 

```
from sagemaker.tensorflow import TensorFlowModel

role='AmazonSageMaker-ExecutionRole'
model_path='S3 path for model file'
framework_image='inference container image URI'
tf_model = TensorFlowModel(model_data=model_path,
                framework_version='1.15.3',
                role=role, 
                image_uri=framework_image)
instance_type='ml.c5.xlarge'
predictor = tf_model.deploy(instance_type=instance_type,
                    initial_instance_count=1)
```

See [Deploying directly from model artifacts](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#deploying-directly-from-model-artifacts) for more information. 

You can select a Docker image Amazon ECR URI that meets your needs from [this list](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html). 

For more information on how to construct a `TensorFlowModel` object, see the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-serving-model). 

**Note**  
Your first inference request might have high latency if you deploy your model on a GPU, because an optimized compute kernel is built on the first inference request. We recommend that you create a warm-up file of inference requests and store it alongside your model file before sending the model to TensorFlow Serving. This is known as “warming up” the model. 

The following code snippet demonstrates how to produce the warm-up file for image classification example in the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites) section: 

```
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_log_pb2

with tf.python_io.TFRecordWriter("tf_serving_warmup_requests") as writer:       
    img = np.random.uniform(0, 1, size=[224, 224, 3]).astype(np.float32)
    img = np.expand_dims(img, axis=0)
    test_data = np.repeat(img, 1, axis=0)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'compiled_models'
    request.model_spec.signature_name = 'serving_default'
    request.inputs['Placeholder:0'].CopyFrom(tf.compat.v1.make_tensor_proto(test_data, shape=test_data.shape, dtype=tf.float32))
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())
```

For more information on how to “warm up” your model, see the [TensorFlow TFX page](https://www.tensorflow.org/tfx/serving/saved_model_warmup).
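TensorFlow Serving loads warm-up records from an `assets.extra` directory inside the versioned SavedModel directory. Assuming the `tf_serving_warmup_requests` file produced above and an example version directory named `1`, the packaged model might be laid out like this:

```
model/
└── 1/
    ├── saved_model.pb
    ├── variables/
    └── assets.extra/
        └── tf_serving_warmup_requests
```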

# Deploy a Compiled Model Using Boto3
<a name="neo-deployment-hosting-services-boto3"></a>

You must satisfy the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites) section if the model was compiled using AWS SDK for Python (Boto3), AWS CLI, or the Amazon SageMaker AI console. Follow the steps below to create and deploy a SageMaker Neo-compiled model using the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). 

**Topics**
+ [Deploy the Model](#neo-deployment-hosting-services-boto3-steps)

## Deploy the Model
<a name="neo-deployment-hosting-services-boto3-steps"></a>

After you have satisfied the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites), use the `create_model`, `create_endpoint_config`, and `create_endpoint` APIs. 

The following example shows how to use these APIs to deploy a model compiled with Neo: 

```
import boto3
client = boto3.client('sagemaker')

# create sagemaker model
create_model_api_response = client.create_model(
                                    ModelName='my-sagemaker-model',
                                    PrimaryContainer={
                                        'Image': <insert the ECR Image URI>,
                                        'ModelDataUrl': 's3://path/to/model/artifact/model.tar.gz',
                                        'Environment': {}
                                    },
                                    ExecutionRoleArn='ARN for AmazonSageMaker-ExecutionRole'
                            )

print ("create_model API response", create_model_api_response)

# create sagemaker endpoint config
create_endpoint_config_api_response = client.create_endpoint_config(
                                            EndpointConfigName='sagemaker-neomxnet-endpoint-configuration',
                                            ProductionVariants=[
                                                {
                                                    'VariantName': <provide your variant name>,
                                                    'ModelName': 'my-sagemaker-model',
                                                    'InitialInstanceCount': 1,
                                                    'InstanceType': <provide your instance type here>
                                                },
                                            ]
                                       )

print ("create_endpoint_config API response", create_endpoint_config_api_response)

# create sagemaker endpoint
create_endpoint_api_response = client.create_endpoint(
                                    EndpointName='provide your endpoint name',
                                    EndpointConfigName=<insert your endpoint config name>,
                                )

print ("create_endpoint API response", create_endpoint_api_response)
```

**Note**  
The `AmazonSageMakerFullAccess` and `AmazonS3ReadOnlyAccess` policies must be attached to the `AmazonSageMaker-ExecutionRole` IAM role. 

For full syntax of the `create_model`, `create_endpoint_config`, and `create_endpoint` APIs, see [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint), respectively. 

If you did not train your model using SageMaker AI, specify the following environment variables: 

------
#### [ MXNet and PyTorch ]

```
"Environment": {
    "SAGEMAKER_PROGRAM": "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
    "SAGEMAKER_REGION": "insert your region",
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "500"
}
```

------
#### [ TensorFlow ]

```
"Environment": {
    "SAGEMAKER_PROGRAM": "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
    "SAGEMAKER_REGION": "insert your region"
}
```

------

 If you trained your model using SageMaker AI, specify the environment variable `SAGEMAKER_SUBMIT_DIRECTORY` as the full Amazon S3 bucket URI that contains the training script. 

# Deploy a Compiled Model Using the AWS CLI
<a name="neo-deployment-hosting-services-cli"></a>

You must satisfy the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites) section if the model was compiled using AWS SDK for Python (Boto3), AWS CLI, or the Amazon SageMaker AI console. Follow the steps below to create and deploy a SageMaker Neo-compiled model using the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/). 

**Topics**
+ [Deploy the Model](#neo-deploy-cli)

## Deploy the Model
<a name="neo-deploy-cli"></a>

After you have satisfied the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites), use the `create-model`, `create-endpoint-config`, and `create-endpoint` AWS CLI commands. The following steps explain how to use these commands to deploy a model compiled with Neo: 



### Create a Model
<a name="neo-deployment-hosting-services-cli-create-model"></a>

From [Neo Inference Container Images](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html), select the inference image URI, and then use the `create-model` command to create a SageMaker AI model. You can do this in two steps: 

1. Create a `create_model.json` file. Within the file, specify the name of the model, the image URI, the path to the `model.tar.gz` file in your Amazon S3 bucket, and your SageMaker AI execution role: 

   ```
   {
       "ModelName": "insert model name",
       "PrimaryContainer": {
           "Image": "insert the ECR Image URI",
           "ModelDataUrl": "insert S3 archive URL",
           "Environment": {"See details below"}
       },
       "ExecutionRoleArn": "ARN for AmazonSageMaker-ExecutionRole"
   }
   ```

   If you trained your model using SageMaker AI, specify the following environment variable: 

   ```
   "Environment": {
       "SAGEMAKER_SUBMIT_DIRECTORY" : "[Full S3 path for *.tar.gz file containing the training script]"
   }
   ```

   If you did not train your model using SageMaker AI, specify the following environment variables: 

------
#### [ MXNet and PyTorch ]

   ```
   "Environment": {
       "SAGEMAKER_PROGRAM": "inference.py",
       "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
       "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
       "SAGEMAKER_REGION": "insert your region",
       "MMS_DEFAULT_RESPONSE_TIMEOUT": "500"
   }
   ```

------
#### [ TensorFlow ]

   ```
   "Environment": {
       "SAGEMAKER_PROGRAM": "inference.py",
       "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
       "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
       "SAGEMAKER_REGION": "insert your region"
   }
   ```

------
**Note**  
The `AmazonSageMakerFullAccess` and `AmazonS3ReadOnlyAccess` policies must be attached to the `AmazonSageMaker-ExecutionRole` IAM role. 

1. Run the following command:

   ```
   aws sagemaker create-model --cli-input-json file://create_model.json
   ```

   For the full syntax of the `create-model` command, see [create-model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). 

### Create an Endpoint Configuration
<a name="neo-deployment-hosting-services-cli-create-endpoint-config"></a>

After creating a SageMaker AI model, create the endpoint configuration using the `create-endpoint-config` API. To do this, create a JSON file with your endpoint configuration specifications. For example, you can use the following code template and save it as `create_config.json`: 

```
{
    "EndpointConfigName": "<provide your endpoint config name>",
    "ProductionVariants": [
        {
            "VariantName": "<provide your variant name>",
            "ModelName": "my-sagemaker-model",
            "InitialInstanceCount": 1,
            "InstanceType": "<provide your instance type here>",
            "InitialVariantWeight": 1.0
        }
    ]
}
```

Now run the following AWS CLI command to create your endpoint configuration: 

```
aws sagemaker create-endpoint-config --cli-input-json file://create_config.json
```

For the full syntax of the `create-endpoint-config` command, see [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html). 

### Create an Endpoint
<a name="neo-deployment-hosting-services-cli-create-endpoint"></a>

After you have created your endpoint configuration, create an endpoint using the `create-endpoint` API: 

```
aws sagemaker create-endpoint --endpoint-name '<provide your endpoint name>' --endpoint-config-name '<insert your endpoint config name>'
```

For the full syntax of the `create-endpoint` command, see [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html). 

# Deploy a Compiled Model Using the Console
<a name="neo-deployment-hosting-services-console"></a>

You must satisfy the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites) section if the model was compiled using AWS SDK for Python (Boto3), the AWS CLI, or the Amazon SageMaker AI console. Follow the steps below to create and deploy a SageMaker Neo-compiled model using the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

**Topics**
+ [Deploy the Model](#deploy-the-model-console-steps)

## Deploy the Model
<a name="deploy-the-model-console-steps"></a>

After you have satisfied the [prerequisites](https://docs.aws.amazon.com//sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites), use the following steps to deploy a model compiled with Neo: 

1. Choose **Models**, and then choose **Create models** from the **Inference** group. On the **Create model** page, complete the **Model name**, **IAM role**, and, if needed, the optional **VPC** fields.  
![\[Create Neo model for inference\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/create-pipeline-model.png)

1. To add information about the container used to deploy your model, choose **Add container**, then choose **Next**. Complete the **Container input options**, **Location of inference code image**, and **Location of model artifacts** fields, and optionally the **Container host name** and **Environmental variables** fields.  
![\[Create Neo model for inference\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo-deploy-console-container-definition.png)

1. To deploy Neo-compiled models, choose the following:
   + **Container input options**: Choose **Provide model artifacts and inference image**.
   + **Location of inference code image**: Choose the inference image URI from [Neo Inference Container Images](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-container-images.html), depending on the AWS Region and kind of application. 
   + **Location of model artifact**: Enter the Amazon S3 bucket URI of the compiled model artifact generated by the Neo compilation API.
   + **Environment variables**:
     + Leave this field blank for **SageMaker XGBoost**.
     + If you trained your model using SageMaker AI, specify the environment variable `SAGEMAKER_SUBMIT_DIRECTORY` as the Amazon S3 bucket URI that contains the training script. 
     + If you did not train your model using SageMaker AI, specify the following environment variables:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-console.html)

1. Confirm that the information for the containers is accurate, and then choose **Create model**. On the **Create model** landing page, choose **Create endpoint**.   
![\[Create Model landing page\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo-deploy-console-create-model-land-page.png)

1. On the **Create and configure endpoint** page, specify the **Endpoint name**. For **Attach endpoint configuration**, choose **Create a new endpoint configuration**.  
![\[Neo console create and configure endpoint UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo-deploy-console-config-endpoint.png)

1. On the **New endpoint configuration** page, specify the **Endpoint configuration name**.   
![\[Neo console new endpoint configuration UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo-deploy-console-new-endpoint-config.png)

1. Choose **Edit** next to the name of the model and specify the correct **Instance type** on the **Edit Production Variant** page. The **Instance type** value must match the one specified in your compilation job.  
![\[Neo console new endpoint configuration UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/neo-deploy-console-edit-production-variant.png)

1. Choose **Save**.

1. On the **New endpoint configuration** page, choose **Create endpoint configuration**, and then choose **Create endpoint**. 

# Inference Requests With a Deployed Service
<a name="neo-requests"></a>

If you have followed instructions in [Deploy a Model](neo-deployment-hosting-services.md), you should have a SageMaker AI endpoint set up and running. Regardless of how you deployed your Neo-compiled model, there are three ways you can submit inference requests: 

**Topics**
+ [Request Inferences from a Deployed Service (Amazon SageMaker SDK)](neo-requests-sdk.md)
+ [Request Inferences from a Deployed Service (Boto3)](neo-requests-boto3.md)
+ [Request Inferences from a Deployed Service (AWS CLI)](neo-requests-cli.md)

# Request Inferences from a Deployed Service (Amazon SageMaker SDK)
<a name="neo-requests-sdk"></a>

Use the following code examples to request inferences from your deployed service, based on the framework you used to train your model. The code examples for the different frameworks are similar. The main difference is that TensorFlow requires `application/json` as the content type. 

 

## PyTorch and MXNet
<a name="neo-requests-sdk-py-mxnet"></a>

 If you are using **PyTorch v1.4 or later** or **MXNet 1.7.0 or later** and you have an Amazon SageMaker AI endpoint `InService`, you can make inference requests using the `predictor` package of the SageMaker AI SDK for Python. 

**Note**  
The API varies based on the SageMaker AI SDK for Python version:  
For version 1.x, use the [RealTimePredictor](https://sagemaker.readthedocs.io/en/v1.72.0/api/inference/predictors.html#sagemaker.predictor.RealTimePredictor) and [RealTimePredictor.predict](https://sagemaker.readthedocs.io/en/v1.72.0/api/inference/predictors.html#sagemaker.predictor.RealTimePredictor.predict) APIs.
For version 2.x, use the [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor) and [Predictor.predict](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.predict) APIs.

The following code example shows how to use these APIs to send an image for inference: 

------
#### [ SageMaker Python SDK v1.x ]

```
from sagemaker.predictor import RealTimePredictor

endpoint = 'insert name of your endpoint here'

# Read image into memory
payload = None
with open("image.jpg", 'rb') as f:
    payload = f.read()

predictor = RealTimePredictor(endpoint=endpoint, content_type='application/x-image')
inference_response = predictor.predict(data=payload)
print (inference_response)
```

------
#### [ SageMaker Python SDK v2.x ]

```
from sagemaker.predictor import Predictor

endpoint = 'insert name of your endpoint here'

# Read image into memory
payload = None
with open("image.jpg", 'rb') as f:
    payload = f.read()
    
predictor = Predictor(endpoint)
inference_response = predictor.predict(data=payload)
print (inference_response)
```

------

## TensorFlow
<a name="neo-requests-sdk-py-tf"></a>

The following code example shows how to use the SageMaker Python SDK API to send an image for inference: 

```
from sagemaker.predictor import Predictor
from PIL import Image
import numpy as np
import json

endpoint = 'insert the name of your endpoint here'

# Read image into memory
input_file = 'path/to/image'
image = Image.open(input_file)
batch_size = 1
image = np.asarray(image.resize((224, 224)))
image = image / 128 - 1
image = np.concatenate([image[np.newaxis, :, :]] * batch_size)
body = json.dumps({"instances": image.tolist()})
    
predictor = Predictor(endpoint)
inference_response = predictor.predict(data=body)
print(inference_response)
```

# Request Inferences from a Deployed Service (Boto3)
<a name="neo-requests-boto3"></a>

 You can submit inference requests using the AWS SDK for Python (Boto3) SageMaker AI runtime client and the [InvokeEndpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) API once you have a SageMaker AI endpoint `InService`. The following code example shows how to send an image for inference: 

------
#### [ PyTorch and MXNet ]

```
import boto3

import json
 
endpoint = 'insert name of your endpoint here'
 
runtime = boto3.Session().client('sagemaker-runtime')
 
# Read image into memory
with open('image.jpg', 'rb') as f:
    payload = f.read()
# Send image via InvokeEndpoint API
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='application/x-image', Body=payload)

# Unpack response
result = json.loads(response['Body'].read().decode())
```

------
#### [ TensorFlow ]

For TensorFlow, submit an input with `application/json` as the content type. 

```
from PIL import Image
import numpy as np
import json
import boto3

client = boto3.client('sagemaker-runtime') 
input_file = 'path/to/image'
image = Image.open(input_file)
batch_size = 1
image = np.asarray(image.resize((224, 224)))
image = image / 128 - 1
image = np.concatenate([image[np.newaxis, :, :]] * batch_size)
body = json.dumps({"instances": image.tolist()})
ioc_predictor_endpoint_name = 'insert name of your endpoint here'
content_type = 'application/json'   
ioc_response = client.invoke_endpoint(
    EndpointName=ioc_predictor_endpoint_name,
    Body=body,
    ContentType=content_type
 )
```

------
#### [ XGBoost ]

 For an XGBoost application, submit CSV text instead: 

```
import boto3
import json
 
endpoint = 'insert your endpoint name here'
 
runtime = boto3.Session().client('sagemaker-runtime')
 
csv_text = '1,-1.0,1.0,1.5,2.6'
# Send CSV text via InvokeEndpoint API
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='text/csv', Body=csv_text)
# Unpack response
result = json.loads(response['Body'].read().decode())
```

------

 Note that BYOM allows for a custom content type. For more information, see the [InvokeEndpoint API reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html). 

# Request Inferences from a Deployed Service (AWS CLI)
<a name="neo-requests-cli"></a>

You can make inference requests with the AWS Command Line Interface (AWS CLI) using the [invoke-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker-runtime/invoke-endpoint.html) command once you have an Amazon SageMaker AI endpoint `InService`. The following example shows how to send an image for inference: 

```
aws sagemaker-runtime invoke-endpoint --endpoint-name 'insert name of your endpoint here' --body fileb://image.jpg --content-type=application/x-image output_file.txt
```

If the inference request succeeds, an `output_file.txt` containing the inference results is created.

For TensorFlow, submit an input with `application/json` as the content type:

```
aws sagemaker-runtime invoke-endpoint --endpoint-name 'insert name of your endpoint here' --body fileb://input.json --content-type=application/json output_file.txt
```

# Inference Container Images
<a name="neo-deployment-hosting-services-container-images"></a>

SageMaker Neo now provides inference image URI information for `ml_*` targets. For more information, see [DescribeCompilationJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCompilationJob.html#sagemaker-DescribeCompilationJob-response-InferenceImage).
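
For example, you can read the `InferenceImage` field from a compilation job description with Boto3. This is a sketch; the job name below is a placeholder you would replace with your own:

```python
def get_inference_image(describe_response):
    """Return the recommended inference image URI from a
    DescribeCompilationJob response, or None if it is absent."""
    return describe_response.get("InferenceImage")

if __name__ == "__main__":
    import boto3  # AWS SDK for Python

    sm_client = boto3.client("sagemaker")
    # "my-neo-compilation-job" is a placeholder job name
    response = sm_client.describe_compilation_job(
        CompilationJobName="my-neo-compilation-job"
    )
    print(get_inference_image(response))
```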

Based on your use case, replace the italicized placeholders in the inference image URI templates below with the appropriate values.

## Amazon SageMaker AI XGBoost
<a name="inference-container-collapse-xgboost"></a>

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/xgboost-neo:latest
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

## Keras
<a name="inference-container-collapse-keras"></a>

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-neo-keras:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `2.2.4`.

Replace *instance_type* with either `cpu` or `gpu`.

## MXNet
<a name="inference-container-collapse-mxnet"></a>

------
#### [ CPU or GPU instance types ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-inference-mxnet:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.8.0`.

Replace *instance_type* with either `cpu` or `gpu`.

------
#### [ Inferentia1 ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-neo-mxnet:fx_version-instance_type-py3
```

Replace *aws_region* with either `us-east-1` or `us-west-2`.

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.5.1`.

Replace *instance_type* with `inf`.

------

## ONNX
<a name="inference-container-collapse-onnx"></a>

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-neo-onnx:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.5.0`.

Replace *instance_type* with either `cpu` or `gpu`.

## PyTorch
<a name="inference-container-collapse-pytorch"></a>

------
#### [ CPU or GPU instance types ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-inference-pytorch:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.4`, `1.5`, `1.6`, `1.7`, `1.8`, `1.12`, `1.13`, or `2.0`.

Replace *instance_type* with either `cpu` or `gpu`.

------
#### [ Inferentia1 ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-neo-pytorch:fx_version-instance_type-py3
```

Replace *aws_region* with either `us-east-1` or `us-west-2`.

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.5.1`.

Replace *instance_type* with `inf`.

------
#### [ Inferentia2 and Trainium1 ]

```
763104351884.dkr.ecr.aws_region.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py38-sdk2.10.0-ubuntu20.04
```

Replace *aws_region* with `us-east-2` for Inferentia2, and `us-east-1` for Trainium1.

------

## TensorFlow
<a name="inference-container-collapse-tf"></a>

------
#### [ CPU or GPU instance types ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-inference-tensorflow:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used.

Replace *fx_version* with `1.15.3` or `2.9`.

Replace *instance_type* with either `cpu` or `gpu`.

------
#### [ Inferentia1 ]

```
aws_account_id.dkr.ecr.aws_region.amazonaws.com/sagemaker-neo-tensorflow:fx_version-instance_type-py3
```

Replace *aws_account_id* with the account ID from the table at the end of this page that corresponds to the *aws_region* you used. Note that for instance type `inf`, only `us-east-1` and `us-west-2` are supported.

Replace *fx_version* with `1.15.0`.

Replace *instance_type* with `inf`.

------
#### [ Inferentia2 and Trainium1 ]

```
763104351884.dkr.ecr.aws_region.amazonaws.com/tensorflow-inference-neuronx:2.10.1-neuronx-py38-sdk2.10.0-ubuntu20.04
```

Replace *aws_region* with `us-east-2` for Inferentia2, and `us-east-1` for Trainium1.

------

The following table maps *aws_account_id* to *aws_region*. Use this table to find the correct inference image URI for your application.


| aws_account_id | aws_region | 
| --- | --- | 
| 785573368785 | us-east-1 | 
| 007439368137 | us-east-2 | 
| 710691900526 | us-west-1 | 
| 301217895009 | us-west-2 | 
| 802834080501 | eu-west-1 | 
| 205493899709 | eu-west-2 | 
| 254080097072 | eu-west-3 | 
| 601324751636 | eu-north-1 | 
| 966458181534 | eu-south-1 | 
| 746233611703 | eu-central-1 | 
| 110948597952 | ap-east-1 | 
| 763008648453 | ap-south-1 | 
| 941853720454 | ap-northeast-1 | 
| 151534178276 | ap-northeast-2 | 
| 925152966179 | ap-northeast-3 | 
| 324986816169 | ap-southeast-1 | 
| 355873309152 | ap-southeast-2 | 
| 474822919863 | cn-northwest-1 | 
| 472730292857 | cn-north-1 | 
| 756306329178 | sa-east-1 | 
| 464438896020 | ca-central-1 | 
| 836785723513 | me-south-1 | 
| 774647643957 | af-south-1 | 
| 275950707576 | il-central-1 | 

# Edge Devices
<a name="neo-edge-devices"></a>

Amazon SageMaker Neo provides compilation support for popular machine learning frameworks. You can deploy your Neo-compiled model to edge devices such as the Raspberry Pi 3, Texas Instruments' Sitara, the Jetson TX1, and more. For a full list of supported frameworks and edge devices, see [Supported Frameworks, Devices, Systems, and Architectures](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html). 

You must configure your edge device so that it can use AWS services. One way to do this is to install DLR and Boto3 to your device. To do this, you must set up the authentication credentials. See [Boto3 AWS Configuration](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration) for more information. Once your model is compiled and your edge device is configured, you can download the model from Amazon S3 to your edge device. From there, you can use the [Deep Learning Runtime (DLR)](https://neo-ai-dlr.readthedocs.io/en/latest/index.html) to read the compiled model and make inferences. 

For first-time users, we recommend you check out the [Getting Started](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-getting-started-edge.html) guide. This guide walks you through how to set up your credentials, compile a model, deploy your model to a Raspberry Pi 3, and make inferences on images. 

**Topics**
+ [Supported Frameworks, Devices, Systems, and Architectures](neo-supported-devices-edge.md)
+ [Deploy Models](neo-deployment-edge.md)
+ [Set up Neo on Edge Devices](neo-getting-started-edge.md)

# Supported Frameworks, Devices, Systems, and Architectures
<a name="neo-supported-devices-edge"></a>

Amazon SageMaker Neo supports common machine learning frameworks, edge devices, operating systems, and chip architectures. Find out if Neo supports your framework, edge device, OS, and chip architecture by selecting one of the topics below.

You can find a list of models that have been tested by the Amazon SageMaker Neo Team in the [Tested Models](neo-supported-edge-tested-models.md) section.

**Note**  
Ambarella devices require additional files to be included within the compressed TAR file before it is sent for compilation. For more information, see [Troubleshoot Ambarella Errors](neo-troubleshooting-target-devices-ambarella.md).
TIM-VX (libtim-vx.so) is required for i.MX 8M Plus. For information on how to build TIM-VX, see the [TIM-VX GitHub repository](https://github.com/VeriSilicon/TIM-VX).

**Topics**
+ [Supported Frameworks](neo-supported-devices-edge-frameworks.md)
+ [Supported Devices, Chip Architectures, and Systems](neo-supported-devices-edge-devices.md)
+ [Tested Models](neo-supported-edge-tested-models.md)

# Supported Frameworks
<a name="neo-supported-devices-edge-frameworks"></a>

Amazon SageMaker Neo supports the following frameworks. 


| Framework | Framework Version | Model Version | Models | Model Formats (packaged in *.tar.gz) | Toolkits | 
| --- | --- | --- | --- | --- | --- | 
| MXNet | 1.8 | Supports 1.8 or earlier | Image Classification, Object Detection, Semantic Segmentation, Pose Estimation, Activity Recognition | One symbol file (.json) and one parameter file (.params) | GluonCV v0.8.0 | 
| ONNX | 1.7 | Supports 1.7 or earlier | Image Classification, SVM | One model file (.onnx) |  | 
| Keras | 2.2 | Supports 2.2 or earlier | Image Classification | One model definition file (.h5) |  | 
| PyTorch | 1.7, 1.8 | Supports 1.7, 1.8 or earlier | Image Classification, Object Detection | One model definition file (.pth) |  | 
| TensorFlow | 1.15, 2.4, 2.5 (only for ml.inf1.* instances) | Supports 1.15, 2.4, 2.5 (only for ml.inf1.* instances) or earlier | Image Classification, Object Detection | For saved models: one .pb or .pbtxt file and a variables directory that contains variables. For frozen models: only one .pb or .pbtxt file |  | 
| TensorFlow-Lite | 1.15 | Supports 1.15 or earlier | Image Classification, Object Detection | One model definition flatbuffer file (.tflite) |  | 
| XGBoost | 1.3 | Supports 1.3 or earlier | Decision Trees | One XGBoost model file (.model) where the number of nodes in a tree is less than 2^31 |  | 
| DARKNET |  |  | Image Classification, Object Detection (Yolo model is not supported) | One config (.cfg) file and one weights (.weights) file |  | 

# Supported Devices, Chip Architectures, and Systems
<a name="neo-supported-devices-edge-devices"></a>

Amazon SageMaker Neo supports the following devices, chip architectures, and operating systems.

## Devices
<a name="neo-supported-edge-devices"></a>

You can select a device using the dropdown list in the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker) or by specifying the `TargetDevice` in the output configuration of the [CreateCompilationJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCompilationJob.html) API.
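
To show what specifying a `TargetDevice` looks like in code, the following is a hedged sketch of a Boto3 `create_compilation_job` call targeting a Jetson Nano. The job name, role ARN, bucket names, framework, and data input shape are placeholders you would replace with your own:

```python
def build_compilation_job_params(job_name, role_arn, s3_model_uri,
                                 s3_output_uri, target_device):
    """Assemble the request parameters for CreateCompilationJob.
    The framework and input shape below are illustrative."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": s3_model_uri,
            # The input name and shape depend on your model
            "DataInputConfig": '{"data": [1, 3, 224, 224]}',
            "Framework": "MXNET",
        },
        "OutputConfig": {
            "S3OutputLocation": s3_output_uri,
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }

if __name__ == "__main__":
    import boto3

    params = build_compilation_job_params(
        "my-neo-job",                               # placeholder
        "arn:aws:iam::111122223333:role/NeoRole",   # placeholder
        "s3://amzn-s3-demo-bucket/model.tar.gz",    # placeholder
        "s3://amzn-s3-demo-bucket/output",          # placeholder
        "jetson_nano",
    )
    boto3.client("sagemaker").create_compilation_job(**params)
```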

You can choose from one of the following edge devices: 


| Device List | System on a Chip (SoC) | Operating System | Architecture | Accelerator | Compiler Options Example | 
| --- | --- | --- | --- | --- | --- | 
| aisage | None | Linux | ARM64 | Mali | None | 
| amba_cv2 | CV2 | Arch Linux | ARM64 | cvflow | None | 
| amba_cv22 | CV22 | Arch Linux | ARM64 | cvflow | None | 
| amba_cv25 | CV25 | Arch Linux | ARM64 | cvflow | None | 
| coreml | None | iOS, macOS | None | None | `{"class_labels": "imagenet_labels_1000.txt"}` | 
| imx8qm | NXP imx8 | Linux | ARM64 | None | None | 
| imx8mplus | i.MX 8M Plus | Linux | ARM64 | NPU | None | 
| jacinto_tda4vm | TDA4VM | Linux | ARM | TDA4VM | None | 
| jetson_nano | None | Linux | ARM64 | NVIDIA | `{'gpu-code': 'sm_53', 'trt-ver': '5.0.6', 'cuda-ver': '10.0'}`. For `TensorFlow2`, `{'JETPACK_VERSION': '4.6', 'gpu_code': 'sm_72'}` | 
| jetson_tx1 | None | Linux | ARM64 | NVIDIA | `{'gpu-code': 'sm_53', 'trt-ver': '6.0.1', 'cuda-ver': '10.0'}` | 
| jetson_tx2 | None | Linux | ARM64 | NVIDIA | `{'gpu-code': 'sm_62', 'trt-ver': '6.0.1', 'cuda-ver': '10.0'}` | 
| jetson_xavier | None | Linux | ARM64 | NVIDIA | `{'gpu-code': 'sm_72', 'trt-ver': '5.1.6', 'cuda-ver': '10.0'}` | 
| qcs605 | None | Android | ARM64 | Mali | `{'ANDROID_PLATFORM': 27}` | 
| qcs603 | None | Android | ARM64 | Mali | `{'ANDROID_PLATFORM': 27}` | 
| rasp3b | ARM A56 | Linux | ARM_EABIHF | None | `{'mattr': ['+neon']}` | 
| rasp4b | ARM A72 | None | None | None | None | 
| rk3288 | None | Linux | ARM_EABIHF | Mali | None | 
| rk3399 | None | Linux | ARM64 | Mali | None | 
| sbe_c | None | Linux | x86_64 | None | `{'mcpu': 'core-avx2'}` | 
| sitara_am57x | AM57X | Linux | ARM64 | EVE and/or C66x DSP | None | 
| x86_win32 | None | Windows 10 | X86_32 | None | None | 
| x86_win64 | None | Windows 10 | X86_64 | None | None | 

For more information about JSON key-value compiler options for each target device, see the `CompilerOptions` field in the [`OutputConfig` API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html) data type.

## Systems and Chip Architectures
<a name="neo-supported-edge-granular"></a>

The following look-up tables provide information regarding available operating systems and architectures for Neo model compilation jobs. 

------
#### [ Linux ]


| Accelerator | X86_64 | X86 | ARM64 | ARM_EABIHF | ARM_EABI | 
| --- | --- | --- | --- | --- | --- | 
| No accelerator (CPU) | Yes | No | Yes | Yes | Yes | 
| Nvidia GPU | Yes | No | Yes | No | No | 
| Intel Graphics | Yes | No | No | No | No | 
| ARM Mali | No | No | Yes | Yes | Yes | 

------
#### [ Android ]


| Accelerator | X86_64 | X86 | ARM64 | ARM_EABIHF | ARM_EABI | 
| --- | --- | --- | --- | --- | --- | 
| No accelerator (CPU) | Yes | Yes | Yes | No | Yes | 
| Nvidia GPU | No | No | No | No | No | 
| Intel Graphics | Yes | Yes | No | No | No | 
| ARM Mali | No | No | Yes | No | Yes | 

------
#### [ Windows ]


| Accelerator | X86_64 | X86 | ARM64 | ARM_EABIHF | ARM_EABI | 
| --- | --- | --- | --- | --- | --- | 
| No accelerator (CPU) | Yes | Yes | No | No | No | 

------

# Tested Models
<a name="neo-supported-edge-tested-models"></a>

The following collapsible sections provide information about machine learning models that were tested by the Amazon SageMaker Neo team. Expand the collapsible section based on your framework to check if a model was tested.

**Note**  
This is not a comprehensive list of models that can be compiled with Neo.

See [Supported Frameworks](neo-supported-devices-edge-frameworks.md) and [SageMaker AI Neo Supported Operators](https://aws.amazon.com/releasenotes/sagemaker-neo-supported-frameworks-and-operators/) to find out if you can compile your model with SageMaker Neo.

## DarkNet
<a name="collapsible-section-01"></a>


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| Alexnet |  |  |  |  |  |  |  |  |  | 
| Resnet50 | X | X |  | X | X | X |  | X | X | 
| YOLOv2 |  |  |  | X | X | X |  | X | X | 
| YOLOv2_tiny | X | X |  | X | X | X |  | X | X | 
| YOLOv3_416 |  |  |  | X | X | X |  | X | X | 
| YOLOv3_tiny | X | X |  | X | X | X |  | X | X | 

## MXNet
<a name="collapsible-section-02"></a>


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| Alexnet |  |  | X |  |  |  |  |  |  | 
| Densenet121 |  |  | X |  |  |  |  |  |  | 
| DenseNet201 | X | X | X | X | X | X |  | X | X | 
| GoogLeNet | X | X |  | X | X | X |  | X | X | 
| InceptionV3 |  |  |  | X | X | X |  | X | X | 
| MobileNet0.75 | X | X |  | X | X | X |  |  | X | 
| MobileNet1.0 | X | X | X | X | X | X |  |  | X | 
| MobileNetV2_0.5 | X | X |  | X | X | X |  |  | X | 
| MobileNetV2_1.0 | X | X | X | X | X | X | X | X | X | 
| MobileNetV3_Large | X | X | X | X | X | X | X | X | X | 
| MobileNetV3_Small | X | X | X | X | X | X | X | X | X | 
| ResNeSt50 |  |  |  | X | X |  |  | X | X | 
| ResNet18_v1 | X | X | X | X | X | X |  |  | X | 
| ResNet18_v2 | X | X |  | X | X | X |  |  | X | 
| ResNet50_v1 | X | X | X | X | X | X |  | X | X | 
| ResNet50_v2 | X | X | X | X | X | X |  | X | X | 
| ResNext101_32x4d |  |  |  |  |  |  |  |  |  | 
| ResNext50_32x4d | X |  | X | X | X |  |  | X | X | 
| SENet_154 |  |  |  | X | X | X |  | X | X | 
| SE_ResNext50_32x4d | X | X |  | X | X | X |  | X | X | 
| SqueezeNet1.0 | X | X | X | X | X | X |  |  | X | 
| SqueezeNet1.1 | X | X | X | X | X | X |  | X | X | 
| VGG11 | X | X | X | X | X |  |  | X | X | 
| Xception | X | X | X | X | X | X |  | X | X | 
| darknet53 | X | X |  | X | X | X |  | X | X | 
| resnet18_v1b_0.89 | X | X |  | X | X | X |  |  | X | 
| resnet50_v1d_0.11 | X | X |  | X | X | X |  |  | X | 
| resnet50_v1d_0.86 | X | X | X | X | X | X |  | X | X | 
| ssd_512_mobilenet1.0_coco | X |  | X | X | X | X |  | X | X | 
| ssd_512_mobilenet1.0_voc | X |  | X | X | X | X |  | X | X | 
| ssd_resnet50_v1 | X |  | X | X | X |  |  | X | X | 
| yolo3_darknet53_coco | X |  |  | X | X |  |  | X | X | 
| yolo3_mobilenet1.0_coco | X | X |  | X | X | X |  | X | X | 
| deeplab_resnet50 |  |  | X |  |  |  |  |  |  | 

## Keras
<a name="collapsible-section-03"></a>


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| densenet121 | X | X | X | X | X | X |  | X | X | 
| densenet201 | X | X | X | X | X | X |  |  | X | 
| inception_v3 | X | X |  | X | X | X |  | X | X | 
| mobilenet_v1 | X | X | X | X | X | X |  | X | X | 
| mobilenet_v2 | X | X | X | X | X | X |  | X | X | 
| resnet152_v1 |  |  |  | X | X |  |  |  | X | 
| resnet152_v2 |  |  |  | X | X |  |  |  | X | 
| resnet50_v1 | X | X | X | X | X |  |  | X | X | 
| resnet50_v2 | X | X | X | X | X | X |  | X | X | 
| vgg16 |  |  | X | X | X |  |  | X | X | 

## ONNX
<a name="collapsible-section-04"></a>


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| alexnet |  |  | X |  |  |  |  |  |  | 
| mobilenetv2-1.0 | X | X | X | X | X | X |  | X | X | 
| resnet18v1 | X |  |  | X | X |  |  |  | X | 
| resnet18v2 | X |  |  | X | X |  |  |  | X | 
| resnet50v1 | X |  | X | X | X |  |  | X | X | 
| resnet50v2 | X |  | X | X | X |  |  | X | X | 
| resnet152v1 |  |  |  | X | X | X |  |  | X | 
| resnet152v2 |  |  |  | X | X | X |  |  | X | 
| squeezenet1.1 | X |  | X | X | X | X |  | X | X | 
| vgg19 |  |  | X |  |  |  |  |  | X | 

## PyTorch (FP32)
<a name="collapsible-section-05"></a>


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Ambarella CV25 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| densenet121 | X | X | X | X | X | X | X |  | X | X | 
| inception_v3 |  | X |  |  | X | X | X |  | X | X | 
| resnet152 |  |  |  |  | X | X | X |  |  | X | 
| resnet18 | X | X |  |  | X | X | X |  |  | X | 
| resnet50 | X | X | X | X | X | X |  |  | X | X | 
| squeezenet1.0 | X | X |  |  | X | X | X |  |  | X | 
| squeezenet1.1 | X | X | X | X | X | X | X |  | X | X | 
| yolov4 |  |  |  |  | X | X |  |  |  |  | 
| yolov5 |  |  |  | X | X | X |  |  |  |  | 
| fasterrcnn_resnet50_fpn |  |  |  |  | X | X |  |  |  |  | 
| maskrcnn_resnet50_fpn |  |  |  |  | X | X |  |  |  |  | 

## TensorFlow
<a name="collapsible-section-06"></a>

------
#### [ TensorFlow ]


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Ambarella CV25 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| densenet201 | X | X | X | X | X | X | X |  | X | X | 
| inception_v3 | X | X | X |  | X | X | X |  | X | X | 
| mobilenet100_v1 | X | X | X |  | X | X | X |  |  | X | 
| mobilenet100_v2.0 | X | X | X |  | X | X | X |  | X | X | 
| mobilenet130_v2 | X | X |  |  | X | X | X |  |  | X | 
| mobilenet140_v2 | X | X | X |  | X | X | X |  | X | X | 
| resnet50_v1.5 | X | X |  |  | X | X | X |  | X | X | 
| resnet50_v2 | X | X | X | X | X | X | X |  | X | X | 
| squeezenet | X | X | X | X | X | X | X |  | X | X | 
| mask_rcnn_inception_resnet_v2 |  |  |  |  | X |  |  |  |  |  | 
| ssd_mobilenet_v2 |  |  |  |  | X | X |  |  |  |  | 
| faster_rcnn_resnet50_lowproposals |  |  |  |  | X |  |  |  |  |  | 
| rfcn_resnet101 |  |  |  |  | X |  |  |  |  |  | 

------
#### [ TensorFlow.Keras ]


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| DenseNet121  | X | X |  | X | X | X |  | X | X | 
| DenseNet201 | X | X |  | X | X | X |  |  | X | 
| InceptionV3 | X | X |  | X | X | X |  | X | X | 
| MobileNet | X | X |  | X | X | X |  | X | X | 
| MobileNetv2 | X | X |  | X | X | X |  | X | X | 
| NASNetLarge |  |  |  | X | X |  |  | X | X | 
| NASNetMobile | X | X |  | X | X | X |  | X | X | 
| ResNet101 |  |  |  | X | X | X |  |  | X | 
| ResNet101V2 |  |  |  | X | X | X |  |  | X | 
| ResNet152 |  |  |  | X | X |  |  |  | X | 
| ResNet152v2 |  |  |  | X | X |  |  |  | X | 
| ResNet50 | X | X |  | X | X |  |  | X | X | 
| ResNet50V2 | X | X |  | X | X | X |  | X | X | 
| VGG16 |  |  |  | X | X |  |  | X | X | 
| Xception | X | X |  | X | X | X |  | X | X | 

------

## TensorFlow-Lite
<a name="collapsible-section-07"></a>

------
#### [ TensorFlow-Lite (FP32) ]


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | i.MX 8M Plus | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| densenet_2018_04_27 | X |  |  | X | X | X |  |  | X |  | 
| inception_resnet_v2_2018_04_27 |  |  |  | X | X | X |  |  | X |  | 
| inception_v3_2018_04_27 |  |  |  | X | X | X |  |  | X | X | 
| inception_v4_2018_04_27 |  |  |  | X | X | X |  |  | X | X | 
| mnasnet_0.5_224_09_07_2018 | X |  |  | X | X | X |  |  | X |  | 
| mnasnet_1.0_224_09_07_2018 | X |  |  | X | X | X |  |  | X |  | 
| mnasnet_1.3_224_09_07_2018 | X |  |  | X | X | X |  |  | X |  | 
| mobilenet_v1_0.25_128 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_0.25_224 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_0.5_128 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_0.5_224 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_0.75_128 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_0.75_224 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_1.0_128 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v1_1.0_192 | X |  |  | X | X | X |  |  | X | X | 
| mobilenet_v2_1.0_224 | X |  |  | X | X | X |  |  | X | X | 
| resnet_v2_101 |  |  |  | X | X | X |  |  | X |  | 
| squeezenet_2018_04_27 | X |  |  | X | X | X |  |  | X |  | 

------
#### [ TensorFlow-Lite (INT8) ]


| Models | ARM V8 | ARM Mali | Ambarella CV22 | Nvidia | Panorama | TI TDA4VM | Qualcomm QCS603 | X86_Linux | X86_Windows | i.MX 8M Plus | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| inception_v1 |  |  |  |  |  |  | X |  |  | X | 
| inception_v2 |  |  |  |  |  |  | X |  |  | X | 
| inception_v3 | X |  |  |  |  | X | X |  | X | X | 
| inception_v4_299 | X |  |  |  |  | X | X |  | X | X | 
| mobilenet_v1_0.25_128 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_0.25_224 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_0.5_128 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_0.5_224 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_0.75_128 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_0.75_224 | X |  |  |  |  | X | X |  | X | X | 
| mobilenet_v1_1.0_128 | X |  |  |  |  | X |  |  | X | X | 
| mobilenet_v1_1.0_224 | X |  |  |  |  | X | X |  | X | X | 
| mobilenet_v2_1.0_224 | X |  |  |  |  | X | X |  | X | X | 
| deeplab-v3_513 |  |  |  |  |  |  | X |  |  |  | 

------

# Deploy Models
<a name="neo-deployment-edge"></a>

You can deploy a compiled model to resource-constrained edge devices either by downloading the compiled model from Amazon S3 to your device and using [DLR](https://github.com/neo-ai/neo-ai-dlr), or by using [AWS IoT Greengrass](https://docs.aws.amazon.com/greengrass/latest/developerguide/what-is-gg.html).

Before moving on, make sure your edge device is supported by SageMaker Neo. See [Supported Frameworks, Devices, Systems, and Architectures](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html) to find out which edge devices are supported. Also make sure that you specified your target edge device when you submitted the compilation job. For more information, see [Use Neo to Compile a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html).

## Deploy a Compiled Model (DLR)
<a name="neo-deployment-dlr"></a>

[DLR](https://github.com/neo-ai/neo-ai-dlr) is a compact, common runtime for deep learning models and decision tree models. DLR uses the [TVM](https://github.com/neo-ai/tvm) runtime, the [Treelite](https://treelite.readthedocs.io/en/latest/install.html) runtime, NVIDIA TensorRT™, and can include other hardware-specific runtimes. DLR provides unified Python/C++ APIs for loading and running compiled models on various devices.

You can install the latest release of the DLR package using the following pip command:

```
pip install dlr
```

To install DLR on GPU targets or non-x86 edge devices, refer to [Releases](https://github.com/neo-ai/neo-ai-dlr/releases) for prebuilt binaries, or [Installing DLR](https://neo-ai-dlr.readthedocs.io/en/latest/install.html) to build DLR from source. For example, to install DLR for a Raspberry Pi 3, you can use: 

```
pip install https://neo-ai-dlr-release.s3-us-west-2.amazonaws.com/v1.3.0/pi-armv7l-raspbian4.14.71-glibc2_24-libstdcpp3_4/dlr-1.3.0-py3-none-any.whl
```
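
After installing DLR, loading a compiled model and running inference takes only a few lines. The sketch below assumes an image-classification model compiled for CPU; the model directory, input tensor name, and preprocessing are assumptions you would adapt to your own model:

```python
import numpy as np

def preprocess(image_array):
    """Scale pixel values to the range [-1, 1] and add a batch
    dimension, mirroring the TensorFlow example earlier on this
    page. Your model may expect different preprocessing."""
    arr = np.asarray(image_array, dtype=np.float32) / 128.0 - 1.0
    return arr[np.newaxis, ...]

if __name__ == "__main__":
    from dlr import DLRModel  # installed with `pip install dlr`

    # "./compiled_model" is a placeholder for the directory holding
    # the compiled model artifacts downloaded from Amazon S3
    model = DLRModel("./compiled_model", "cpu")
    x = preprocess(np.zeros((224, 224, 3)))
    # The input name ("data") depends on how the model was compiled
    out = model.run({"data": x})
    print(np.argmax(out[0]))
```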

## Deploy a Model (AWS IoT Greengrass)
<a name="neo-deployment-greengrass"></a>

[AWS IoT Greengrass](https://docs.aws.amazon.com/greengrass/latest/developerguide/what-is-gg.html) extends cloud capabilities to local devices. It enables devices to collect and analyze data closer to the source of information, react autonomously to local events, and communicate securely with each other on local networks. With AWS IoT Greengrass, you can perform machine learning inference at the edge on locally generated data using cloud-trained models. Currently, you can deploy models onto all AWS IoT Greengrass devices based on ARM Cortex-A, Intel Atom, and Nvidia Jetson series processors. For more information on deploying a Lambda inference application to perform machine learning inferences with AWS IoT Greengrass, see [How to configure optimized machine learning inference using the AWS Management Console](https://docs.aws.amazon.com/greengrass/latest/developerguide/ml-dlc-console.html).

# Set up Neo on Edge Devices
<a name="neo-getting-started-edge"></a>

This guide to getting started with Amazon SageMaker Neo shows you how to compile a model, set up your device, and make inferences on your device. Most of the code examples use Boto3. We provide commands using AWS CLI where applicable, as well as instructions on how to satisfy prerequisites for Neo. 

**Note**  
You can run the following code snippets on your local machine, within a SageMaker notebook, within Amazon SageMaker Studio, or (depending on your edge device) on your edge device. The setup is similar; however, there are two main exceptions if you run this guide within a SageMaker notebook instance or SageMaker Studio session:   
+ You do not need to install Boto3.
+ You do not need to add the `AmazonSageMakerFullAccess` IAM policy.

 This guide assumes you are running the following instructions on your edge device. 

# Prerequisites
<a name="neo-getting-started-edge-step0"></a>

SageMaker Neo is a capability that allows you to train machine learning models once and run them anywhere in the cloud and at the edge. Before you can compile and optimize your models with Neo, there are a few prerequisites you need to set up. You must install the necessary Python libraries, configure your AWS credentials, create an IAM role with the required permissions, and set up an S3 bucket for storing model artifacts. You must also have a trained machine learning model ready. The following steps guide you through the setup:

1. **Install Boto3**

   If you are running these commands on your edge device, you must install the AWS SDK for Python (Boto3). Within a Python environment (preferably a virtual environment), run the following locally on your edge device's terminal or within a Jupyter notebook instance: 

------
#### [ Terminal ]

   ```
   pip install boto3
   ```

------
#### [ Jupyter Notebook ]

   ```
   !pip install boto3
   ```

------

1.  **Set Up AWS Credentials** 

   You need to set up Amazon Web Services credentials on your device in order to run SDK for Python (Boto3). By default, the AWS credentials should be stored in the file `~/.aws/credentials` on your edge device. Within the credentials file, you should see two environment variables: `aws_access_key_id` and `aws_secret_access_key`. 

   In your terminal, run: 

   ```
   $ more ~/.aws/credentials
   
   [default]
   aws_access_key_id = YOUR_ACCESS_KEY
   aws_secret_access_key = YOUR_SECRET_KEY
   ```

   The [AWS General Reference Guide](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) has instructions on how to get the necessary `aws_access_key_id` and `aws_secret_access_key`. For more information on how to set up credentials on your device, see the [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration) documentation. 
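If you want to verify the file programmatically before moving on, you can parse it with Python's standard `configparser` module. This is a quick local sanity check, not an AWS API call; the path and profile name are the defaults assumed above:

```python
import configparser
import os

def has_default_credentials(path=os.path.expanduser('~/.aws/credentials')):
    """Return True if a [default] profile with both keys is present."""
    config = configparser.ConfigParser()
    config.read(path)  # a missing file is silently ignored
    return ('default' in config
            and 'aws_access_key_id' in config['default']
            and 'aws_secret_access_key' in config['default'])
```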

1.  **Set up an IAM Role and attach policies.** 

   Neo needs access to your S3 bucket URI. Create an IAM role that can run SageMaker AI and has permission to access the S3 URI. You can create an IAM role by using the SDK for Python (Boto3), the console, or the AWS CLI. The following example illustrates how to create an IAM role using the SDK for Python (Boto3): 

   ```
   import boto3
   
   AWS_REGION = 'aws-region'
   
   # Create an IAM client to interact with IAM
   iam_client = boto3.client('iam', region_name=AWS_REGION)
   role_name = 'role-name'
   ```

   For more information on how to create an IAM role with the console, AWS CLI, or through the AWS API, see [Creating an IAM user in your AWS account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_api).

    Create a dictionary describing the trust policy for the role. This policy allows SageMaker AI to assume the role and is used when creating the new IAM role. 

   ```
   policy = {
       'Statement': [
           {
               'Action': 'sts:AssumeRole',
               'Effect': 'Allow',
               'Principal': {'Service': 'sagemaker.amazonaws.com'},
           }],  
       'Version': '2012-10-17'
   }
   ```

   Create a new IAM role using the policy you defined above:

   ```
   import json 
   
   new_role = iam_client.create_role(
       AssumeRolePolicyDocument=json.dumps(policy),
       Path='/',
       RoleName=role_name
   )
   ```

   You need the role's Amazon Resource Name (ARN) when you create a compilation job in a later step, so store it in a variable as well. 

   ```
   role_arn = new_role['Role']['Arn']
   ```

    Now that you have created a new role, attach the permissions it needs to interact with Amazon SageMaker AI and Amazon S3: 

   ```
   iam_client.attach_role_policy(
       RoleName=role_name,
       PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
   )
   
   iam_client.attach_role_policy(
       RoleName=role_name,
       PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
   );
   ```

1. **Create an Amazon S3 bucket to store your model artifacts**

   SageMaker Neo will access your model artifacts from Amazon S3.

------
#### [ Boto3 ]

   ```
   # Create an S3 client
   s3_client = boto3.client('s3', region_name=AWS_REGION)
   
   # Name buckets
   bucket='name-of-your-bucket'
   
   # Check if bucket exists
   if boto3.resource('s3').Bucket(bucket) not in boto3.resource('s3').buckets.all():
       s3_client.create_bucket(
           Bucket=bucket,
           CreateBucketConfiguration={
               'LocationConstraint': AWS_REGION
           }
       )
   else:
       print(f'Bucket {bucket} already exists. No action needed.')
   ```

------
#### [ CLI ]

   ```
   aws s3 mb s3://'name-of-your-bucket' --region specify-your-region 
   
   # Check your bucket exists
   aws s3 ls s3://'name-of-your-bucket'/
   ```

------

1. **Train a machine learning model**

   See [Train a Model with Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) for more information on how to train a machine learning model using Amazon SageMaker AI. You can optionally upload your locally trained model directly into an Amazon S3 URI bucket. 
**Note**  
 Make sure the model is correctly formatted depending on the framework you used. See [What input data shapes does SageMaker Neo expect?](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html#neo-job-compilation-expected-inputs) 

   If you do not have a model yet, use the `curl` command to download a local copy of the `coco_ssd_mobilenet` model from TensorFlow’s website. This is an object detection model trained on the [COCO dataset](https://cocodataset.org/#home). Type the following into your Jupyter notebook:

   ```
   model_zip_filename = './coco_ssd_mobilenet_v1_1.0.zip'
   !curl http://storage.googleapis.com/download.tensorflow.org/models/tflite/coco_ssd_mobilenet_v1_1.0_quant_2018_06_29.zip \
       --output {model_zip_filename}
   ```

   Note that this particular example was packaged in a .zip file. Unzip this file and repackage it as a compressed tarfile (`.tar.gz`) before using it in later steps. Type the following into your Jupyter notebook: 

   ```
   # Extract model from zip file
   !unzip -u {model_zip_filename}
   
   model_filename = 'detect.tflite'
   model_name = model_filename.split('.')[0]
   
   # Compress model into .tar.gz so SageMaker Neo can use it
   model_tar = model_name + '.tar.gz'
   !tar -czf {model_tar} {model_filename}
   ```

1. **Upload trained model to an S3 bucket**

   Once you have trained your machine learning model, store it in an S3 bucket. 

------
#### [ Boto3 ]

   ```
   # Upload the compressed model tarball
   s3_client.upload_file(Filename=model_tar, Bucket=bucket, Key=model_tar)
   ```

------
#### [ CLI ]

   Replace `your-model-filename` and `amzn-s3-demo-bucket` with the name of your S3 bucket. 

   ```
   aws s3 cp your-model-filename s3://amzn-s3-demo-bucket
   ```

------

# Compile the Model
<a name="neo-getting-started-edge-step1"></a>

Once you have satisfied the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-getting-started-edge.html#neo-getting-started-edge-step0), you can compile your model with Amazon SageMaker Neo. You can compile your model using the AWS CLI, the console, or the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html); for more information, see [Use Neo to Compile a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html). In this example, you compile your model with Boto3.

To compile a model, SageMaker Neo requires the following information:

1.  **The Amazon S3 bucket URI where you stored the trained model.** 

   If you followed the prerequisites, the name of your bucket is stored in a variable named `bucket`. The following code snippet shows how to list all of your buckets using the AWS CLI: 

   ```
   aws s3 ls
   ```

   For example: 

   ```
   $ aws s3 ls
   2020-11-02 17:08:50 bucket
   ```

1.  **The Amazon S3 bucket URI where you want to save the compiled model.** 

   The code snippet below concatenates your Amazon S3 bucket URI with the name of an output directory called `output`: 

   ```
   s3_output_location = f's3://{bucket}/output'
   ```

1.  **The machine learning framework you used to train your model.** 

   Define the framework you used to train your model.

   ```
   framework = 'framework-name'
   ```

   For example, if you want to compile a model that was trained using TensorFlow, you can use either `tflite` or `tensorflow`. Use `tflite` if you want to use a lighter version of TensorFlow that uses less storage memory. 

   ```
   framework = 'tflite'
   ```

   For a complete list of Neo-supported frameworks, see [Supported Frameworks, Devices, Systems, and Architectures](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html). 

1.  **The shape of your model's input.** 

    Neo requires the name and shape of your input tensor. The name and shape are passed in as key-value pairs. `value` is a list of the integer dimensions of an input tensor and `key` is the exact name of an input tensor in the model. 

   ```
   data_shape = '{"name": [tensor-shape]}'
   ```

   For example:

   ```
   data_shape = '{"normalized_input_image_tensor":[1, 300, 300, 3]}'
   ```
**Note**  
Make sure the model is correctly formatted depending on the framework you used. See [What input data shapes does SageMaker Neo expect?](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html#neo-job-compilation-expected-inputs) The key in this dictionary must be changed to the new input tensor's name.
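Because `DataInputConfig` is a JSON-formatted string, it can be safer to build it with `json.dumps` than to hand-write the quoting. A small sketch using the example tensor name above:

```python
import json

# Key: exact input tensor name; value: list of integer dimensions
input_shapes = {'normalized_input_image_tensor': [1, 300, 300, 3]}
data_shape = json.dumps(input_shapes)

print(data_shape)  # {"normalized_input_image_tensor": [1, 300, 300, 3]}
```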

1.  **Either the name of the target device to compile for or the general details of the hardware platform** 

   ```
   target_device = 'target-device-name'
   ```

   For example, if you want to deploy to a Raspberry Pi 3, use: 

   ```
   target_device = 'rasp3b'
   ```

   You can find the entire list of supported edge devices in [Supported Frameworks, Devices, Systems, and Architectures](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html).
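If your exact device is not in the `TargetDevice` list, `create_compilation_job` also accepts a `TargetPlatform` describing the hardware in general terms (`Os`, `Arch`, and optionally `Accelerator`). For example, for a generic 64-bit Arm Linux board you might use:

```python
# Passed inside OutputConfig in place of 'TargetDevice'
target_platform = {
    'Os': 'LINUX',    # operating system of the target
    'Arch': 'ARM64',  # CPU architecture
}
```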

 Now that you have completed the previous steps, you can submit a compilation job to Neo. 

```
# Create a SageMaker client so you can submit a compilation job
sagemaker_client = boto3.client('sagemaker', region_name=AWS_REGION)

# The S3 URI where you uploaded your trained model tarball
s3_input_location = f's3://{bucket}/{model_tar}'

# Give your compilation job a name
compilation_job_name = 'getting-started-demo'
print(f'Compilation job for {compilation_job_name} started')

response = sagemaker_client.create_compilation_job(
    CompilationJobName=compilation_job_name,
    RoleArn=role_arn,
    InputConfig={
        'S3Uri': s3_input_location,
        'DataInputConfig': data_shape,
        'Framework': framework.upper()
    },
    OutputConfig={
        'S3OutputLocation': s3_output_location,
        'TargetDevice': target_device 
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 900
    }
)

# Optional - Poll every 30 sec to check completion status
import time

while True:
    response = sagemaker_client.describe_compilation_job(CompilationJobName=compilation_job_name)
    if response['CompilationJobStatus'] == 'COMPLETED':
        break
    elif response['CompilationJobStatus'] == 'FAILED':
        raise RuntimeError('Compilation failed')
    print('Compiling ...')
    time.sleep(30)
print('Done!')
```

If you want additional information for debugging, include the following print statement:

```
print(response)
```

If the compilation job is successful, your compiled model is stored in the output Amazon S3 location you specified earlier (`s3_output_location`). Download your compiled model locally: 

```
object_path = f'output/{model_name}-{target_device}.tar.gz'
neo_compiled_model = f'compiled-{model_name}.tar.gz'
s3_client.download_file(bucket, object_path, neo_compiled_model)
```

# Set Up Your Device
<a name="neo-getting-started-edge-step2"></a>

You will need to install packages on your edge device so that your device can make inferences. You will also need to either install [AWS IoT Greengrass](https://docs.aws.amazon.com/greengrass/latest/developerguide/what-is-gg.html) core or [Deep Learning Runtime (DLR)](https://github.com/neo-ai/neo-ai-dlr). In this example, you will install packages required to make inferences for the `coco_ssd_mobilenet` object detection algorithm and you will use DLR.

1. **Install additional packages**

   In addition to Boto3, you must install certain libraries on your edge device. What libraries you install depends on your use case. 

   For example, for the `coco_ssd_mobilenet` object detection algorithm you downloaded earlier, you need to install [NumPy](https://numpy.org/) for data manipulation and statistics, [PIL](https://pillow.readthedocs.io/en/stable/) to load images, and [Matplotlib](https://matplotlib.org/) to generate plots. You also need a copy of TensorFlow if you want to gauge the impact of compiling with Neo versus a baseline. 

   ```
   !pip3 install numpy pillow tensorflow matplotlib 
   ```

1. **Install inference engine on your device**

   To run your Neo-compiled model, install the [Deep Learning Runtime (DLR)](https://github.com/neo-ai/neo-ai-dlr) on your device. DLR is a compact, common runtime for deep learning models and decision tree models. On x86_64 CPU targets running Linux, you can install the latest release of the DLR package using the following `pip` command:

   ```
   !pip install dlr
   ```

   For installation of DLR on GPU targets or non-x86 edge devices, refer to [Releases](https://github.com/neo-ai/neo-ai-dlr/releases) for prebuilt binaries, or [Installing DLR](https://neo-ai-dlr.readthedocs.io/en/latest/install.html) for building DLR from source. For example, to install DLR for Raspberry Pi 3, you can use: 

   ```
   !pip install https://neo-ai-dlr-release.s3-us-west-2.amazonaws.com/v1.3.0/pi-armv7l-raspbian4.14.71-glibc2_24-libstdcpp3_4/dlr-1.3.0-py3-none-any.whl
   ```

# Make Inferences on Your Device
<a name="neo-getting-started-edge-step3"></a>

In this example, you will use Boto3 to download the output of your compilation job onto your edge device. You will then import DLR, download an example image from the dataset, resize the image to match the model’s original input, and make a prediction.

1. **Download your compiled model from Amazon S3 to your device and extract it from the compressed tarfile.** 

   ```
   # Download compiled model locally to edge device
   object_path = f'output/{model_name}-{target_device}.tar.gz'
   neo_compiled_model = f'compiled-{model_name}.tar.gz'
   s3_client.download_file(bucket, object_path, neo_compiled_model)
   
   # Extract model from .tar.gz so DLR can use it
   !mkdir ./dlr_model # make a directory to store your model (optional)
   !tar -xzvf ./compiled-detect.tar.gz --directory ./dlr_model
   ```

1. **Import DLR and initialize a `DLRModel` object.**

   ```
   import dlr
   
   device = 'cpu'
   model = dlr.DLRModel('./dlr_model', device)
   ```

1. **Download an image for inferencing and format it based on how your model was trained**.

   For the `coco_ssd_mobilenet` example, you can download an image from the [COCO dataset](https://cocodataset.org/#home) and then reform the image to `300x300`: 

   ```
   from PIL import Image
   import numpy as np
   
   # Download an image for the model to make a prediction
   input_image_filename = './input_image.jpg'
   !curl https://farm9.staticflickr.com/8325/8077197378_79efb4805e_z.jpg --output {input_image_filename}
   
   # Open and format the image so the model can make predictions
   image = Image.open(input_image_filename)
   resized_image = image.resize((300, 300))
   
   # Model is quantized, so convert the image to uint8
   x = np.array(resized_image).astype('uint8')
   ```

1. **Use DLR to make inferences**.

   Finally, you can use DLR to make a prediction on the image you just downloaded: 

   ```
   out = model.run(x)
   ```
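For this SSD MobileNet example, `out` is typically a list whose first three arrays are detection boxes, class IDs, and confidence scores. The following is a minimal, framework-agnostic sketch for keeping only confident detections; the output layout is an assumption based on the standard TFLite SSD postprocessing, so verify it against your own model:

```python
def top_detections(out, score_threshold=0.5):
    # Assumed layout: out[0] boxes (1, N, 4), out[1] class IDs (1, N),
    # out[2] scores (1, N)
    boxes, classes, scores = out[0][0], out[1][0], out[2][0]
    keep = [i for i, s in enumerate(scores) if s >= score_threshold]
    return ([boxes[i] for i in keep],
            [int(classes[i]) for i in keep],
            [float(scores[i]) for i in keep])
```

Called as `top_detections(out)` after `model.run(x)`, this returns only the detections above the confidence threshold.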

For more examples using DLR to make inferences from a Neo-compiled model on an edge device, see the [neo-ai-dlr Github repository](https://github.com/neo-ai/neo-ai-dlr). 

# Troubleshoot Errors
<a name="neo-troubleshooting"></a>

This section contains information about how to understand and prevent common errors, the error messages they generate, and guidance on how to resolve these errors. Before moving on, ask yourself the following questions:

 **Did you encounter an error before you deployed your model?** If yes, see [Troubleshoot Neo Compilation Errors](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html). 

 **Did you encounter an error after you compiled your model?** If yes, see [Troubleshoot Neo Inference Errors](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-inference.html). 

**Did you encounter an error trying to compile your model for Ambarella devices?** If yes, see [Troubleshoot Ambarella Errors](neo-troubleshooting-target-devices-ambarella.md).

## Error Classification Types
<a name="neo-error-messages"></a>

This list classifies the *user errors* you can receive from Neo. These include access and permission errors and load errors for each of the supported frameworks. All other errors are *system errors*.

### Client permission error
<a name="neo-error-client-permission"></a>

 Neo passes the errors for these straight through from the dependent service. 
+ *Access Denied* when calling sts:AssumeRole
+ *Any 400* error when calling Amazon S3 to download or upload a client model
+ *PassRole* error

### Load error
<a name="collapsible-section-2"></a>

Assuming that the Neo compiler successfully loaded the `.tar.gz` file from Amazon S3, check whether the tarball contains the necessary files for compilation. The checking criteria are framework-specific: 
+ **TensorFlow**: Expects only one protobuf file (`.pb` or `.pbtxt`). For saved models, expects one variables folder. 
+ **PyTorch**: Expects only one PyTorch file (`.pth`).
+ **MXNet**: Expects only one symbol file (`.json`) and one parameter file (`.params`).
+ **XGBoost**: Expects only one XGBoost model file (`.model`). The input model has a size limitation.
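Before uploading, you can check a tarball against these criteria locally. The following is a sketch; the suffix table follows the list above, and the helper name is illustrative:

```python
import tarfile

EXPECTED_SUFFIXES = {
    'tensorflow': ('.pb', '.pbtxt'),
    'pytorch': ('.pth',),
    'mxnet': ('.json', '.params'),
    'xgboost': ('.model',),
}

def model_files_in_tarball(path, framework):
    """List files in a .tar.gz that match the framework's expected suffixes."""
    with tarfile.open(path, 'r:gz') as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    return [n for n in names if n.endswith(EXPECTED_SUFFIXES[framework])]
```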

### Compilation error
<a name="neo-error-compilation"></a>

Assuming that the Neo compiler successfully loaded the `.tar.gz` file from Amazon S3 and that the tarball contains the necessary files for compilation, the compilation can still fail with one of the following errors: 
+ **OperatorNotImplemented**: An operator has not been implemented.
+ **OperatorAttributeNotImplemented**: The attribute in the specified operator has not been implemented. 
+ **OperatorAttributeRequired**: An attribute is required for an internal symbol graph, but it is not listed in the user input model graph. 
+ **OperatorAttributeValueNotValid**: The value of the attribute in the specific operator is not valid. 

**Topics**
+ [Error Classification Types](#neo-error-messages)
+ [Troubleshoot Neo Compilation Errors](neo-troubleshooting-compilation.md)
+ [Troubleshoot Neo Inference Errors](neo-troubleshooting-inference.md)
+ [Troubleshoot Ambarella Errors](neo-troubleshooting-target-devices-ambarella.md)

# Troubleshoot Neo Compilation Errors
<a name="neo-troubleshooting-compilation"></a>

This section contains information about how to understand and prevent common compilation errors, the error messages they generate, and guidance on how to resolve these errors. 

**Topics**
+ [How to Use This Page](#neo-troubleshooting-compilation-how-to-use)
+ [Framework-Related Errors](#neo-troubleshooting-compilation-framework-related-errors)
+ [Infrastructure-Related Errors](#neo-troubleshooting-compilation-infrastructure-errors)
+ [Check your compilation log](#neo-troubleshooting-compilation-logs)

## How to Use This Page
<a name="neo-troubleshooting-compilation-how-to-use"></a>

Attempt to resolve your error by going through these sections in the following order:

1. Check that the input of your compilation job satisfies the input requirements. See [What input data shapes does SageMaker Neo expect?](neo-compilation-preparing-model.md#neo-job-compilation-expected-inputs)

1.  Check common [framework-specific errors](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html#neo-troubleshooting-compilation-framework-related-errors). 

1.  Check if your error is an [infrastructure error](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html#neo-troubleshooting-compilation-infrastructure-errors). 

1. Check your [compilation log](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html#neo-troubleshooting-compilation-logs).

## Framework-Related Errors
<a name="neo-troubleshooting-compilation-framework-related-errors"></a>

### Keras
<a name="neo-troubleshooting-compilation-framework-related-errors-keras"></a>


| Error | Solution | 
| --- | --- | 
|   `InputConfiguration: No h5 file provided in <model path>`   |   Check that your `h5` file is in the Amazon S3 URI you specified, *or* check that the [h5 file is correctly formatted](https://www.tensorflow.org/guide/keras/save_and_serialize#keras_h5_format).   | 
|   `InputConfiguration: Multiple h5 files provided, <model path>, when only one is allowed`   |  Check you are only providing one `h5` file.  | 
|   `ClientError: InputConfiguration: Unable to load provided Keras model. Error: 'sample_weight_mode'`   |  Check the Keras version you specified is supported. See, supported frameworks for [cloud instances](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-cloud.html) and [edge devices](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html).   | 
|   `ClientError: InputConfiguration: Input input has wrong shape in Input Shape dictionary. Input shapes should be provided in NCHW format. `   |   Check that your model input follows NCHW format. See [What input data shapes does SageMaker Neo expect?](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-job-compilation.html#neo-job-compilation-expected-inputs)   | 

### MXNet
<a name="neo-troubleshooting-compilation-framework-related-errors-mxnet"></a>


| Error | Solution | 
| --- | --- | 
|   `ClientError: InputConfiguration: Only one parameter file is allowed for MXNet model. Please make sure the framework you select is correct.`   |   Make sure your tarball contains only one parameter file. SageMaker Neo selects the first parameter file given for compilation.   | 

### TensorFlow
<a name="neo-troubleshooting-compilation-framework-related-errors-tensorflow"></a>


| Error | Solution | 
| --- | --- | 
|   `InputConfiguration: Exactly one .pb file is allowed for TensorFlow models.`   |  Make sure you only provide one .pb or .pbtxt file.  | 
|  `InputConfiguration: Exactly one .pb or .pbtxt file is allowed for TensorFlow models.`  |  Make sure you only provide one .pb or .pbtxt file.  | 
|   ` ClientError: InputConfiguration: TVM cannot convert <model zoo> model. Please make sure the framework you selected is correct. The following operators are not implemented: {<operator name>} `   |   Check the operator you chose is supported. See [SageMaker Neo Supported Frameworks and Operators](https://aws.amazon.com/releasenotes/sagemaker-neo-supported-frameworks-and-operators/).   | 

### PyTorch
<a name="neo-troubleshooting-compilation-framework-related-errors-pytorch"></a>


| Error | Solution | 
| --- | --- | 
|   `InputConfiguration: We are unable to extract DataInputConfig from the model due to input_config_derivation_error. Please override by providing a DataInputConfig during compilation job creation.`  |  Do either of the following: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html)  | 

## Infrastructure-Related Errors
<a name="neo-troubleshooting-compilation-infrastructure-errors"></a>


| Error | Solution | 
| --- | --- | 
|   `ClientError: InputConfiguration: S3 object does not exist. Bucket: <bucket>, Key: <bucket key>`   |  Check the Amazon S3 URI you provided.  | 
|   ` ClientError: InputConfiguration: Bucket <bucket name> is in region <region name> which is different from AWS Sagemaker service region <service region> `   |   Create an Amazon S3 bucket that is in the same region as the service.   | 
|   ` ClientError: InputConfiguration: Unable to untar input model. Please confirm the model is a tar.gz file `   |   Check that your model in Amazon S3 is compressed into a `tar.gz` file.   | 

## Check your compilation log
<a name="neo-troubleshooting-compilation-logs"></a>

1. Navigate to Amazon CloudWatch at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Select the Region in which you created the compilation job from the **Region** dropdown list in the top right.

1. In the Amazon CloudWatch navigation pane, choose **Logs**, and then select **Log groups**.

1. Search for the log group called `/aws/sagemaker/CompilationJobs`. Select the log group.

1. Search for the log stream named after the compilation job name. Select the log stream.

# Troubleshoot Neo Inference Errors
<a name="neo-troubleshooting-inference"></a>

This section contains information about how to prevent and resolve some of the common errors you might encounter when deploying a model or invoking an endpoint. This section applies to **PyTorch 1.4.0 or later** and **MXNet v1.7.0 or later**. 
+ If you defined a `model_fn` in your inference script, make sure the first inference (warm-up inference) on valid input data is done in `model_fn()`. Otherwise, the following error message may be seen on the terminal when [`Predictor.predict`](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.predict) is called: 

  ```
  An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from <users-sagemaker-endpoint> with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."                
  ```
+ Make sure that the environment variables in the following table are set. If they are not set, the following error message might show up: 

  **On the terminal:**

  ```
  An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from <users-sagemaker-endpoint> with message "{ "code": 503, "type": "InternalServerException", "message": "Prediction failed" } ".
  ```

  **In CloudWatch:**

  ```
  W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - AttributeError: 'NoneType' object has no attribute 'transform'
  ```    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-inference.html)
+ Make sure that the `MMS_DEFAULT_RESPONSE_TIMEOUT` environment variable is set to 500 or a higher value while creating the Amazon SageMaker AI model; otherwise, the following error message may be seen on the terminal: 

  ```
  An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from <users-sagemaker-endpoint> with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."
  ```

# Troubleshoot Ambarella Errors
<a name="neo-troubleshooting-target-devices-ambarella"></a>

SageMaker Neo requires models to be packaged in a compressed TAR file (`*.tar.gz`). Ambarella devices require additional files to be included within the compressed TAR file before it is sent for compilation. Include the following files within your compressed TAR file if you want to compile a model for Ambarella targets with SageMaker Neo:
+ A trained model using a framework supported by SageMaker Neo 
+ A JSON configuration file
+ Calibration images

For example, the contents of your compressed TAR file should look similar to the following example:

```
├──amba_config.json
├──calib_data
|    ├── data1
|    ├── data2
|    ├── .
|    ├── .
|    ├── .
|    └── data500
└──mobilenet_v1_1.0_0224_frozen.pb
```

The directory is configured as follows:
+ `amba_config.json`: Configuration file
+ `calib_data`: Folder containing calibration images
+ `mobilenet_v1_1.0_0224_frozen.pb`: TensorFlow model saved as a frozen graph

For information about frameworks supported by SageMaker Neo, see [Supported Frameworks](neo-supported-devices-edge-frameworks.md).

## Setting up the Configuration File
<a name="neo-troubleshooting-target-devices-ambarella-config"></a>

The configuration file provides information required by the Ambarella toolchain to compile the model. The configuration file must be saved as a JSON file and the name of the file must end with `*config.json`. The following table describes the contents of the configuration file.


| Key | Description | Example | 
| --- | --- | --- | 
| inputs | Dictionary mapping input layer names to their attributes. | <pre>{"inputs":{"data":{...},"data1":{...}}}</pre> | 
| "data" | Input layer name. Note: "data" is an example of the name you can use to label the input layer. | "data" | 
| shape | Describes the shape of the input to the model. This follows the same conventions that SageMaker Neo uses. | "shape": "1,3,224,224" | 
| filepath | Relative path to the directory containing calibration images. These can be binary or image files like JPG or PNG. | "filepath": "calib_data/" | 
| colorformat | Color format that model expects. This will be used while converting images to binary. Supported values: [RGB, BGR]. Default is RGB. | "colorformat":"RGB" | 
| mean | Mean value to be subtracted from the input. Can be a single value or a list of values. When the mean is given as a list the number of entries must match the channel dimension of the input. | "mean":128.0 | 
| scale | Scale value to be used for normalizing the input. Can be a single value or a list of values. When the scale is given as a list, the number of entries must match the channel dimension of the input. | "scale": 255.0 | 

The following is a sample configuration file: 

```
{
    "inputs": {
        "data": {
                "shape": "1, 3, 224, 224",
                "filepath": "calib_data/",
                "colorformat": "RGB",
                "mean":[128,128,128],
                "scale":[128.0,128.0,128.0]
        }
    }
}
```

## Calibration Images
<a name="neo-troubleshooting-target-devices-ambarella-calibration-images"></a>

Quantize your trained model by providing calibration images. Quantizing your model improves the performance of the CVFlow engine on an Ambarella System on a Chip (SoC). The Ambarella toolchain uses the calibration images to determine how each layer in the model should be quantized to achieve optimal performance and accuracy. Each layer is quantized independently to INT8 or INT16 formats. The final model has a mix of INT8 and INT16 layers after quantization.

**How many images should you use?**

We recommend that you include between 100 and 200 images that are representative of the types of scenes the model is expected to handle. The model compilation time increases linearly with the number of calibration images in the input file.

**What are the recommended image formats?**

Calibration images can be in a raw binary format or image formats such as JPG and PNG.

Your calibration folder can contain a mixture of images and binary files. If the calibration folder contains both images and binary files, the toolchain first converts the images to binary files. Once the conversion is complete, it uses the newly generated binary files along with the binary files that were originally in the folder.

**Can I convert the images into binary format first?**

Yes. You can convert the images to the binary format with open-source packages such as [OpenCV](https://opencv.org/) or [PIL](https://python-pillow.org/). Crop and resize the images so they satisfy the input layer of your trained model.



## Mean and Scale
<a name="neo-troubleshooting-target-devices-ambarella-mean-scale"></a>

You can specify mean and scale pre-processing options to the Ambarella toolchain. These operations are embedded into the network and are applied during inference to each input. If you specify the mean or scale, do not provide pre-processed data. More specifically, do not provide data from which you have already subtracted the mean or to which you have already applied scaling.
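The following sketch illustrates one plausible reading of these options: subtract the mean, then divide by the scale, per channel. This semantics is an assumption for illustration; verify against your toolchain documentation. With the sample configuration's `mean=[128,128,128]` and `scale=[128.0,128.0,128.0]`, pixel values in [0, 255] map to roughly [-1, 1].

```python
def normalize_pixel(value, mean, scale):
    # Assumed semantics: normalized = (x - mean) / scale
    return (value - mean) / scale

def normalize_rgb(pixel, mean=(128, 128, 128), scale=(128.0, 128.0, 128.0)):
    # Per-channel form, matching the sample configuration's lists
    return tuple((v - m) / s for v, m, s in zip(pixel, mean, scale))
```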

## Check your compilation log
<a name="neo-troubleshooting-target-devices-ambarella-compilation"></a>

For information on checking the compilation log for Ambarella devices, see [Check your compilation log](neo-troubleshooting-compilation.md#neo-troubleshooting-compilation-logs).

# Stateful sessions with Amazon SageMaker AI models
<a name="stateful-sessions"></a>

When you send requests to an Amazon SageMaker AI inference endpoint, you can choose to route the requests to a *stateful session*. During a stateful session, you send multiple inference requests to the same ML instance, and the instance facilitates the session.

Normally, when you invoke an inference endpoint, Amazon SageMaker AI routes your request to any one ML instance among the multiple instances that the endpoint hosts. This routing behavior helps minimize latency by evenly distributing your inference traffic. However, one outcome of the routing behavior is that you can't predict which instance will serve your request. 

This unpredictability is a limitation if you intend to send your request to a *stateful model*. A stateful model has a container that caches the context data that it receives from inference requests. Because the data is cached, you can interact with the container by sending multiple requests, and with each request, you don't need to include the full context of the interaction. Instead, the model draws from the cached context data to inform its prediction. 

Stateful models are ideal when the context data for the interaction is very large, such as when it includes the following:
+ Large text files
+ Long chat histories 
+ Multimedia data (images, video, and audio) for multimodal models

In these cases, if you pass the full context with every prompt, the network latency of your requests increases, and the responsiveness of your application diminishes. 

Before your inference endpoint can support a stateful session, it must host a stateful model. The implementation of the stateful model is owned by you. Amazon SageMaker AI makes it possible for you to route your requests to a stateful session, but it doesn't provide stateful models that you can deploy and use. 

For an example notebook and model container that demonstrates how stateful interactions are implemented, see [Example implementation](#stateful-sessions-example-notebook).

For information about implementing stateful models with TorchServe, see [Stateful Inference](https://github.com/pytorch/serve/tree/master/examples/stateful/sequence_continuous_batching) in the TorchServe GitHub repository. 

## How stateful sessions work
<a name="stateful-sessions-running"></a>

During a stateful session, your application interacts with your model container in the following ways. 

**To start a stateful session**

1. To start a session with a stateful model that's hosted by Amazon SageMaker AI, your client sends an [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) request with the SageMaker API. For the `SessionId` request parameter, the client tells SageMaker AI to start a new session by specifying the value `NEW_SESSION`. In the request payload, the client also tells the container to start a new session. The syntax of this statement varies based on your container implementation and how your container code handles the request payload.

   The following example starts a new session by using the SDK for Python (Boto3):

   ```
   import boto3
   import json
   
   payload = {
       "requestType": "NEW_SESSION"
   }
   payload = json.dumps(payload)
   
   smr = boto3.client(
       'sagemaker-runtime',
       region_name="region_name",
       endpoint_url="endpoint_url")
   
   create_session_response = smr.invoke_endpoint(
       EndpointName="endpoint_name",
       Body=payload,
       ContentType="application/json",
       SessionId="NEW_SESSION")
   ```

1. Your model container handles your client's request by starting a new session. For the session, it caches the data that the client sends in the request payload. It also creates a session ID, and it sets a time to live (TTL) timestamp. This timestamp indicates when the session expires. The container must provide the session ID and timestamp to Amazon SageMaker AI by setting the following HTTP header in the response:

   ```
   X-Amzn-SageMaker-Session-Id: session_id; Expires=yyyy-mm-ddThh:mm:ssZ
   ```

1. In the response to the `InvokeEndpoint` request, Amazon SageMaker AI provides the session ID and TTL timestamp for the `NewSessionId` response parameter.

   The following example extracts the session ID from the `invoke_endpoint` response:

   ```
   session_id = create_session_response['ResponseMetadata']['HTTPHeaders']['x-amzn-sagemaker-new-session-id'].split(';')[0]
   ```
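The container-side behavior in these steps is owned by your implementation. The following is a minimal, hypothetical sketch of the session bookkeeping a stateful container might perform: generating a session ID, setting a TTL, and producing the response headers that SageMaker AI expects. The class and method names are illustrative and are not part of any SageMaker AI SDK.

```python
import uuid
from datetime import datetime, timedelta, timezone

class SessionStore:
    """Illustrative session bookkeeping for a stateful model container."""

    def __init__(self, ttl_minutes=15):
        self.ttl_minutes = ttl_minutes
        self.sessions = {}

    def start(self, context_data):
        """Cache the request context, then build the session-start header."""
        session_id = str(uuid.uuid4())
        expires = datetime.now(timezone.utc) + timedelta(minutes=self.ttl_minutes)
        self.sessions[session_id] = {"context": context_data, "expires": expires}
        # Header the container must set so SageMaker AI can route the session
        header = (f"X-Amzn-SageMaker-Session-Id: {session_id}; "
                  f"Expires={expires.strftime('%Y-%m-%dT%H:%M:%SZ')}")
        return session_id, header

    def close(self, session_id):
        """Drop the cached context and build the session-close header."""
        self.sessions.pop(session_id, None)
        return f"X-Amzn-SageMaker-Closed-Session-Id: {session_id}"
```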

**To continue a stateful session**
+ To use the same session for a subsequent inference request, your client sends another `InvokeEndpoint` request. For the `SessionId` request parameter, it specifies the ID of the session. With this ID, SageMaker AI routes the request to the same ML instance where the session was started. Because your container has already cached the original request payload, your client doesn't need to pass the same context data that was in the original request.

  The following example continues a session by passing the session ID with the `SessionId` request parameter:

  ```
  smr.invoke_endpoint(
      EndpointName="endpoint_name",
      Body=payload,
      ContentType="application/json",
      SessionId=session_id)
  ```

**To close a stateful session**

1. To close a session, your client sends a final `InvokeEndpoint` request. For the `SessionId` request parameter, the client provides the ID of the session. In the payload in the request body, your client states that the container should close the session. The syntax of this statement varies based on your container implementation.

   The following example closes a session:

   ```
   payload = {
       "requestType":"CLOSE"
   }
   payload = json.dumps(payload)
   
   closeSessionResponse = smr.invoke_endpoint(
       EndpointName="endpoint_name",
       Body=payload,
       ContentType="application/json",
       SessionId=session_id)
   ```

1. When it closes the session, the container returns the session ID to SageMaker AI by setting the following HTTP header in the response:

   ```
   X-Amzn-SageMaker-Closed-Session-Id: session_id
   ```

1. In the response to the `InvokeEndpoint` request from the client, SageMaker AI provides the session ID for the `ClosedSessionId` response parameter.

   The following example extracts the closed session ID from the `invoke_endpoint` response:

   ```
   closed_session_id = closeSessionResponse['ResponseMetadata']['HTTPHeaders']['x-amzn-sagemaker-closed-session-id'].split(';')[0]
   ```

## Example implementation
<a name="stateful-sessions-example-notebook"></a>

The following example notebook demonstrates how to implement the container for a stateful model. It also demonstrates how a client application starts, continues, and closes a stateful session.

[LLaVA stateful inference with SageMaker AI](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/LLava/torchserve/workspace/llava_stateful_deploy_infer.ipynb)

The notebook uses the [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA/tree/main) model, which accepts images and text prompts. The notebook uploads an image to the model, and then it asks questions about the image without having to resend the image for every request. The model container uses the TorchServe framework. It caches the image data in GPU memory.

# Best practices
<a name="best-practices"></a>

The following topics provide guidance on best practices for deploying machine learning models in Amazon SageMaker AI.

**Topics**
+ [Best practices for deploying models on SageMaker AI Hosting Services](deployment-best-practices.md)
+ [Monitor Security Best Practices](monitor-sec-best-practices.md)
+ [Low latency real-time inference with AWS PrivateLink](realtime-endpoints-privatelink.md)
+ [Migrate inference workload from x86 to AWS Graviton](realtime-endpoints-graviton.md)
+ [Troubleshoot Amazon SageMaker AI model deployments](deploy-model-troubleshoot.md)
+ [Inference cost optimization best practices](inference-cost-optimization.md)
+ [Best practices to minimize interruptions during GPU driver upgrades](inference-gpu-drivers.md)
+ [Best practices for endpoint security and health with Amazon SageMaker AI](best-practice-endpoint-security.md)
+ [Updating inference containers to comply with the NVIDIA Container Toolkit](container-nvidia-compliance.md)

# Best practices for deploying models on SageMaker AI Hosting Services
<a name="deployment-best-practices"></a>

When hosting models using SageMaker AI hosting services, consider the following:
+ Typically, a client application sends requests to the SageMaker AI HTTPS endpoint to obtain inferences from a deployed model. You can also send requests to this endpoint from your Jupyter notebook during testing.
+ You can deploy a model trained with SageMaker AI to your own deployment target. To do that, you need to know the algorithm-specific format of the model artifacts that were generated by model training. For more information about output formats, see the section corresponding to the algorithm you are using in [Common Data Formats for Training](cdf-training.md). 
+ You can deploy multiple variants of a model to the same SageMaker AI HTTPS endpoint. This is useful for testing variations of a model in production. For example, suppose that you've deployed a model into production and want to test a variation of it by directing a small amount of traffic, say 5%, to the new model. To do this, create an endpoint configuration that describes both variants of the model. You specify the production variants in your request to `CreateEndpointConfig`. For more information, see [ProductionVariant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). 
+ You can configure a `ProductionVariant` to use Application Auto Scaling. For information about configuring automatic scaling, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md).
+ You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML Compute instance configurations of existing model variants, or change the distribution of traffic among model variants. To modify an endpoint, you provide a new endpoint configuration. SageMaker AI implements the changes without any downtime. For more information, see [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) and [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html). 
+ Changing or deleting model artifacts or changing inference code after deploying a model produces unpredictable results. If you need to change or delete model artifacts or change inference code, modify the endpoint by providing a new endpoint configuration. Once you provide the new endpoint configuration, you can change or delete the model artifacts corresponding to the old endpoint configuration.
+ If you want to get inferences on entire datasets, consider using batch transform as an alternative to hosting services. For information, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md). 
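The 5% traffic-splitting scenario described above can be sketched with a weighted endpoint configuration. In this hypothetical helper, the function and variant names are illustrative, `sm` is assumed to be a Boto3 SageMaker client, and both models must already exist; each variant's traffic share is its weight divided by the sum of all weights.

```python
def create_ab_endpoint_config(sm, config_name, prod_model_name, test_model_name,
                              instance_type="ml.m5.xlarge"):
    """Route roughly 95% of traffic to the existing model and 5% to the new one."""
    return sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "Production",
                "ModelName": prod_model_name,
                "InitialInstanceCount": 2,
                "InstanceType": instance_type,
                "InitialVariantWeight": 0.95,
            },
            {
                "VariantName": "Challenger",
                "ModelName": test_model_name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
                "InitialVariantWeight": 0.05,
            },
        ],
    )
```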

## Deploy Multiple Instances Across Availability Zones
<a name="deployment-best-practices-availability-zones"></a>

**Create robust endpoints when hosting your model.** SageMaker AI endpoints can help protect your application from [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) outages and instance failures. If an outage occurs or an instance fails, SageMaker AI automatically attempts to distribute your instances across Availability Zones. For this reason, we strongly recommend that you deploy multiple instances for each production endpoint. 

If you are using an [Amazon Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html), configure the VPC with at least two [subnets](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html#SageMaker-Type-VpcConfig-Subnets), each in a different Availability Zone. If an outage occurs or an instance fails, Amazon SageMaker AI automatically attempts to distribute your instances across Availability Zones. 

In general, to achieve more reliable performance, use a greater number of smaller [instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html) in different Availability Zones to host your endpoints.

**Deploy inference components for high availability.** In addition to the above recommendation for instance numbers, to achieve 99.95% availability, ensure that your inference components are configured to have more than two copies. In addition, in your managed auto scaling policy, set the minimum number of instances to two as well.
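As a sketch of the copy recommendation above, the following hypothetical helper creates an inference component with a configurable copy count using the `CreateInferenceComponent` API. The function name and the compute resource sizing are illustrative, and `sm` is assumed to be a Boto3 SageMaker client.

```python
def create_ha_inference_component(sm, component_name, endpoint_name, variant_name,
                                  model_name, copy_count):
    """Create an inference component with multiple copies for availability."""
    return sm.create_inference_component(
        InferenceComponentName=component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfCpuCoresRequired": 1.0,  # illustrative sizing
                "MinMemoryRequiredInMb": 1024,
            },
        },
        RuntimeConfig={"CopyCount": copy_count},
    )
```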

# Monitor Security Best Practices
<a name="monitor-sec-best-practices"></a>

Monitor your usage of SageMaker AI as it relates to security best practices by using [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html). Security Hub CSPM uses security controls to evaluate resource configurations and security standards to help you comply with various compliance frameworks. For more information about using Security Hub CSPM to evaluate SageMaker AI resources, see [Amazon SageMaker AI controls](https://docs.aws.amazon.com/securityhub/latest/userguide/sagemaker-controls.html) in the *AWS Security Hub CSPM User Guide*.

# Low latency real-time inference with AWS PrivateLink
<a name="realtime-endpoints-privatelink"></a>

 Amazon SageMaker AI provides low latency for real-time inferences while maintaining high availability and resiliency using multi-AZ deployment. The application latency is made up of two primary components: infrastructure or overhead latency and model inference latency. Reduction of overhead latency opens up new possibilities such as deploying more complex, deep, and accurate models or splitting monolithic applications into scalable and maintainable microservice modules. You can reduce the latency for real-time inferences with SageMaker AI using an AWS PrivateLink deployment. With AWS PrivateLink, you can privately access all SageMaker API operations from your Virtual Private Cloud (VPC) in a scalable manner by using interface VPC endpoints. An interface VPC endpoint is an elastic network interface in your subnet with private IP addresses that serves as an entry point for all SageMaker API calls.

By default, a SageMaker AI endpoint with 2 or more instances is deployed in at least 2 AWS Availability Zones (AZs) and instances in any AZ can process invocations. This results in one or more AZ “hops” that contribute to the overhead latency. An AWS PrivateLink deployment with the `privateDNSEnabled` option set as `true` alleviates this by achieving two objectives:
+ It keeps all inference traffic within your VPC.
+ It keeps invocation traffic in the same AZ as the client that originated it when using SageMaker Runtime. This avoids “hops” between AZs, reducing the overhead latency.

The following sections of this guide demonstrate how you can reduce the latency for real-time inferences with AWS PrivateLink deployment.

**Topics**
+ [Deploy AWS PrivateLink](#deploy-privatelink)
+ [Deploy SageMaker AI endpoint in a VPC](#deploy-sagemaker-inference-endpoint)
+ [Invoke the SageMaker AI endpoint](#invoke-sagemaker-inference-endpoint)

## Deploy AWS PrivateLink
<a name="deploy-privatelink"></a>

To deploy AWS PrivateLink, first create an interface endpoint for the VPC from which you connect to the SageMaker AI endpoints. Follow the steps in [Access an AWS service using an interface VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html) to create the interface endpoint. While creating the endpoint, select the following settings in the console interface:
+ Select the **Enable DNS name** checkbox under **Additional Settings**.
+ Select the appropriate security groups and the subnets to be used with the SageMaker AI endpoints.

Also make sure that the VPC has DNS hostnames turned on. For more information on how to change DNS attributes for your VPC, see [View and update DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-updating).
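The console steps above can also be sketched with the `CreateVpcEndpoint` API. In the following hypothetical helper, the function name is illustrative, `ec2` is assumed to be a Boto3 EC2 client, and `PrivateDnsEnabled=True` corresponds to the **Enable DNS name** checkbox.

```python
def create_sagemaker_runtime_interface_endpoint(ec2, vpc_id, subnet_ids,
                                                security_group_ids, region):
    """Create an interface VPC endpoint for SageMaker Runtime with private DNS."""
    return ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.sagemaker.runtime",
        SubnetIds=subnet_ids,            # at least two subnets in different AZs
        SecurityGroupIds=security_group_ids,
        PrivateDnsEnabled=True,          # resolves the default endpoint DNS privately
    )
```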

## Deploy SageMaker AI endpoint in a VPC
<a name="deploy-sagemaker-inference-endpoint"></a>

To achieve low overhead latency, create a SageMaker AI endpoint using the same subnets that you specified when deploying AWS PrivateLink. These subnets should match the AZs of your client application, as shown in the following code snippet.

```
model_name = '<the-name-of-your-model>'

vpc = 'vpc-0123456789abcdef0'
subnet_a = 'subnet-0123456789abcdef0'
subnet_b = 'subnet-0123456789abcdef1'
security_group = 'sg-0123456789abcdef0'

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url
    },
    VpcConfig = {
        'SecurityGroupIds': [security_group],
        'Subnets': [subnet_a, subnet_b],
    },
)
```

The preceding code snippet assumes that you have followed the steps in [Before you begin](realtime-endpoints-deploy-models.md#deploy-prereqs).

## Invoke the SageMaker AI endpoint
<a name="invoke-sagemaker-inference-endpoint"></a>

Finally, specify the SageMaker Runtime client and invoke the SageMaker AI endpoint as shown in the following code snippet.

```
endpoint_name = '<endpoint-name>'
  
runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                          ContentType='text/csv', 
                                          Body=payload)
```

For more information on endpoint configuration, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).

# Migrate inference workload from x86 to AWS Graviton
<a name="realtime-endpoints-graviton"></a>

 [AWS Graviton](https://aws.amazon.com/ec2/graviton/) is a series of ARM-based processors designed by AWS. They are more energy efficient than x86-based processors and offer a compelling price-performance ratio. Amazon SageMaker AI offers Graviton-based instances so that you can take advantage of these advanced processors for your inference needs. 

 You can migrate your existing inference workloads from x86-based instances to Graviton-based instances, by using either ARM compatible container images or multi-architecture container images. This guide assumes that you are either using [AWS Deep Learning container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), or your own ARM compatible container images. For more information on building your own images, check [Building your image](https://github.com/aws/deep-learning-containers#building-your-image). 

 At a high level, migrating inference workload from x86-based instances to Graviton-based instances is a four-step process: 

1. Push container images to Amazon Elastic Container Registry (Amazon ECR), an AWS managed container registry.

1. Create a SageMaker AI Model.

1. Create an endpoint configuration.

1. Create an endpoint.

 The following sections of this guide provide more details regarding the above steps. Replace the *user placeholder text* in the code examples with your own information. 

**Topics**
+ [Push container images to Amazon ECR](#realtime-endpoints-graviton-ecr)
+ [Create a SageMaker AI Model](#realtime-endpoints-graviton-model)
+ [Create an endpoint configuration](#realtime-endpoints-graviton-epc)
+ [Create an endpoint](#realtime-endpoints-graviton-ep)

## Push container images to Amazon ECR
<a name="realtime-endpoints-graviton-ecr"></a>

 You can push your container images to Amazon ECR with the AWS CLI. When using an ARM compatible image, verify that it supports ARM architecture: 

```
docker inspect deep-learning-container-uri
```

 The response `"Architecture": "arm64"` indicates that the image supports ARM architecture. You can push it to Amazon ECR with the `docker push` command. For more information, check [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html). 

 Multi-architecture container images are fundamentally a set of container images supporting different architectures or operating systems, that you can refer to by a common manifest name. If you are using multi-architecture container images, then in addition to pushing the images to Amazon ECR, you will also have to push a manifest list to Amazon ECR. A manifest list allows for the nested inclusion of other image manifests, where each included image is specified by architecture, operating system and other platform attributes. The following example creates a manifest list, and pushes it to Amazon ECR. 

1. Create a manifest list.

   ```
   docker manifest create aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:amd64 \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:arm64
   ```

1.  Annotate the manifest list, so that it correctly identifies which image is for which architecture. 

   ```
   docker manifest annotate --arch arm64 aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository \
     aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository:arm64
   ```

1. Push the manifest.

   ```
   docker manifest push aws-account-id.dkr.ecr.aws-region.amazonaws.com/my-repository
   ```

 For more information on creating and pushing manifest lists to Amazon ECR, check [Introducing multi-architecture container images for Amazon ECR](https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/), and [Pushing a multi-architecture image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-multi-architecture-image.html). 

## Create a SageMaker AI Model
<a name="realtime-endpoints-graviton-model"></a>

 Create a SageMaker AI Model by calling the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. 

```
import boto3
from sagemaker import get_execution_role


aws_region = "aws-region"
sagemaker_client = boto3.client("sagemaker", region_name=aws_region)

role = get_execution_role()

sagemaker_client.create_model(
    ModelName = "model-name",
    PrimaryContainer = {
        "Image": "deep-learning-container-uri",
        "ModelDataUrl": "model-s3-location",
        "Environment": {
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "inference-script-s3-location",
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_REGION": aws_region,
        }
    },
    ExecutionRoleArn = role
)
```

## Create an endpoint configuration
<a name="realtime-endpoints-graviton-epc"></a>

 Create an endpoint configuration by calling the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. For a list of Graviton-based instances, check [Compute optimized instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html). 

```
sagemaker_client.create_endpoint_config(
    EndpointConfigName = "endpoint-config-name",
    ProductionVariants = [
        {
            "VariantName": "variant-name",
            "ModelName": "model-name",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.c7g.xlarge", # Graviton-based instance
       }
    ]
)
```

## Create an endpoint
<a name="realtime-endpoints-graviton-ep"></a>

 Create an endpoint by calling the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API. 

```
sagemaker_client.create_endpoint(
    EndpointName = "endpoint-name",
    EndpointConfigName = "endpoint-config-name"
)
```

# Troubleshoot Amazon SageMaker AI model deployments
<a name="deploy-model-troubleshoot"></a>

If you encounter an issue when deploying machine learning models in Amazon SageMaker AI, see the following guidance.

**Topics**
+ [Detection Errors in the Active CPU Count](#deploy-model-troubleshoot-jvms)
+ [Issues with deploying a model.tar.gz file](#deploy-model-troubleshoot-tarballs)
+ [Primary container did not pass ping health checks](#deploy-model-troubleshoot-ping)

## Detection Errors in the Active CPU Count
<a name="deploy-model-troubleshoot-jvms"></a>

If you deploy a SageMaker AI model with a Linux Java Virtual Machine (JVM), you might encounter detection errors that prevent using available CPU resources. This issue affects some JVMs that support Java 8 and Java 9, and most that support Java 10 and Java 11. These JVMs implement a mechanism that detects and handles the CPU count and the maximum memory available when running a model in a Docker container, and, more generally, within Linux `taskset` commands or control groups (cgroups). SageMaker AI deployments take advantage of some of the settings that the JVM uses for managing these resources. Currently, this causes the container to incorrectly detect the number of available CPUs. 

SageMaker AI doesn't limit access to CPUs on an instance. However, the JVM might detect the CPU count as `1` when more CPUs are available for the container. As a result, the JVM adjusts all of its internal settings to run as if only `1` CPU core is available. These settings affect garbage collection, locks, compiler threads, and other JVM internals that negatively affect the concurrency, throughput, and latency of the container.

For an example of the misdetection, in a container configured for SageMaker AI that is deployed with a JVM based on Java 8 update 191 (`1.8.0_191`) and that has four available CPUs on the instance, run the following command to start your JVM:

```
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version
```

This generates the following output:

```
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: determined by OSContainer: 1
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
```

Many of the JVMs affected by this issue have an option to disable this behavior and reestablish full access to all of the CPUs on the instance. Disable the unwanted behavior and establish full access to all instance CPUs by including the `-XX:-UseContainerSupport` parameter when starting Java applications. For example, run the `java` command to start your JVM as follows:

```
java -XX:-UseContainerSupport -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus -version
```

This generates the following output:

```
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
active_processor_count: sched_getaffinity processor count: 4
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
```

Check whether the JVM used in your container supports the `-XX:-UseContainerSupport` parameter. If it does, always pass the parameter when you start your JVM. This provides access to all of the CPUs in your instances. 

You might also encounter this issue when using a JVM indirectly in SageMaker AI containers, for example, when using a JVM to support SparkML Scala. The `-XX:-UseContainerSupport` parameter also affects the output returned by the Java `Runtime.getRuntime().availableProcessors()` API. 

## Issues with deploying a model.tar.gz file
<a name="deploy-model-troubleshoot-tarballs"></a>

When you deploy a model using a `model.tar.gz` file, the model tarball must not include any symlinks. Symlinks cause the model creation to fail. We also recommend that you exclude any unnecessary files from the tarball.
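One way to avoid this problem is to package the tarball with `tar --dereference`, which stores symlinks as regular files, and then list the archive to confirm no symlink entries remain. The following is a minimal sketch; the directory layout and file names are hypothetical:

```shell
# Build a sample model directory (hypothetical layout), then package it with
# --dereference so any symlinks are stored as regular files in the archive.
mkdir -p model_dir
echo "dummy weights" > model_dir/model.pth
ln -sf model.pth model_dir/latest.pth   # a symlink like this would fail model creation
tar --dereference -czf model.tar.gz -C model_dir .

# Verify the archive: entries whose mode string starts with 'l' are symlinks.
if tar -tvf model.tar.gz | grep -q '^l'; then
    echo "symlinks found: repackage before deploying"
else
    echo "no symlinks in archive"
fi
```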

## Primary container did not pass ping health checks
<a name="deploy-model-troubleshoot-ping"></a>

If your primary container fails ping health checks with the following error message, it indicates an issue with your container or script:

```
The primary container for production variant beta did not pass the ping health check. Please check CloudWatch Logs logs for this endpoint.
```

To troubleshoot this issue, check the logs in Amazon CloudWatch Logs for the endpoint in question to see whether any errors are preventing the container from responding to `/ping` or `/invocations`. The logs may include an error message that points to the root cause. After you identify the error and the reason for the failure, resolve it.

It is also good practice to test the model deployment locally before creating an endpoint.
+ Use local mode in the SageMaker AI Python SDK to imitate the hosted environment by deploying the model to a local endpoint. For more information, see [Local Mode](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode).
+ Use standard Docker commands to test that the container responds to `/ping` and `/invocations`. For more information, see [local_test](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container/local_test).
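A quick local smoke test with plain Docker might look like the following sketch. The image name, port mapping, and request body are placeholder assumptions for your own container; SageMaker AI containers serve on port 8080 and must respond to `GET /ping` and `POST /invocations`:

```shell
# Start the container locally the same way SageMaker AI does (hypothetical image name)
docker run -d --rm -p 8080:8080 --name sm-local my-inference-image serve

# The /ping route must return HTTP 200 when the container is healthy
curl -fsS http://localhost:8080/ping

# The /invocations route must accept inference requests (sample JSON payload)
curl -fsS -X POST -H "Content-Type: application/json" \
     -d '{"instances": [[1.0, 2.0]]}' \
     http://localhost:8080/invocations

docker stop sm-local
```

If either `curl` command fails, inspect `docker logs sm-local` for the same errors you would otherwise find in CloudWatch Logs.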

# Inference cost optimization best practices
<a name="inference-cost-optimization"></a>

The following content provides techniques and considerations for optimizing the cost of endpoints. You can use these recommendations to optimize the cost for both new and existing endpoints.

## Best practices
<a name="inference-cost-optimization-list"></a>

To optimize your SageMaker AI Inference costs, follow these best practices.

### Pick the best inference option for the job.
<a name="collapsible-1"></a>

SageMaker AI offers four different inference options. You may be able to save on costs by picking the inference option that best matches your workload.
+ Use [real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) for low latency workloads with predictable traffic patterns that need to have consistent latency characteristics and are always available. You pay for using the instance.
+ Use [serverless inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) for synchronous workloads that have a spiky traffic pattern and can accept variations in the p99 latency. Serverless inference automatically scales to meet your workload traffic so you don’t pay for any idle resources. You only pay for the duration of the inference request. The same model and containers can be used with both real-time and serverless inference so you can switch between these two modes if your needs change.
+ Use [asynchronous inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) for asynchronous workloads that process up to 1 GB of data (such as text corpus, image, video, and audio) that are latency insensitive and cost sensitive. With asynchronous inference, you can control costs by specifying a fixed number of instances for the optimal processing rate instead of provisioning for the peak. You can also scale down to zero to save additional costs.
+ Use [batch inference](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) for workloads for which you need inference for a large set of data for processes that happen offline (that is, you don’t need a persistent endpoint). You pay for the instance for the duration of the batch inference job.

### Opt in to a SageMaker AI Savings Plan.
<a name="collapsible-2"></a>
+ If you have a consistent usage level across all SageMaker AI services, you can opt in to a SageMaker AI Savings Plan to help reduce your costs by up to 64%.
+ [Amazon SageMaker AI Savings Plans](https://aws.amazon.com/savingsplans/ml-pricing/) provide a flexible pricing model for Amazon SageMaker AI, in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a one-year or three-year term. These plans automatically apply to eligible SageMaker AI ML instance usage including SageMaker Studio Classic Notebook, SageMaker On-Demand Notebook, SageMaker Processing, SageMaker Data Wrangler, SageMaker Training, SageMaker Real-Time Inference, and SageMaker Batch Transform, regardless of instance family, size, or Region. For example, you can change usage from a CPU ml.c5.xlarge instance running in US East (Ohio) to an ml.inf1 instance in US West (Oregon) for inference workloads at any time and automatically continue to pay the Savings Plans price.

### Optimize your model to run better.
<a name="collapsible-3"></a>
+ Unoptimized models can lead to longer run times and use more resources. You may choose to use more or bigger instances to improve performance; however, this leads to higher costs.
+ By optimizing your models to be more performant, you may be able to lower costs by using fewer or smaller instances while keeping the same or better performance characteristics. You can use [SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) with SageMaker AI Inference to automatically optimize models. For more details and samples, see [Model performance optimization with SageMaker Neo](neo.md).

### Use the most optimal instance type and size for real-time inference.
<a name="collapsible-4"></a>
+ SageMaker Inference has over 70 instance types and sizes that can be used to deploy ML models including AWS Inferentia and Graviton chipsets that are optimized for ML. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest cost for your models.
+ By using [Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html), you can quickly compare different instances to understand the performance of the model and the costs. With these results, you can choose the instance to deploy with the best return on investment.

### Improve efficiency and costs by combining multiple endpoints into a single endpoint for real-time inference.
<a name="collapsible-5"></a>
+ Costs can quickly add up when you deploy multiple endpoints, especially if the endpoints don’t fully utilize the underlying instances. To understand whether an instance is under-utilized, check the utilization metrics (CPU, GPU, and so on) in Amazon CloudWatch for your instances. If you have multiple under-utilized endpoints, you can combine the models or containers from those endpoints into a single endpoint.
+ Using [Multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) (MME) or [Multi-container endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html) (MCE), you can deploy multiple ML models or containers in a single endpoint to share the instance across multiple models or containers and improve your return on investment. To learn more, see this [Save on inference costs by using Amazon SageMaker AI multi-model endpoints](https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/) or [Deploy multiple serving containers on a single instance using Amazon SageMaker AI multi-container endpoints](https://aws.amazon.com/blogs/machine-learning/deploy-multiple-serving-containers-on-a-single-instance-using-amazon-sagemaker-multi-container-endpoints/) on the AWS Machine Learning blog.
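For example, with a multi-model endpoint, each request names the model artifact to run. The following is a hedged sketch with the AWS CLI; the endpoint name, model artifact name, and payload are placeholders:

```shell
# Invoke one specific model hosted on a multi-model endpoint; TargetModel
# selects the artifact within the endpoint's configured S3 prefix.
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name my-mme-endpoint \
    --target-model model-a.tar.gz \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    --body '{"instances": [[1.0, 2.0]]}' \
    response.json
```

Because the instance is shared, models that are invoked infrequently cost far less than they would on dedicated endpoints.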

### Set up autoscaling to match your workload requirements for real-time and asynchronous inference.
<a name="collapsible-6"></a>
+ Without autoscaling, you need to provision for peak traffic or risk model unavailability. Unless the traffic to your model is steady throughout the day, there will be excess unused capacity. This leads to low utilization and wasted resources.
+ [Autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) is an out-of-the-box feature that monitors your workloads and dynamically adjusts the capacity to maintain steady and predictable performance at the lowest possible cost. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances, helping you reduce your compute cost. To learn more, see [Configuring autoscaling inference endpoints in Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/) on the AWS Machine Learning blog.
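As a sketch, a target-tracking policy on the `SageMakerVariantInvocationsPerInstance` metric can be attached with the AWS CLI as follows. The endpoint and variant names are placeholders, and the target value, capacity limits, and cooldowns should be tuned for your workload:

```shell
# Register the endpoint variant as a scalable target (hypothetical names)
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/AllTraffic \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 1 \
    --max-capacity 4

# Scale out when average invocations per instance exceed the target value
aws application-autoscaling put-scaling-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/AllTraffic \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-name invocations-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300
    }'
```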

# Best practices to minimize interruptions during GPU driver upgrades
<a name="inference-gpu-drivers"></a>

SageMaker AI Model Deployment upgrades the GPU drivers on the ML instances for the Real-time, Batch, and Asynchronous Inference options over time to provide customers access to improvements from the driver providers. The following section shows the GPU driver version supported for each inference option. Different driver versions can change how your model interacts with the GPUs. The sections below describe strategies to help you understand how your application works with different driver versions.

## Current versions and supported instance families
<a name="inference-gpu-drivers-versions"></a>

Amazon SageMaker AI Inference supports the following drivers and instance families:

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-gpu-drivers.html)

## Troubleshoot your model container with GPU capabilities
<a name="inference-gpu-drivers-troubleshoot"></a>

If you encounter an issue when running your GPU workload, see the following guidance:

### GPU card detection failure or NVIDIA initialization error
<a name="collapsible-section-0"></a>

Run the `nvidia-smi` (NVIDIA System Management Interface) command from within the Docker container. If there is a GPU card detection failure or NVIDIA initialization error, the command returns the following error message:

```
Failed to initialize NVML: Driver/library version mismatch
```

Based on your use case, follow these best practices to resolve the failure or error:
+ Follow the best practice recommendation described in the [If you bring your own (BYO) model containers](#collapsible-byoc) dropdown.
+ Follow the best practice recommendation described in the [If you use a CUDA compatibility layer](#collapsible-cuda-compat) dropdown.

Refer to the [NVIDIA System Management Interface page](https://developer.nvidia.com/nvidia-system-management-interface) on the NVIDIA website for more information.
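To check what the container actually sees, you can run `nvidia-smi` inside it locally. The image name below is a placeholder, and the `--gpus all` flag assumes the NVIDIA Container Toolkit is installed on the host:

```shell
# Print the GPU driver and CUDA versions visible from inside the container
docker run --rm --gpus all my-inference-image nvidia-smi
```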

### `CannotStartContainerError`
<a name="collapsible-section-cannot-start-container"></a>

 If your GPU instance uses NVIDIA driver versions that are not compatible with the CUDA version in the Docker container, then deploying an endpoint will fail with the following error message: 

```
 Failure reason CannotStartContainerError. Please ensure the model container for variant <variant_name> starts correctly when invoked with 'docker run <image> serve'
```

Based on your use case, follow these best practices to resolve the failure or error:
+ Follow the best practice recommendation described in the [The driver my container depends on is greater than the version on the ML GPU instances](#collapsible-driver-dependency-higher) dropdown.
+ Follow the best practice recommendation described in the [If you use a CUDA compatibility layer](#collapsible-cuda-compat) dropdown.

## Best practices for working with mismatched driver versions
<a name="inference-gpu-drivers-cuda-toolkit-updates"></a>

The following sections describe how to work with mismatched GPU driver versions:

### The driver my container depends on is lower than the version on the ML GPU instance
<a name="collapsible-driver-dependency-lower"></a>

No action is required. NVIDIA provides backwards compatibility.

### The driver my container depends on is greater than the version on the ML GPU instances
<a name="collapsible-driver-dependency-higher"></a>

If it is a minor version difference, no action is required. NVIDIA provides minor version forward compatibility.

If it is a major version difference, the CUDA Compatibility Package will need to be installed. Please refer to [CUDA Compatibility Package](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) in the NVIDIA documentation.

**Important**  
The CUDA Compatibility Package is not backwards compatible so it needs to be disabled if the driver version on the instance is greater than the CUDA Compatibility Package version.

### If you bring your own (BYO) model containers
<a name="collapsible-byoc"></a>

Ensure that no NVIDIA driver packages are bundled in the image, because they could conflict with the NVIDIA driver version on the host.

### If you use a CUDA compatibility layer
<a name="collapsible-cuda-compat"></a>

To verify whether the platform NVIDIA driver version supports the CUDA Compatibility Package version installed in the model container, see the [CUDA documentation](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package). If the platform NVIDIA driver version does not support the CUDA Compatibility Package version, you can disable or remove the CUDA Compatibility Package from the model container image. If the CUDA compatibility libraries version is supported by the latest NVIDIA driver version, we suggest that you enable the CUDA Compatibility Package based on the detected NVIDIA driver version, for future compatibility. To do so, add the code snippet below to the container startup shell script (at the `ENTRYPOINT` script).

The script demonstrates how to dynamically switch the use of the CUDA Compatibility Package based on the NVIDIA driver version detected on the host where your model container is deployed. When SageMaker AI releases a newer NVIDIA driver version, the installed CUDA Compatibility Package is turned off automatically if the CUDA application is supported natively on the new driver.

```
#!/bin/bash

verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

if [ -f /usr/local/cuda/compat/libcuda.so.1 ]; then
    CUDA_COMPAT_MAX_DRIVER_VERSION=$(readlink /usr/local/cuda/compat/libcuda.so.1 | cut -d'.' -f 3-)
    echo "CUDA compat package should be installed for NVIDIA driver smaller than ${CUDA_COMPAT_MAX_DRIVER_VERSION}"
    NVIDIA_DRIVER_VERSION=$(sed -n 's/^NVRM.*Kernel Module *\([0-9.]*\).*$/\1/p' /proc/driver/nvidia/version 2>/dev/null || true)
    echo "Current installed NVIDIA driver version is ${NVIDIA_DRIVER_VERSION}"
    if verlt $NVIDIA_DRIVER_VERSION $CUDA_COMPAT_MAX_DRIVER_VERSION; then
        echo "Adding CUDA compat to LD_LIBRARY_PATH"
        export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
        echo $LD_LIBRARY_PATH
    else
        echo "Skipping CUDA compat setup as newer NVIDIA driver is installed"
    fi
else
    echo "Skipping CUDA compat setup as package not found"
fi
```
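The decision hinges on the `verlt` helper, which returns success only when the first version string is strictly lower than the second, using `sort -V` for natural version ordering. A quick standalone check (the version numbers are just examples):

```shell
# Same helper as in the startup script above: succeeds when $1 < $2
verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

verlt 535.54.03 550.144.01 && echo "535.54.03 is lower: enable the compat package"
verlt 550.144.01 535.54.03 || echo "550.144.01 is not lower: skip the compat package"
verlt 550.144.01 550.144.01 || echo "equal versions: skip the compat package"
```

Note that string comparison alone would get this wrong (for example, `"535" > "1234"` lexically), which is why the script relies on `sort -V`.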

# Best practices for endpoint security and health with Amazon SageMaker AI
<a name="best-practice-endpoint-security"></a>

To address the latest security issues, Amazon SageMaker AI automatically patches endpoints to the latest and most secure software. However, if you incorrectly modify your endpoint dependencies, Amazon SageMaker AI can't automatically patch your endpoints or replace your unhealthy instances. To ensure your endpoints remain eligible for automatic updates, apply the following best practices.

## Don't delete resources while your endpoints use them
<a name="dont-delete-resources-in-use"></a>

Avoid deleting any of the following resources if you have existing endpoints that use them:
+ The model definition that you create with the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) action in the Amazon SageMaker API.
+ Any model artifacts that you specify for the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-ModelDataUrl](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-ModelDataUrl) parameter.
+ The IAM role and permissions that you specify for the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-ExecutionRoleArn](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-ExecutionRoleArn) parameter.
**Reminder**  
In the model definition that your endpoint uses, ensure that the IAM role that you specified has the correct permissions. For more information about the required permissions for Amazon SageMaker AI endpoints, see [CreateModel API: Execution Role Permissions](sagemaker-roles.md#sagemaker-roles-createmodel-perms).
+ The inference images that you specify for the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-Image](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ContainerDefinition.html#sagemaker-Type-ContainerDefinition-Image) parameter, if you use your own inference code.
**Reminder**  
If you use the private registry feature, ensure that Amazon SageMaker AI can access the private registry as long as you're using the endpoint.
+ The Amazon VPC subnets and security groups that you specify for the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html#sagemaker-CreateModel-request-VpcConfig) parameter.
+ The endpoint configuration that you create with the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) action in the Amazon SageMaker API.
+ Any KMS keys or Amazon S3 buckets that you specify in the endpoint configuration.
**Reminder**  
Ensure you don’t disable these KMS keys.

## Follow these procedures to update your endpoints
<a name="procedures-to-update-endpoint"></a>

When you update your Amazon SageMaker AI endpoints, use any of the following procedures that apply to your needs.

**To update your model definition settings**

1. Create a new model definition with your updated settings by using the CreateModel action in the Amazon SageMaker API.

1. Create a new endpoint configuration that uses the new model definition. To do this, use the CreateEndpointConfig action in the Amazon SageMaker API.

1. Update your endpoint with the new endpoint configuration so that your updated model definition settings take effect.

1. (Optional) Delete the old endpoint configuration if you're not using it with any other endpoints. You can also delete the resources that you specified in the model definition if you're not using them with any other endpoints. These resources include model artifacts in Amazon S3 and inference images.
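The steps above might look like the following with the AWS CLI. All names, the role ARN, the image URI, and the S3 location are placeholders:

```shell
# 1. Create a new model definition with the updated settings (hypothetical values)
aws sagemaker create-model \
    --model-name my-model-v2 \
    --execution-role-arn arn:aws:iam::111122223333:role/SageMakerExecutionRole \
    --primary-container '{"Image": "<image-uri>", "ModelDataUrl": "s3://amzn-s3-demo-bucket/model.tar.gz"}'

# 2. Create a new endpoint configuration that references the new model
aws sagemaker create-endpoint-config \
    --endpoint-config-name my-endpoint-config-v2 \
    --production-variants '[{"VariantName": "AllTraffic", "ModelName": "my-model-v2", "InstanceType": "ml.m5.large", "InitialInstanceCount": 1}]'

# 3. Point the existing endpoint at the new configuration
aws sagemaker update-endpoint \
    --endpoint-name my-endpoint \
    --endpoint-config-name my-endpoint-config-v2
```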

**To update your endpoint configuration**

1. Create a new endpoint configuration with your updated settings.

1. Update your endpoint with the new configuration so that your updates take effect.

1. (Optional) Delete the old endpoint configuration if you're not using it with any other endpoints. You can also delete the resources that you specified in the model definition if you're not using them with any other endpoints. These resources include model artifacts in Amazon S3 and inference images.

Whenever you create a new model definition or endpoint configuration, we recommend that you use a unique name. If you want to update these resources and retain their original names, use the following procedures.

**To update your model settings and retain the original model name**

1. Delete the existing model definition. At this point, any endpoint that uses the model is broken, but you fix this in the following steps.

1. Create the model definition again with your updated settings, and use the same model name.

1. Create a new endpoint configuration that uses the updated model definition.

1. Update your endpoint with the new endpoint configuration so that your updates take effect.

**To update your endpoint configuration and retain the original configuration name**

1. Delete the existing endpoint configuration.

1. Create a new endpoint configuration with your updated settings, and use the original name.

1. Update your endpoint with the new configuration so that your updates take effect.

# Updating inference containers to comply with the NVIDIA Container Toolkit
<a name="container-nvidia-compliance"></a>

As of versions 1.17.4 and higher, the NVIDIA Container Toolkit no longer mounts CUDA compatibility libraries automatically. This change in behavior could affect your SageMaker AI inference workloads. Your SageMaker AI endpoints and batch transform jobs might use containers that are incompatible with the latest versions of the NVIDIA Container Toolkit. To ensure that your workloads comply with the latest requirements, you might need to update your endpoints or configure your batch transform jobs.

## Updating SageMaker AI endpoints for compliance
<a name="endpoint-compliance"></a>

We recommend that you update your existing SageMaker AI endpoints or create new ones that support the latest default behavior.

To ensure your endpoint is compatible with latest versions of the NVIDIA Container Toolkit, follow these steps:

1. Update how you set up the CUDA compatibility libraries if you bring your own container.

1. Specify an inference Amazon Machine Image (AMI) that supports the latest NVIDIA Container Toolkit behavior. You specify an AMI when you update an existing endpoint or create a new one.

### Updating the CUDA compatibility setup if you bring your own container
<a name="cuda-compatibility"></a>

The CUDA compatibility libraries enable forward compatibility. This compatibility applies to any CUDA toolkit versions that are newer than the NVIDIA driver provided by the SageMaker AI instance.

You must enable the CUDA compatibility libraries only when the NVIDIA driver that the SageMaker AI instance uses has an older version than the CUDA toolkit in the model container. If your model container does not require CUDA compatibility, you can skip this step. For example, you can skip this step if you don't plan to use a newer CUDA toolkit than those provided by SageMaker AI instances.

Because of the changes introduced in the NVIDIA Container Toolkit version 1.17.4, you can explicitly enable CUDA compatibility libraries, if needed, by adding them to `LD_LIBRARY_PATH` in the container.

We suggest that you enable the CUDA compatibility based on the detected NVIDIA driver version. To enable it, add the code snippet below to the container startup shell script. Add this code at the `ENTRYPOINT` script.

The following script demonstrates how to dynamically switch the use of CUDA compatibility based on the NVIDIA driver version detected on the host where your model container is deployed.

```
#!/bin/bash

verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

if [ -f /usr/local/cuda/compat/libcuda.so.1 ]; then
    CUDA_COMPAT_MAX_DRIVER_VERSION=$(readlink /usr/local/cuda/compat/libcuda.so.1 | cut -d'.' -f 3-)
    echo "CUDA compat package should be installed for NVIDIA driver smaller than ${CUDA_COMPAT_MAX_DRIVER_VERSION}"
    NVIDIA_DRIVER_VERSION=$(sed -n 's/^NVRM.*Kernel Module *\([0-9.]*\).*$/\1/p' /proc/driver/nvidia/version 2>/dev/null || true)
    echo "Current installed NVIDIA driver version is ${NVIDIA_DRIVER_VERSION}"
    if verlt $NVIDIA_DRIVER_VERSION $CUDA_COMPAT_MAX_DRIVER_VERSION; then
        echo "Adding CUDA compat to LD_LIBRARY_PATH"
        export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
        echo $LD_LIBRARY_PATH
    else
        echo "Skipping CUDA compat setup as newer NVIDIA driver is installed"
    fi
else
    echo "Skipping CUDA compat setup as package not found"
fi
```

### Specifying an Inference AMI that complies with the NVIDIA Container Toolkit
<a name="specify-inference-ami"></a>

In the `InferenceAmiVersion` parameter of the `ProductionVariant` data type, you can select the AMI for a SageMaker AI endpoint. Each supported AMI is an image that AWS preconfigures with a specific set of software and driver versions.

By default, the SageMaker AI AMIs follow the legacy behavior. They automatically mount CUDA compatibility libraries in the container. To make an endpoint use the new behavior, you must specify an inference AMI version that is configured for the new behavior.

The following inference AMI versions currently follow the new behavior. They don't mount CUDA compatibility libraries automatically.

al2-ami-sagemaker-inference-gpu-2-1  
+ NVIDIA driver version: 535.54.03
+ CUDA version: 12.2

al2-ami-sagemaker-inference-gpu-3-1  
+ NVIDIA driver version: 550.144.01
+ CUDA version: 12.4

### Updating an existing endpoint
<a name="update-existing-endpoint"></a>

Use the following example to update an existing endpoint. The example uses an inference AMI version that disables automatic mounting of CUDA compatibility libraries.

```
ENDPOINT_NAME="<endpoint name>"
INFERENCE_AMI_VERSION="al2-ami-sagemaker-inference-gpu-3-1"

# Obtaining current endpoint configuration
CURRENT_ENDPOINT_CFG_NAME=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointConfigName" --output text)
NEW_ENDPOINT_CFG_NAME="${CURRENT_ENDPOINT_CFG_NAME}new"

# Copying Endpoint Configuration with AMI version specified
aws sagemaker describe-endpoint-config \
    --endpoint-config-name ${CURRENT_ENDPOINT_CFG_NAME} \
    --output json | \
jq "del(.EndpointConfigArn, .CreationTime) | . + {
    EndpointConfigName: \"${NEW_ENDPOINT_CFG_NAME}\",
    ProductionVariants: (.ProductionVariants | map(.InferenceAmiVersion = \"${INFERENCE_AMI_VERSION}\"))
}" > /tmp/new_endpoint_config.json

# Make sure all fields in the new endpoint config look as expected
cat /tmp/new_endpoint_config.json

# Creating new endpoint config
aws sagemaker create-endpoint-config \
   --cli-input-json file:///tmp/new_endpoint_config.json
    
# Updating the endpoint
aws sagemaker update-endpoint \
    --endpoint-name "$ENDPOINT_NAME" \
    --endpoint-config-name "$NEW_ENDPOINT_CFG_NAME" \
    --retain-all-variant-properties
```

### Creating a new endpoint
<a name="create-new-endpoint"></a>

Use the following example to create a new endpoint. The example uses an inference AMI version that disables automatic mounting of CUDA compatibility libraries.

```
INFERENCE_AMI_VERSION="al2-ami-sagemaker-inference-gpu-3-1"

aws sagemaker create-endpoint-config \
 --endpoint-config-name "<endpoint_config>" \
 --production-variants '[{
    ...
    "InferenceAmiVersion": "'"${INFERENCE_AMI_VERSION}"'",
    ...
 }]'

aws sagemaker create-endpoint \
--endpoint-name "<endpoint_name>" \
--endpoint-config-name "<endpoint_config>"
```

## Running compliant batch transform jobs
<a name="batch-compliance"></a>

*Batch transform* is the inference option that's best suited for requests to process large amounts of data offline. To create batch transform jobs, you use the `CreateTransformJob` API action. For more information, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

The changed behavior of the NVIDIA Container Toolkit affects batch transform jobs. To run a batch transform that complies with the NVIDIA Container Toolkit requirements, do the following:

1. If you want to run batch transform with a model for which you've brought your own container, first, update the container for CUDA compatibility. To update it, follow the process in [Updating the CUDA compatibility setup if you bring your own container](#cuda-compatibility).

1. Use the `CreateTransformJob` API action to create the batch transform job. In your request, set the `SAGEMAKER_CUDA_COMPAT_DISABLED` environment variable to `true`. This environment variable instructs the container not to automatically mount CUDA compatibility libraries.

   For example, when you create a batch transform job by using the AWS CLI, you set the environment variable with the `--environment` parameter:

   ```
   aws sagemaker create-transform-job \
       --environment '{"SAGEMAKER_CUDA_COMPAT_DISABLED": "true"}'\
       . . .
   ```

# Supported features
<a name="model-deploy-feature-matrix"></a>

 Amazon SageMaker AI offers the following four options to deploy models for inference. 
+  Real-time inference for inference workloads with real-time, interactive, low latency requirements. 
+  Batch transform for offline inference with large datasets. 
+  Asynchronous inference for near-real-time inference with large inputs that require longer preprocessing times. 
+  Serverless inference for inference workloads that have idle periods between traffic spurts. 

 The following table summarizes the core platform features that are supported by each inference option. It does not show features that can be provided by frameworks, custom Docker containers, or through chaining different AWS services. 


| Feature | [Real-time inference](realtime-endpoints.md) | [Batch transform](batch-transform.md) | [Asynchronous inference](async-inference.md) | [Serverless inference](serverless-endpoints.md) | [Docker containers](docker-containers.md) | 
| --- | --- | --- | --- | --- | --- | 
| [Autoscaling support](endpoint-auto-scaling.md) | ✓ | N/A | ✓ | ✓ | N/A | 
| GPU support | ✓1 | ✓1 | ✓1 |  | [1P](common-info-all-im-models.md), pre-built, BYOC | 
| Single model | ✓ | ✓ | ✓ | ✓ | N/A | 
| [Multi-model endpoint](multi-model-endpoints.md) | ✓ |  |  |  | k-NN, XGBoost, Linear Learner, RCF, TensorFlow, Apache MXNet, PyTorch, scikit-learn 2 | 
| [Multi-container endpoint](multi-container-endpoints.md) | ✓ |  |  |  | 1P, pre-built, Extend pre-built, BYOC | 
| [Serial inference pipeline](inference-pipelines.md) | ✓ | ✓ |  |  | 1P, pre-built, Extend pre-built, BYOC | 
| [Inference Recommender](inference-recommender.md) | ✓ |  |  |  | 1P, pre-built, Extend pre-built, BYOC | 
| Private link support | ✓ | ✓ | ✓ |  | N/A | 
| [Data capture/Model monitor support](model-monitor.md) | ✓ | ✓ |  |  | N/A | 
| [DLCs supported](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) | 1P, pre-built, Extend pre-built, BYOC | [1P](common-info-all-im-models.md), pre-built, Extend pre-built, BYOC | 1P, pre-built, Extend pre-built, BYOC | 1P, pre-built, Extend pre-built, BYOC | N/A | 
| Protocols supported | HTTP(S) | HTTP(S) | HTTP(S) | HTTP(S) | N/A | 
| Payload size | < 6 MB | ≤ 100 MB | ≤ 1 GB | ≤ 4 MB |  | 
| HTTP chunked encoding | Framework dependent, 1P not supported | N/A | Framework dependent, 1P not supported | Framework dependent, 1P not supported | N/A | 
| Request timeout | < 60 seconds | Days | < 1 hour | < 60 seconds | N/A | 
| [Deployment guardrails: blue/green deployments](deployment-guardrails.md) | ✓ | N/A | ✓ |  | N/A | 
| [Deployment guardrails: rolling deployments](deployment-guardrails.md) | ✓ | N/A | ✓ |  | N/A | 
| [Shadow testing](shadow-tests.md) | ✓ |  |  |  | N/A | 
| Scale to zero |  | N/A | ✓ | ✓ | N/A | 
| Marketplace model packages support | ✓ | ✓ | ✓ |  | N/A | 
| Virtual private cloud support | ✓ | ✓ | ✓ |  | N/A | 
| Multiple production variants support | ✓ |  |  |  | N/A | 
| Network isolation | ✓ |  | ✓ |  | N/A | 
| [Model parallel serving support](model-parallel-intro.md) | ✓3 | ✓ | ✓3 |  | ✓3 | 
| Volume encryption | ✓ | ✓ | ✓ | ✓ | N/A | 
| Customer AWS KMS | ✓ | ✓ | ✓ | ✓ | N/A | 
| d instance support | ✓ | ✓ | ✓ |  | N/A | 
| [inf1 support](neo-supported-cloud.md) | ✓ |  |  |  | ✓ | 

 With SageMaker AI, you can deploy a single model, or multiple models behind a single inference endpoint for real-time inference. The following table summarizes the core features supported by various hosting options that come with real-time inference. 


| Feature | [Single model endpoints](realtime-single-model.md) | [Multi-model endpoints](multi-model-endpoints.md) | [Serial inference pipeline](inference-pipelines.md) | [Multi-container endpoints](multi-container-endpoints.md) | 
| --- | --- | --- | --- | --- | 
| [Autoscaling support](endpoint-auto-scaling.md) | ✓ | ✓ | ✓ | ✓ | 
| GPU support | ✓1 | ✓ | ✓ |  | 
| Single model | ✓ | ✓ | ✓ | ✓ | 
| [Multi-model endpoints](multi-model-endpoints.md) |  | ✓ | ✓ | N/A | 
| [Multi-container endpoints](multi-container-endpoints.md) | ✓ |  |  | N/A | 
| [Serial inference pipeline](inference-pipelines.md) | ✓ | ✓ | N/A |  | 
| [Inference Recommender](inference-recommender.md) | ✓ |  |  |  | 
| Private link support | ✓ | ✓ | ✓ | ✓ | 
| [Data capture/Model monitor support](model-monitor.md) | ✓ | N/A | N/A | N/A | 
| DLCs supported | 1P, pre-built, Extend pre-built, BYOC | k-NN, XGBoost, Linear Learner, RCF, TensorFlow, Apache MXNet, PyTorch, scikit-learn 2 | 1P, pre-built, Extend pre-built, BYOC | 1P, pre-built, Extend pre-built, BYOC | 
| Protocols supported | HTTP(S) | HTTP(S) | HTTP(S) | HTTP(S) | 
| Payload size | < 6 MB | < 6 MB | < 6 MB | < 6 MB | 
| Request timeout | < 60 seconds | < 60 seconds | < 60 seconds | < 60 seconds | 
| [Deployment guardrails: blue/green deployments](deployment-guardrails.md) | ✓ | ✓ | ✓ | ✓ | 
| [Deployment guardrails: rolling deployments](deployment-guardrails.md) | ✓ | ✓ | ✓ | ✓ | 
| [Shadow testing](shadow-tests.md) | ✓ |  |  |  | 
| Marketplace model packages support | ✓ |  |  |  | 
| Virtual private cloud support | ✓ | ✓ | ✓ | ✓ | 
| Multiple production variants support | ✓ |  | ✓ | ✓ | 
| Network isolation | ✓ | ✓ | ✓ | ✓ | 
| [Model parallel serving support](model-parallel-intro.md) | ✓ 3 |  | ✓ 3 |  | 
| Volume encryption | ✓ | ✓ | ✓ | ✓ | 
| Customer AWS KMS | ✓ | ✓ | ✓ | ✓ | 
| d instance support | ✓ | ✓ | ✓ | ✓ | 
| [inf1 support](neo-supported-cloud.md) | ✓ |  |  |  | 

 1 Availability of the Amazon EC2 instance types depends on the AWS Region. For the availability of instance types in each AWS Region, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/). 

 2 To use any other framework or algorithm, use the SageMaker AI Inference toolkit to build a container that supports multi-model endpoints. 

 3 With SageMaker AI, you can deploy large models (up to 500 GB) for inference. You can configure the container health check and download timeout quotas up to 60 minutes, which gives you more time to download and load your model and associated resources. For more information, see [SageMaker AI endpoint parameters for large model inference](large-model-inference-hosting.md). You can use SageMaker AI compatible [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers). You can also use third-party model parallelization libraries, such as Triton with FasterTransformer and DeepSpeed, as long as you ensure that they are compatible with SageMaker AI. 

# Resources
<a name="inference-resources"></a>

Use the following resources for troubleshooting and reference, answering FAQs, and learning more about Amazon SageMaker AI.

**Topics**
+ [Blogs, example notebooks, and additional resources](deploy-model-blogs.md)
+ [Troubleshooting and reference](deploy-model-reference.md)
+ [Model Hosting FAQs](hosting-faqs.md)

# Blogs, example notebooks, and additional resources
<a name="deploy-model-blogs"></a>

The following sections contain examples and additional resources for you to learn more about Amazon SageMaker AI.

## Blogs and case studies
<a name="deploy-model-blogs-table"></a>

See the following table for lists of blogs and case studies for various features within SageMaker AI Inference. You can use the blogs to help put together solutions that work for your use case.


| Feature | Resources | 
| --- | --- | 
|  Real-Time Inference  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Autoscaling  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Serverless Inference  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Asynchronous Inference  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Batch Transform  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Multi-Model Endpoints  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Serial Inference Pipelines  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Multi-Container Endpoints  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Running Model Ensembles  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Inference Recommender  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Advanced model hosting blog series  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 

## Example notebooks
<a name="deploy-model-blogs-nbs"></a>

See the following table for example notebooks that can help you learn more about SageMaker AI Inference.


| Feature | Example notebooks | 
| --- | --- | 
|  Inference Recommender  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-blogs.html)  | 
|  Optimize large language models (LLMs) for SageMaker AI  |  [Generative AI LLMs workshop](https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop)  | 

## Additional resources
<a name="deploy-model-blogs-extras"></a>

For a detailed look at each SageMaker AI Inference option, watch the following video.

[![AWS Videos](http://img.youtube.com/vi/4FqHt5bmS2o/0.jpg)](https://www.youtube.com/watch?v=4FqHt5bmS2o)


# Troubleshooting and reference
<a name="deploy-model-reference"></a>

You can use the following resources and reference documentation to understand best practices when using SageMaker AI Inference and to troubleshoot issues with model deployments:
+ For troubleshooting model deployments, see [Troubleshoot Amazon SageMaker AI model deployments](deploy-model-troubleshoot.md).
+ For model deployment best practices, see [Best practices](https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html).
+ For reference information about the size of storage volumes provided for different sizes of hosting instances, see [Instance storage volumes](host-instance-storage.md).
+ For reference information about SageMaker AI limits and quotas, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).
+ For frequently asked questions about SageMaker AI, see [Model Hosting FAQs](hosting-faqs.md).

# Model Hosting FAQs
<a name="hosting-faqs"></a>

Refer to the following FAQ items for answers to commonly asked questions about SageMaker AI Inference Hosting.

## General Hosting
<a name="hosting-faqs-general"></a>

The following FAQ items answer common general questions for SageMaker AI Inference.

### Q: What deployment options does Amazon SageMaker AI provide?
<a name="hosting-faqs-general-1"></a>

A: After you build and train models, Amazon SageMaker AI provides four options to deploy them so you can start making predictions. Real-Time Inference is suitable for workloads with millisecond latency requirements, payload sizes up to 25 MB, and processing times of up to 60 seconds for regular responses and 8 minutes for streaming responses. Batch Transform is ideal for offline predictions on large batches of data that are available up front. Asynchronous Inference is designed for workloads that do not have sub-second latency requirements, payload sizes up to 1 GB, and processing times of up to 60 minutes. With Serverless Inference, you can quickly deploy machine learning models for inference without having to configure or manage the underlying infrastructure, and you pay only for the compute capacity used to process inference requests, which is ideal for intermittent workloads.

### Q: How do I choose a model deployment option in SageMaker AI?
<a name="hosting-faqs-general-2"></a>

A: If you want to process requests in batches, you might want to choose Batch Transform. Otherwise, if you want to receive inference for each request to your model, you might want to choose Asynchronous Inference, Serverless Inference, or Real-Time Inference. You can choose Asynchronous Inference if you have long processing times or large payloads and want to queue requests. You can choose Serverless Inference if your workload has unpredictable or intermittent traffic. You can choose Real-Time Inference if you have sustained traffic and need lower and consistent latency for your requests.

### Q: I’ve heard SageMaker AI Inference is expensive. What’s the best way to optimize my cost when hosting models?
<a name="hosting-faqs-general-3"></a>

A: To optimize your costs with SageMaker AI Inference, you should choose the right hosting option for your use case. You can also use Inference features such as [Amazon SageMaker AI Savings Plans](https://aws.amazon.com/savingsplans/ml-pricing/), model optimization with [SageMaker Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html), [Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) and [Multi-Container Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html), or autoscaling. For tips on how to optimize your Inference costs, see [Inference cost optimization best practices](inference-cost-optimization.md).

### Q: Why should I use Amazon SageMaker Inference Recommender?
<a name="hosting-faqs-general-4"></a>

A: You should use Amazon SageMaker Inference Recommender if you need recommendations for the right endpoint configuration to improve performance and reduce costs. Previously, data scientists who wanted to deploy their models had to run manual benchmarks to select the right endpoint configuration. First, they had to select the right machine learning instance type out of more than 70 available instance types based on the resource requirements of their models and sample payloads, and then optimize the model to account for differing hardware. Then, they had to conduct extensive load tests to validate that latency and throughput requirements were met and that the costs were low. Inference Recommender eliminates this complexity by helping you do the following: 
+ Get started in minutes with an instance recommendation.
+ Conduct load tests across instance types to get recommendations on your endpoint configuration within hours. 
+ Automatically tune container and model server parameters as well as perform model optimizations for a given instance type.

### Q: What is a model server?
<a name="hosting-faqs-general-5"></a>

A: SageMaker AI endpoints are HTTP REST endpoints that use a containerized web server, which includes a model server. These containers are responsible for loading up and serving requests for a machine learning model. They implement a web server that responds to `/invocations` and `/ping` on port 8080.

Common model servers include TensorFlow Serving, TorchServe and Multi Model Server. SageMaker AI framework containers have these model servers built in.
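As an illustration only (this is not a SageMaker AI container, and the toy "model" that sums its inputs is invented for the example), the `/ping` and `/invocations` contract a model server implements can be sketched with Python's standard library. A real container would listen on port 8080:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import threading
import urllib.request

class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # SageMaker AI probes /ping to check container health.
        if self.path == "/ping":
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

    def do_POST(self):
        # SageMaker AI forwards inference requests to /invocations.
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"prediction": sum(payload["inputs"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging in this demo

# Port 0 picks a free port for the demo; a real container uses 8080.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/invocations",
    data=json.dumps({"inputs": [1, 2, 3]}).encode(),
    headers={"Content-Type": "application/json"},
)
response = urllib.request.urlopen(req).read().decode()
print(response)  # {"prediction": 6}
server.shutdown()
```

Production model servers such as TorchServe add model management, batching, and worker scaling on top of this same HTTP contract.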

### Q: What is Bring Your Own Container with Amazon SageMaker AI?
<a name="hosting-faqs-general-6"></a>

A: Everything in SageMaker AI Inference is containerized. SageMaker AI provides managed containers for popular frameworks such as TensorFlow, scikit-learn, and Hugging Face. For a comprehensive, updated list of those images, see [Available Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

 Sometimes there are custom frameworks for which you might need to build a container. This approach is known as *Bring Your Own Container* or *BYOC*. With the BYOC approach, you provide the Docker image to set up your framework or library. Then, you push the image to Amazon Elastic Container Registry (Amazon ECR) so that you can use the image with SageMaker AI.

Alternatively, instead of building an image from scratch, you can extend a container. You can take one of the base images that SageMaker AI provides and add your dependencies on top of it in your Dockerfile.

### Q: Do I need to train my models on SageMaker AI to host them on SageMaker AI endpoints?
<a name="hosting-faqs-general-7"></a>

A: SageMaker AI offers the capability to bring a framework model that you've trained outside of SageMaker AI and deploy it on any of the SageMaker AI hosting options.

SageMaker AI requires you to package the model in a `model.tar.gz` file and have a specific directory structure. Each framework has its own model structure (see the following question for example structures). For more information, see the SageMaker Python SDK documentation for [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#deploying-directly-from-model-artifacts), [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#bring-your-own-model), and [MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#deploy-endpoints-from-model-data).

While you can choose from prebuilt framework images such as TensorFlow, PyTorch, and MXNet to host your trained model, you can also build your own container to host your trained models on SageMaker AI endpoints. For a walkthrough, see the example Jupyter notebook [Building your own algorithm container](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).

### Q: How should I structure my model if I want to deploy on SageMaker AI but not train on SageMaker AI?
<a name="hosting-faqs-general-8"></a>

A: SageMaker AI requires your model artifacts to be compressed in a `.tar.gz` file, or a *tarball*. SageMaker AI automatically extracts this `.tar.gz` file into the `/opt/ml/model/` directory in your container. The tarball shouldn't contain any symlinks or unnecessary files. If you use one of the framework containers, such as TensorFlow, PyTorch, or MXNet, the container expects your TAR structure to be as follows: 

**TensorFlow**

```
model.tar.gz/
             |--[model_version_number]/
                                       |--variables
                                       |--saved_model.pb
            code/
                |--inference.py
                |--requirements.txt
```

**PyTorch**

```
model.tar.gz/
             |- model.pth
             |- code/
                     |- inference.py
                     |- requirements.txt  # only for versions 1.3.1 and higher
```

**MXNet**

```
model.tar.gz/
            |- model-symbol.json
            |- model-shapes.json
            |- model-0000.params
            |- code/
                    |- inference.py
                    |- requirements.txt # only for versions 1.6.0 and higher
```
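To illustrate, the PyTorch layout above can be assembled with Python's standard `tarfile` module. This sketch uses empty placeholder files; the `model.pth` and `code/inference.py` names follow the structure shown:

```python
import os
import tarfile
import tempfile

def package_model(model_file, inference_script, output_path):
    """Package artifacts into the PyTorch-style model.tar.gz layout."""
    with tarfile.open(output_path, "w:gz") as tar:
        tar.add(model_file, arcname="model.pth")
        tar.add(inference_script, arcname="code/inference.py")
    return output_path

# Create empty placeholder files to stand in for real artifacts.
workdir = tempfile.mkdtemp()
model_file = os.path.join(workdir, "model.pth")
script = os.path.join(workdir, "inference.py")
open(model_file, "wb").close()
open(script, "w").close()

tarball = package_model(model_file, script, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(tarball) as tar:
    print(sorted(tar.getnames()))  # ['code/inference.py', 'model.pth']
```

Upload the resulting `model.tar.gz` to Amazon S3 and reference it as the model data location when you create the SageMaker AI model.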

### Q: When invoking a SageMaker AI endpoint, I can provide a `ContentType` and `Accept` MIME Type. Which one is used to identify the data type being sent and received?
<a name="hosting-faqs-general-10"></a>

A: `ContentType` is the MIME type of the input data in the request body (the MIME type of the data you are sending to your endpoint). The model server uses the `ContentType` to determine if it can handle the type provided or not.

`Accept` is the MIME type of the inference response (the MIME type of the data your endpoint returns). The model server uses the `Accept` type to determine if it can handle returning the type provided or not.

Common MIME types include `text/csv`, `application/json`, and `application/jsonlines`.
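As a sketch of how the two headers pair up in an `InvokeEndpoint` request (the endpoint name is a placeholder, and the live call is shown commented out because it requires AWS credentials and a deployed endpoint):

```python
# Hypothetical InvokeEndpoint request showing ContentType vs. Accept.
request = {
    "EndpointName": "my-endpoint",   # placeholder name
    "ContentType": "text/csv",       # MIME type of the body you send
    "Accept": "application/json",    # MIME type you want the response in
    "Body": "5.1,3.5,1.4,0.2",
}

# With credentials and a deployed endpoint, you would send it like this:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**request)
# print(response["Body"].read())
print(request["ContentType"], "->", request["Accept"])
```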

### Q: What are the supported data formats for SageMaker AI Inference?
<a name="hosting-faqs-general-12"></a>

A: SageMaker AI passes any request onto the model container without modification. The container must contain the logic to deserialize the request. For information about the formats defined for built-in algorithms, see [Common Data Formats for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). If you are building your own container or using a SageMaker AI Framework container, you can include the logic to accept a request format of your choice.

Similarly, SageMaker AI also returns the response without modification, and then the client must deserialize the response. In case of the built-in algorithms, they return responses in specific formats. If you are building your own container or using a SageMaker AI Framework container, you can include the logic to return a response in the format you choose.

### Q: How do I invoke my endpoint with binary data such as videos or images?
<a name="hosting-faqs-general-11"></a>

A: Use the [Invoke Endpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API call to make inference requests against your endpoint.

When passing your input as a payload to the `InvokeEndpoint` API, you must provide the correct type of input data that your model expects. When passing a payload in the `InvokeEndpoint` API call, the request bytes are forwarded directly to the model container. For example, for an image, you may use `application/jpeg` for the `ContentType`, and make sure that your model can perform inference on this type of data. This applies for JSON, CSV, video, or any other type of input with which you may be dealing.

Another factor to consider is payload size limits. The payload limits are 25 MB for real-time endpoints and 4 MB for serverless endpoints. You can split your video into multiple frames and invoke the endpoint with each frame individually. Alternatively, if your use case permits, you can send the whole video in the payload using an asynchronous endpoint, which supports up to 1 GB payloads.

For an example that showcases how to run computer vision inference on large videos with Asynchronous Inference, see this [blog post](https://aws.amazon.com/blogs/machine-learning/run-computer-vision-inference-on-large-videos-with-amazon-sagemaker-asynchronous-endpoints/).

## Real-Time Inference
<a name="hosting-faqs-real-time"></a>

The following FAQ items answer common questions for SageMaker AI Real-Time Inference.

### Q: How do I create a SageMaker AI endpoint?
<a name="hosting-faqs-real-time-1"></a>

A: You can create a SageMaker AI endpoint through AWS-supported tooling such as the AWS SDKs, the SageMaker Python SDK, the AWS Management Console, AWS CloudFormation, and the AWS Cloud Development Kit (AWS CDK).

There are three key entities in endpoint creation: a SageMaker AI model, a SageMaker AI endpoint configuration, and a SageMaker AI endpoint. The SageMaker AI model points to the model data and image you are using. The endpoint configuration defines your production variants, which might include the instance type and instance count. You can then use either the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) API call or the [.deploy()](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) call for SageMaker AI to create an endpoint using the metadata from your model and endpoint configuration.
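The three entities can be sketched as boto3 request payloads. Every name, image URI, S3 path, and role ARN below is a placeholder, and the API calls themselves are shown commented out because they require AWS credentials and real artifacts:

```python
# Sketch of the three entities behind a real-time endpoint.
# All names, URIs, and ARNs are placeholders.
model = {
    "ModelName": "my-model",
    "PrimaryContainer": {
        "Image": "<account>.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
    "ExecutionRoleArn": "arn:aws:iam::<account>:role/MySageMakerRole",
}

endpoint_config = {
    "EndpointConfigName": "my-endpoint-config",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": model["ModelName"],
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
}

endpoint = {
    "EndpointName": "my-endpoint",
    "EndpointConfigName": endpoint_config["EndpointConfigName"],
}

# With credentials in place, you would create them in order:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(**model)
# sm.create_endpoint_config(**endpoint_config)
# sm.create_endpoint(**endpoint)
```

Note how the endpoint references only the endpoint configuration, which in turn references the model; this indirection is what lets you update an endpoint by pointing it at a new configuration.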

### Q: Do I need to use the SageMaker Python SDK to create/invoke endpoints?
<a name="hosting-faqs-real-time-2"></a>

A: No, you can use the various AWS SDKs (see [Invoke](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_SeeAlso)/[Create](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html#API_CreateEndpoint_SeeAlso) for available SDKs) or even call the corresponding web APIs directly.

### Q: What is the difference between Multi-Model Endpoints (MME) and Multi Model Server (MMS)?
<a name="hosting-faqs-real-time-3"></a>

A: A Multi-Model Endpoint is a Real-Time Inference option that SageMaker AI provides. With Multi-Model Endpoints, you can host thousands of models behind one endpoint. [Multi Model Server](https://github.com/awslabs/multi-model-server) is an open-source framework for serving machine learning models. It provides the HTTP front-end and model management capabilities required by multi-model endpoints to host multiple models within a single container, load models into and unload models out of the container dynamically, and perform inference on a specified loaded model.

### Q: What are the different model deployment architectures supported by Real-Time Inference?
<a name="hosting-faqs-real-time-4"></a>

A: SageMaker AI Real-Time Inference supports various model deployment architecture such as Multi-Model Endpoints, Multi-Container Endpoints, and Serial Inference Pipelines. 

[Multi-Model Endpoints (MME)](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) – MME allows customers to deploy thousands of hyper-personalized models in a cost-effective way. All the models are deployed on a shared-resource fleet. MME works best when the models are of similar size and latency and belong to the same ML framework. These endpoints are ideal when you don't need to call the same model at all times. You can dynamically load the respective models onto the SageMaker AI endpoint to serve your request.

[Multi-Container Endpoints (MCE)](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html) – MCE allows customers to deploy up to 15 different containers with diverse ML frameworks and functionalities, with no cold starts, while only using one SageMaker AI endpoint. You can directly invoke these containers. MCE is best for when you want to keep all the models in memory.

[Serial Inference Pipelines (SIP)](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) – You can use SIP to chain together 2‐15 containers on a single endpoint. SIP is mostly suitable for combining preprocessing and model inference in one endpoint and for low latency operations.

## Serverless Inference
<a name="hosting-faqs-serverless"></a>

The following FAQ items answer common questions for Amazon SageMaker Serverless Inference.

### Q: What is Amazon SageMaker Serverless Inference?
<a name="hosting-faqs-serverless-1"></a>

A: [Deploy models with Amazon SageMaker Serverless Inference](serverless-endpoints.md) is a purpose-built serverless model serving option that makes it easy to deploy and scale ML models. Serverless Inference endpoints automatically start compute resources and scale them in and out depending on traffic, eliminating the need for you to choose instance type, run provisioned capacity, or manage scaling. You can optionally specify the memory requirements for your serverless endpoint. You pay only for the duration of running the inference code and the amount of data processed, not for idle periods.

### Q: Why should I use Serverless Inference?
<a name="hosting-faqs-serverless-2"></a>

A: Serverless Inference simplifies the developer experience by eliminating the need to provision capacity up front and manage scaling policies. Serverless Inference can scale instantly from tens to thousands of inferences within seconds based on usage patterns, making it ideal for ML applications with intermittent or unpredictable traffic. For example, a chatbot service used by a payroll processing company experiences an increase in inquiries at the end of the month, while traffic is intermittent for the rest of the month. Provisioning instances for the entire month in such scenarios is not cost-effective, because you end up paying for idle periods.

Serverless Inference helps address these types of use cases by providing you automatic and fast scaling out of the box without the need for you to forecast traffic up front or manage scaling policies. Additionally, you pay only for the compute time to run your inference code and for data processing, making it ideal for workloads with intermittent traffic.

### Q: How do I choose the right memory size for my serverless endpoint?
<a name="hosting-faqs-serverless-3"></a>

A: Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs.

Choose your endpoint’s memory size according to your model size. Generally, the memory size should be at least as large as your model size. You may need to benchmark in order to choose the right memory selection for your model based on your latency SLAs. The memory size increments have different pricing; see the [Amazon SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/) for more information.
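A minimal sketch of this selection logic follows. The helper function is invented for illustration; the `MemorySizeInMB` and `MaxConcurrency` field names match the `ServerlessConfig` block of the endpoint configuration, while the `MaxConcurrency` value shown is arbitrary:

```python
# Valid serverless memory sizes in MB; compute scales with the memory you pick.
VALID_MEMORY_MB = [1024, 2048, 3072, 4096, 5120, 6144]

def pick_memory_mb(model_size_mb):
    """Pick the smallest valid memory size at least as large as the model."""
    for size in VALID_MEMORY_MB:
        if size >= model_size_mb:
            return size
    raise ValueError("Model too large for a serverless endpoint (max 6144 MB)")

# Hypothetical ServerlessConfig fragment for an endpoint configuration.
serverless_config = {
    "MemorySizeInMB": pick_memory_mb(2500),  # e.g. a ~2.5 GB model
    "MaxConcurrency": 10,                    # arbitrary example value
}
print(serverless_config["MemorySizeInMB"])  # 3072
```

Treat the model-size heuristic as a starting point only; benchmark against your latency SLAs before settling on a size.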

## Batch Transform
<a name="hosting-faqs-batch"></a>

The following FAQ items answer common questions for SageMaker AI Batch Transform.

### Q: How does Batch Transform split my data?
<a name="hosting-faqs-batch-1"></a>

A: For specific file formats such as CSV, RecordIO, and TFRecord, SageMaker AI can split your data into single-record or multi-record mini batches and send this as a payload to your model container. When the value of [`BatchStrategy`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-BatchStrategy) is `MultiRecord`, SageMaker AI sends the maximum number of records in each request, up to the `MaxPayloadInMB` limit. When the value of `BatchStrategy` is `SingleRecord`, SageMaker AI sends individual records in each request.

### Q: What is the maximum timeout for Batch Transform and payload limit for a single record?
<a name="hosting-faqs-batch-2"></a>

A: The maximum timeout for Batch Transform is 3600 seconds. The [maximum payload size](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxPayloadInMB) for a record (per mini batch) is 100 MB.

### Q: How do I speed up a Batch Transform job?
<a name="hosting-faqs-batch-3"></a>

A: If you are using the [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB), [`MaxConcurrentTransforms`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxConcurrentTransforms), or [`BatchStrategy`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-BatchStrategy). The ideal value for `MaxConcurrentTransforms` is equal to the number of compute workers in the batch transform job. If you are using the SageMaker AI console, you can specify these optimal parameter values in the **Additional configuration** section of the **Batch transform job configuration** page. SageMaker AI automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an [execution-parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containe-serves-requests) endpoint.
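These tuning parameters can be sketched as a `CreateTransformJob` request payload. The job name, model name, and S3 paths are placeholders, and the call itself is shown commented out because it needs AWS credentials and a registered model:

```python
# Sketch of tuning parameters for CreateTransformJob.
# Job name, model name, and S3 paths are placeholders.
transform_job = {
    "TransformJobName": "my-transform-job",
    "ModelName": "my-model",
    "MaxPayloadInMB": 6,             # upper bound per request payload
    "MaxConcurrentTransforms": 4,    # ideally equals the number of compute workers
    "BatchStrategy": "MultiRecord",  # pack as many records as fit per request
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/input/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",         # split CSV input on line boundaries
    },
    "TransformOutput": {"S3OutputPath": "s3://my-bucket/output/"},
    "TransformResources": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 4,
    },
}

# With credentials in place:
# import boto3
# boto3.client("sagemaker").create_transform_job(**transform_job)
```

Here `MaxConcurrentTransforms` is set equal to `InstanceCount` so each compute worker stays busy, following the guidance above.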

### Q: What are the data formats natively supported in Batch Transform?
<a name="hosting-faqs-batch-4"></a>

A: Batch Transform supports CSV and JSON.

## Asynchronous Inference
<a name="hosting-faqs-async"></a>

The following FAQ items answer common general questions for SageMaker AI Asynchronous Inference.

### Q: What is Amazon SageMaker Asynchronous Inference?
<a name="hosting-faqs-async-1"></a>

A: Asynchronous Inference queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes or long processing times that need to be processed as they arrive. Optionally, you can configure auto-scaling settings to scale down the instance count to zero when not actively processing requests. 

### Q: How do I scale my endpoints to 0 when there’s no traffic?
<a name="hosting-faqs-async-2"></a>

A: Amazon SageMaker AI supports automatic scaling (autoscaling) for your asynchronous endpoint. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike the other hosted model options that SageMaker AI supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero. Requests that are received when there are zero instances are queued for processing once the endpoint scales up. For more information, see [Autoscale an asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-autoscale.html).
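Scaling to zero is configured through Application Auto Scaling. As a sketch (the endpoint and variant names are placeholders, and the call is shown commented out because it needs AWS credentials), registering the scalable target with a `MinCapacity` of 0 is what allows the endpoint to scale in to zero instances:

```python
# Sketch of registering an asynchronous endpoint variant with
# Application Auto Scaling so it can scale in to zero instances.
endpoint_name = "my-async-endpoint"  # placeholder name

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"endpoint/{endpoint_name}/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,  # allow scale-in to zero when the queue is empty
    "MaxCapacity": 4,
}

# With credentials in place:
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**scalable_target)
print(scalable_target["MinCapacity"])  # 0
```

You would then attach a scaling policy (for example, one based on the endpoint's backlog of queued requests) to drive scale-out when traffic arrives.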

Amazon SageMaker Serverless Inference also automatically scales down to zero. You won't see this happen because SageMaker AI manages the scaling of your serverless endpoints, but when an endpoint receives no traffic, the same underlying behavior applies.