

# Model servers for model deployment with Amazon SageMaker AI
<a name="deploy-model-frameworks"></a>

You can use popular model servers, such as TorchServe, DJL Serving, and Triton Inference Server, to deploy your models on SageMaker AI. The following topics explain how.

**Topics**
+ [Deploy models with TorchServe](deploy-models-frameworks-torchserve.md)
+ [Deploy models with DJL Serving](deploy-models-frameworks-djl-serving.md)
+ [Model deployment with Triton Inference Server](deploy-models-frameworks-triton.md)

# Deploy models with TorchServe
<a name="deploy-models-frameworks-torchserve"></a>

TorchServe is the recommended model server for PyTorch, preinstalled in the AWS PyTorch Deep Learning Container (DLC). This powerful tool offers customers a consistent and user-friendly experience, delivering high performance in deploying multiple PyTorch models across various AWS instances, including CPU, GPU, Neuron, and Graviton, regardless of the model size or distribution.

TorchServe supports a wide array of advanced features, including dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. Moreover, it seamlessly integrates PiPPy, PyTorch's large-model solution, enabling efficient handling of large models. Additionally, TorchServe extends its support to popular open-source libraries such as DeepSpeed, Accelerate, Fast Transformers, and more, expanding its capabilities even further. With TorchServe, AWS users can confidently deploy and serve their PyTorch models, taking advantage of its versatility and optimized performance across various hardware configurations and model types. For more detailed information, see the [PyTorch documentation](https://pytorch.org/serve/) and [TorchServe on GitHub](https://github.com/pytorch/serve).

The following table lists the AWS PyTorch DLCs supported by TorchServe.


| Instance type | SageMaker AI PyTorch DLC link | 
| --- | --- | 
| CPU and GPU | [SageMaker AI PyTorch containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) | 
| Neuron | [PyTorch Neuron containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) | 
| Graviton | [SageMaker AI PyTorch Graviton containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-graviton-containers-sm-support-only) | 

The following sections describe the setup to build and test PyTorch DLCs on Amazon SageMaker AI.

## Getting started
<a name="deploy-models-frameworks-torchserve-prereqs"></a>

To get started, ensure that you have the following prerequisites:

1. Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AWSServiceRoleForAmazonEKSNodegroup](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSServiceRoleForAmazonEKSNodegroup)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

1. Locally configure your dependencies, as shown in the following example:

   ```
   from datetime import datetime
   import os
   import json
   import logging
   import time

   # External Dependencies:
   import boto3
   from botocore.exceptions import ClientError
   import sagemaker

   sess = boto3.Session()
   sm = sess.client("sagemaker")
   region = sess.region_name
   account = boto3.client("sts").get_caller_identity().get("Account")

   smsess = sagemaker.Session(boto_session=sess)
   role = sagemaker.get_execution_role()

   # Configuration:
   bucket_name = smsess.default_bucket()
   prefix = "torchserve"
   output_path = f"s3://{bucket_name}/{prefix}/models"
   print(f"account={account}, region={region}, role={role}")
   ```

1. Retrieve the PyTorch DLC image, as shown in the following example.

   SageMaker AI PyTorch DLC images are available in all AWS regions. For more information, see the [list of DLC container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).

   ```
   baseimage = sagemaker.image_uris.retrieve(
       framework="pytorch",
       region="<region>",
       py_version="py310",
       image_scope="inference",
       version="2.0.1",
       instance_type="ml.g4dn.16xlarge",
   )
   ```

1. Create a local workspace.

   ```
   mkdir -p workspace/
   ```

## Adding a package
<a name="deploy-models-frameworks-torchserve-package"></a>

The following sections describe how to add and preinstall packages to your PyTorch DLC image.

**BYOC use cases**

The following steps outline how to add a package to your PyTorch DLC image. For more information about customizing your container, see [Building AWS Deep Learning Containers Custom Images](https://github.com/aws/deep-learning-containers/blob/master/custom_images.md).

1. Suppose you want to add a package to the PyTorch DLC docker image. Create a Dockerfile under the `docker` directory, as shown in the following example:

   ```
   mkdir -p workspace/docker
   cat workspace/docker/Dockerfile

   ARG BASE_IMAGE

   FROM $BASE_IMAGE

   # Install any additional libraries
   RUN pip install transformers==4.28.1
   ```

1. Build and publish the customized docker image by using the following [build_and_push.sh](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/workspace/docker/build_and_push.sh) script.

   ```
   # Download the build_and_push.sh script to workspace/docker
   ls workspace/docker
   build_and_push.sh  Dockerfile

   # Build and publish your docker image; baseimage, region, and account
   # correspond to the values from the earlier setup
   reponame="torchserve"
   versiontag="demo-0.1"

   ./build_and_push.sh "$reponame" "$versiontag" "$baseimage" "$region" "$account"
   ```

**SageMaker AI preinstall use cases**

The following example shows you how to preinstall a package to your PyTorch DLC container. You must create a `requirements.txt` file locally under the directory `workspace/code`.

```
mkdir -p workspace/code
cat workspace/code/requirements.txt

transformers==4.28.1
```

## Create TorchServe model artifacts
<a name="deploy-models-frameworks-torchserve-artifacts"></a>

In the following example, we use the pre-trained [MNIST model](https://github.com/pytorch/serve/tree/master/examples/image_classifier/mnist). We create a directory `workspace/mnist-dev`, implement [mnist_handler.py](https://github.com/pytorch/serve/blob/master/examples/image_classifier/mnist/mnist_handler.py) by following the [TorchServe custom service instructions](https://github.com/pytorch/serve/blob/master/docs/custom_service.md#custom-service), and [configure the model parameters](https://github.com/pytorch/serve/tree/master/model-archiver#config-file) (such as batch size and workers) in [model-config.yaml](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/workspace/lama/model-config.yaml). Then, we use the TorchServe tool `torch-model-archiver` to build the model artifacts and upload them to Amazon S3.

1. Configure the model parameters in `model-config.yaml`.

   ```
   ls -al workspace/mnist-dev

   mnist.py
   mnist_handler.py
   mnist_cnn.pt
   model-config.yaml

   # configure the model
   cat workspace/mnist-dev/model-config.yaml
   minWorkers: 1
   maxWorkers: 1
   batchSize: 4
   maxBatchDelay: 200
   responseTimeout: 300
   ```

1. Build the model artifacts by using [torch-model-archiver](https://github.com/pytorch/serve/tree/master/model-archiver#torch-model-archiver-for-torchserve).

   ```
   torch-model-archiver --model-name mnist --version 1.0 \
       --model-file workspace/mnist-dev/mnist.py \
       --serialized-file workspace/mnist-dev/mnist_cnn.pt \
       --handler workspace/mnist-dev/mnist_handler.py \
       --config-file workspace/mnist-dev/model-config.yaml \
       --archive-format tgz
   ```

   If you want to preinstall a package, you must include the `code` directory in the `tar.gz` file.

   ```
   cd workspace
   torch-model-archiver --model-name mnist --version 1.0 \
       --model-file mnist-dev/mnist.py \
       --serialized-file mnist-dev/mnist_cnn.pt \
       --handler mnist-dev/mnist_handler.py \
       --config-file mnist-dev/model-config.yaml \
       --archive-format no-archive

   cd mnist
   mv ../code .
   tar cvzf mnist.tar.gz .
   ```

1. Upload `mnist.tar.gz` to Amazon S3.

   ```
   # upload mnist.tar.gz to S3; set bucket_name and prefix
   # to the values from the earlier setup
   output_path="s3://${bucket_name}/${prefix}/models"
   aws s3 cp mnist.tar.gz "${output_path}/mnist.tar.gz"
   ```

## Using single model endpoints to deploy with TorchServe
<a name="deploy-models-frameworks-torchserve-single-model"></a>

The following example shows you how to create a [single model real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html), deploy the model to the endpoint, and test the endpoint by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).

```
import time

import numpy as np
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# create the single model endpoint and deploy it on SageMaker AI
model = Model(
    model_data=f"{output_path}/mnist.tar.gz",
    image_uri=baseimage,
    role=role,
    predictor_cls=Predictor,
    name="mnist",
    sagemaker_session=smsess,
)

endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# test the endpoint
dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
res = predictor.predict(dummy_data)
```

## Using multi-model endpoints to deploy with TorchServe
<a name="deploy-models-frameworks-torchserve-multi-model"></a>

[Multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) are a scalable and cost-effective solution to hosting large numbers of models behind one endpoint. They improve endpoint utilization by sharing the same fleet of resources and serving container to host all of your models. They also reduce deployment overhead because SageMaker AI manages dynamically loading and unloading models, as well as scaling resources based on traffic patterns. Multi-model endpoints are particularly useful for deep learning and generative AI models that require accelerated compute power.

By using TorchServe on SageMaker AI multi-model endpoints, you can speed up your development by using a serving stack that you are familiar with while leveraging the resource sharing and simplified model management that SageMaker AI multi-model endpoints provide.

The following example shows you how to create a multi-model endpoint, deploy the model to the endpoint, and test the endpoint by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). Additional details can be found in this [notebook example](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/torchserve/mme-gpu/torchserve_multi_model_endpoint.ipynb).

```
import time

import numpy as np
import sagemaker
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# create the model and the multi-model endpoint, and deploy on SageMaker AI
model = Model(
    model_data=f"{output_path}/mnist.tar.gz",
    image_uri=baseimage,
    role=role,
    sagemaker_session=smsess,
)

endpoint_name = "torchserve-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
mme = MultiDataModel(
    name=endpoint_name,
    model_data_prefix=output_path,
    model=model,
    sagemaker_session=smsess,
)

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

# list models
list(mme.list_models())

# create mnist v2 model artifacts (for example, in a shell: cp mnist.tar.gz mnistv2.tar.gz)
# then add mnistv2 to the endpoint
mme.add_model(model_data_source="mnistv2.tar.gz")

# list models
list(mme.list_models())

predictor = Predictor(endpoint_name=mme.endpoint_name, sagemaker_session=smsess)

# test the endpoint
dummy_data = {"inputs": np.random.rand(16, 1, 28, 28).tolist()}
res = predictor.predict(data=dummy_data, target_model="mnist.tar.gz")
```

## Metrics
<a name="deploy-models-frameworks-torchserve-metrics"></a>

TorchServe supports both system level and model level metrics. You can enable metrics in either log format mode or Prometheus mode through the environment variable `TS_METRICS_MODE`. You can use the TorchServe central metrics config file `metrics.yaml` to specify the types of metrics to be tracked, such as request counts, latency, memory usage, GPU utilization, and more. By referring to this file, you can gain insights into the performance and health of the deployed models and effectively monitor the TorchServe server's behavior in real-time. For more detailed information, see the [TorchServe metrics documentation](https://github.com/pytorch/serve/blob/master/docs/metrics.md#torchserve-metrics).

You can access TorchServe metrics logs that are similar to the StatsD format through the Amazon CloudWatch log filter. The following is an example of a TorchServe metrics log:

```
CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
DiskAvailable.Gigabytes:318.0416717529297|#Level:Host|#hostname:my_machine_name,timestamp:1682098185
```
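These lines follow a simple `name:value|#dimensions,timestamp:…` shape. As an illustrative sketch (not part of TorchServe or the AWS tooling), a hypothetical helper to split such a line into its parts for a custom log filter might look like this:

```
# Hypothetical helper: parse one TorchServe StatsD-style metrics line.
def parse_metric(line):
    metric, rest = line.split("|#", 1)
    name, value = metric.split(":", 1)
    dims_part, ts_part = rest.rsplit(",timestamp:", 1)
    dimensions = {}
    for pair in dims_part.replace("|#", ",").split(","):
        key, val = pair.split(":", 1)
        dimensions[key] = val
    return {
        "name": name,
        "value": float(value),
        "dimensions": dimensions,
        "timestamp": int(ts_part),
    }

line = "CPUUtilization.Percent:0.0|#Level:Host|#hostname:my_machine_name,timestamp:1682098185"
parsed = parse_metric(line)
```

Each parsed record then carries the metric name, numeric value, dimension map, and timestamp, which maps cleanly onto CloudWatch metric filters.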

# Deploy models with DJL Serving
<a name="deploy-models-frameworks-djl-serving"></a>

DJL Serving is a high-performance, universal, stand-alone model serving solution. It takes a deep learning model, several models, or workflows and makes them available through an HTTP endpoint.

You can use one of the DJL Serving [Deep Learning Containers (DLCs)](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html) to serve your models on AWS. To learn about the supported model types and frameworks, see the [DJL Serving GitHub repository](https://github.com/deepjavalibrary/djl-serving).

DJL Serving offers many features that help you to deploy your models with high performance:
+ Ease of use – DJL Serving can serve most models without any modifications. You bring your model artifacts, and DJL Serving can host them.
+ Multiple device and accelerator support – DJL Serving supports deploying models on CPUs, GPUs, and AWS Inferentia.
+ Performance – DJL Serving runs multithreaded inference in a single Java virtual machine (JVM) to boost throughput.
+ Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
+ Auto scaling – DJL Serving automatically scales workers up or down based on the traffic load.
+ Multi-engine support – DJL Serving can simultaneously host models using different frameworks (for example, PyTorch and TensorFlow).
+ Ensemble and workflow models – DJL Serving supports deploying complex workflows composed of multiple models and can execute parts of the workflow on CPUs and other parts on GPUs. Models within a workflow can leverage different frameworks.

The following sections describe how to set up an endpoint with DJL Serving on SageMaker AI.

## Getting started
<a name="deploy-models-frameworks-djl-prereqs"></a>

To get started, ensure that you have the following prerequisites:

1. Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For the purposes of testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

1. Ensure that you have the [docker](https://docs.docker.com/get-docker/) client set up on your system.

1. Log in to Amazon Elastic Container Registry and set the following environment variables:

   ```
   export ACCOUNT_ID=<your_account_id>
   export REGION=<your_region>
   aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
   ```

1. Pull the docker image.

   ```
   docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
   ```

   For all of the available DJL Serving container images, see the [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and the [DJL Serving CPU inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#djl-cpu-full-inference-containers). When choosing an image from the tables in the preceding links, replace the AWS region in the example URL column with the region you are in. The DLCs are available in the regions listed in the table at the top of the [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) page.
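Swapping in your region is a simple string substitution in the example URL. The following sketch assumes the registry account `763104351884`, which hosts these DLCs in most commercial regions; a few regions use a different registry account, so confirm against the tables linked above.

```
# Build the DLC image URI for your region (illustrative; registry account
# 763104351884 applies to most, but not all, regions)
REGION=us-east-1
IMAGE="763104351884.dkr.ecr.${REGION}.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118"
echo "$IMAGE"
```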

## Customize your container
<a name="deploy-models-frameworks-djl-byoc"></a>

You can add packages to the base DLC images to customize your container. Suppose you want to add a package to the `763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118` docker image. You must create a dockerfile with your desired image as the base image, add the required packages, and push the image to Amazon ECR.

To add a package, complete the following steps:

1. Specify instructions for running your desired libraries or packages in the base image's dockerfile.

   ```
   FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118

   ## add custom packages/libraries
   RUN git clone https://github.com/awslabs/amazon-sagemaker-examples
   ```

1. Build the Docker image from the dockerfile. Specify your Amazon ECR repository, the name of the base image, and a tag for the image. If you don't have an Amazon ECR repository, see [Using Amazon ECR with the AWS CLI](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html) in the *Amazon ECR User Guide* for instructions on how to create one.

   ```
   docker build -f Dockerfile -t <registry>/<image_name>:<image_tag> .
   ```

1. Push the Docker image to your Amazon ECR repository.

   ```
   docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/<image_name>:<image_tag>
   ```

You should now have a customized container image that you can use for model serving. For more examples of customizing your container, see [Building AWS Deep Learning Containers Custom Images](https://github.com/aws/deep-learning-containers/blob/master/custom_images.md).

## Prepare your model artifacts
<a name="deploy-models-frameworks-djl-artifacts"></a>

Before deploying your model on SageMaker AI, you must package your model artifacts in a `.tar.gz` file. DJL Serving accepts the following artifacts in your archive:
+ Model checkpoint: Files that store your model weights.
+ `serving.properties`: A configuration file that you can add for each model. Place `serving.properties` in the same directory as your model file.
+ `model.py`: The inference handler code. This is only applicable when using Python mode. If you don't specify `model.py`, djl-serving uses one of the default handlers.

The following is an example of a `model.tar.gz` structure:

```
 - model_root_dir # root directory
    - serving.properties            
    - model.py # your custom handler file for Python, if you choose not to use the default handlers provided by DJL Serving
    - model binary files # used for Java mode, or if you don't want to use option.model_id and option.s3_url for Python mode
```

DJL Serving supports Java engines powered by DJL or Python engines. Not all of the preceding artifacts are required; the required artifacts vary based on the mode you choose. For example, in Python mode, you only need to specify `option.model_id` in the `serving.properties` file; you don't need to specify the model checkpoint inside LMI containers. In Java mode, you are required to package the model checkpoint. For more details on how to configure `serving.properties` and operate with different engines, see [DJL Serving Operation Modes](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md).
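For example, a minimal `serving.properties` for Python mode in an LMI container might contain little more than the engine and the model location (the values shown are placeholders, not a recommended configuration):

```
engine=Python
option.model_id=<your-model-id-or-s3-uri>
```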

## Use single model endpoints to deploy with DJL Serving
<a name="deploy-models-frameworks-djl-single-model"></a>

After preparing your model artifacts, you can deploy your model to a SageMaker AI endpoint. This section describes how to deploy a single model to an endpoint with DJL Serving. If you're deploying multiple models, skip this section and go to [Use multi-model endpoints to deploy with DJL Serving](#deploy-models-frameworks-djl-mme).

The following example shows you a method to create a model object using the Amazon SageMaker Python SDK. You'll need to specify the following fields:
+ `image_uri`: You can either retrieve one of the base DJL Serving images as shown in this example, or you can specify a custom Docker image from your Amazon ECR repository, if you followed the instructions in [Customize your container](#deploy-models-frameworks-djl-byoc).
+ `model_s3_url`: This should be an Amazon S3 URI that points to your `.tar.gz` file.
+ `model_name`: Specify a name for the model object.

```
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker import image_uris, get_execution_role

aws_region = "aws-region"
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=aws_region))
role = get_execution_role()

def create_model(model_name, model_s3_url):
    # Get the DJL DeepSpeed image uri
    image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sagemaker_session.boto_session.region_name,
        version="0.20.0"
    )
    model = Model(
        image_uri=image_uri,
        model_data=model_s3_url,
        role=role,
        name=model_name,
        sagemaker_session=sagemaker_session,
    )
    return model
```

## Use multi-model endpoints to deploy with DJL Serving
<a name="deploy-models-frameworks-djl-mme"></a>

If you want to deploy multiple models to an endpoint, SageMaker AI offers multi-model endpoints, which are a scalable and cost-effective solution to deploying large numbers of models. DJL Serving also supports loading multiple models simultaneously and running inference on each of the models concurrently. DJL Serving containers adhere to the SageMaker AI multi-model endpoints contracts and can be used to deploy multi-model endpoints.

Each individual model artifact needs to be packaged in the same way as described in the previous section [Prepare your model artifacts](#deploy-models-frameworks-djl-artifacts). You can set model-specific configurations in the `serving.properties` file and model-specific inference handler code in `model.py`. For a multi-model endpoint, models need to be arranged in the following way:

```
root_dir
|-- model_1.tar.gz
|-- model_2.tar.gz
|-- model_3.tar.gz
|-- ...
```

The Amazon SageMaker Python SDK uses the [MultiDataModel](https://sagemaker.readthedocs.io/en/stable/api/inference/multi_data_model.html) object to instantiate a multi-model endpoint. The Amazon S3 URI for the root directory should be passed as the `model_data_prefix` argument to the `MultiDataModel` constructor.

DJL Serving also provides several configuration parameters to manage model memory requirements, such as `required_memory_mb` and `reserved_memory_mb`, that can be configured for each model in the [serving.properties](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md#servingproperties) file. These parameters are useful to handle out of memory errors more gracefully. For all of the configurable parameters, see [OutofMemory handling in djl-serving](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/out_of_memory_management.md).

The auto scaling feature of DJL Serving makes it easy to ensure that the models are scaled appropriately for incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (such as CPU cores or GPU devices). You can set lower and upper bounds for each model to ensure that a minimum traffic level can always be served, and that a single model does not consume all available resources. You can set the following properties in the [serving.properties](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/modes.md#servingproperties) file:
+ `gpu.minWorkers`: Minimum number of workers for GPUs.
+ `gpu.maxWorkers`: Maximum number of workers for GPUs.
+ `cpu.minWorkers`: Minimum number of workers for CPUs.
+ `cpu.maxWorkers`: Maximum number of workers for CPUs.
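As an illustrative sketch, the worker bounds and the memory-management parameters mentioned above might be combined in a model's `serving.properties` like this (the values are placeholders, not recommendations):

```
engine=Python
gpu.minWorkers=1
gpu.maxWorkers=4
required_memory_mb=2048
reserved_memory_mb=512
```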

For an end-to-end example of how to deploy a multi-model endpoint on SageMaker AI using a DJL Serving container, see the example notebook [Multi-Model-Inference-Demo.ipynb](https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/Multi-Model-Inference-Demo.ipynb).

# Model deployment with Triton Inference Server
<a name="deploy-models-frameworks-triton"></a>

[Triton Inference Server](https://github.com/triton-inference-server/server) is an open source inference serving software that streamlines AI inference. With Triton, you can deploy any model built with multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more.

The SageMaker AI Triton containers help you deploy Triton Inference Server on the SageMaker AI Hosting platform to serve trained models in production. They support the different modes in which SageMaker AI operates. For a list of Triton Inference Server containers available on SageMaker AI, see [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only).

For end-to-end notebook examples, we recommend taking a look at the [amazon-sagemaker-examples repository](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-triton).

## Hosting modes
<a name="deploy-models-frameworks-triton-modes"></a>

The following SageMaker AI Hosting modes are supported by Triton containers:
+ Single model endpoints
  + This is SageMaker AI’s default mode of operation. In this mode, the Triton container can load a single model, or a single ensemble model.
  + The name of the model must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker AI API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
+ Single model endpoints with ensemble
  + Triton Inference Server supports *ensemble*, which is a pipeline, or a DAG (directed acyclic graph) of models. While an ensemble technically consists of multiple models, in the default single model endpoint mode, SageMaker AI can treat the *ensemble proper* (the meta-model that represents the pipeline) as the main model to load, and can subsequently load the associated models.
  + The ensemble proper’s model name must be used to load the model. It must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
+ Multi-model endpoints
  + In this mode, SageMaker AI can serve multiple models on a single endpoint. You can use this mode by specifying the environment variable `'MultiModel': true` as a property of the container environment, which is part of the `CreateModel` SageMaker API call.
  + By default, no model is loaded when the instance starts. To run an inference request against a particular model, specify the corresponding model's `*.tar.gz` file as an argument to the `TargetModel` property of the `InvokeEndpoint` SageMaker API call.
+ Multi-model endpoints with ensemble
  + In this mode, SageMaker AI functions as described for multi-model endpoints. However, the SageMaker AI Triton container can load multiple ensemble models, meaning that multiple model pipelines can run on the same instance. SageMaker AI treats every ensemble as one model, and the ensemble proper of each model can be invoked by specifying the corresponding `*.tar.gz` archive as the `TargetModel`.
  + For better memory management during dynamic memory `LOAD` and `UNLOAD`, we recommend that you keep the ensemble size small.
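Following the environment settings described above, the container definitions for the single-model and multi-model modes can be sketched as boto3-style `CreateModel` dicts. The image URI and S3 locations are placeholders; this only illustrates the request shape and is not executed against AWS:

```
# Sketch of CreateModel container definitions for the hosting modes above.
# Image URI and S3 locations are placeholders.
triton_image = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"

# Single model endpoint (also used for a single ensemble): name the model
# (or the ensemble proper) that Triton should load.
single_model_container = {
    "Image": triton_image,
    "ModelDataUrl": "s3://<bucket>/models/model.tar.gz",
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "mymodel"},
}

# Multi-model endpoint: point at the prefix holding the *.tar.gz archives;
# individual models load on demand via the TargetModel property of InvokeEndpoint.
multi_model_container = {
    "Image": triton_image,
    "ModelDataUrl": "s3://<bucket>/models/",
    "Environment": {"MultiModel": "true"},
}
```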

## Inference payload types
<a name="deploy-models-frameworks-triton-payloads"></a>

Triton supports two methods of sending an inference payload over the network: `json` and `binary+json` (binary-encoded JSON). In both cases, the JSON payload includes the datatype, the shape, and the inference request tensor; with `binary+json`, the tensor contents are encoded as binary data that follows the JSON header.

With the `binary+json` format, you must specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. In the SageMaker AI Triton container, this is done using a custom `Content-Type` header: `application/vnd.sagemaker-triton.binary+json;json-header-size={}`. This is different from using the `Inference-Header-Content-Length` header on a stand-alone Triton Inference Server because custom headers are not allowed in SageMaker AI.
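As a sketch of how such a request might be assembled (the input name `INPUT0` and the tensor contents are illustrative, not tied to a specific model):

```
import json

import numpy as np

# Illustrative tensor; the input name must match the model's config.pbtxt.
tensor = np.zeros((1, 3), dtype=np.float32)

# JSON header describing the request tensor.
header = json.dumps({
    "inputs": [{
        "name": "INPUT0",
        "shape": list(tensor.shape),
        "datatype": "FP32",
        "parameters": {"binary_data_size": tensor.nbytes},
    }]
}).encode("utf-8")

# Raw tensor bytes are appended directly after the JSON header.
body = header + tensor.tobytes()

# SageMaker AI carries the header length in a custom Content-Type instead of
# the Inference-Header-Content-Length header.
content_type = f"application/vnd.sagemaker-triton.binary+json;json-header-size={len(header)}"
```

You would then pass `body` and `content_type` to `InvokeEndpoint` (for example, as the `Body` and `ContentType` arguments of the boto3 call).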

## Using config.pbtxt to set the model config
<a name="deploy-models-frameworks-triton-config"></a>

For Triton Inference Servers on SageMaker AI, each model must include a `config.pbtxt` file that specifies, at a minimum, the following configurations for the model:
+ `name`: While this is optional for models running outside of SageMaker AI, we recommend that you always provide a name for the models to be run in Triton on SageMaker AI.
+ [`platform` and/or `backend`](https://github.com/triton-inference-server/backend/blob/main/README.md#backends): Setting a backend is essential to specify the type of the model. Some backends have further classification, such as `tensorflow_savedmodel` or `tensorflow_graphdef`. Such options can be specified as part of the `platform` key in addition to the `backend` key. The most common backends are `tensorrt`, `onnxruntime`, `tensorflow`, `pytorch`, `python`, `dali`, `fil`, and `openvino`.
+ `input`: Specify three attributes for the input: `name`, `data_type` and `dims` (the shape).
+ `output`: Specify three attributes for the output: `name`, `data_type` and `dims` (the shape).
+ `max_batch_size`: Set the batch size to a value greater than or equal to 1 that indicates the maximum batch size that Triton should use with the model.
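
For example, a minimal `config.pbtxt` for a hypothetical PyTorch image classifier might look like the following (when `max_batch_size` is greater than 0, the `dims` exclude the batch dimension):

```
name: "resnet50"
backend: "pytorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```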

For more details on configuring `config.pbtxt`, see Triton’s GitHub [repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md). Triton provides several configurations for tweaking model behavior. Some of the most common and important configuration options are:
+ [Instance groups](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups): Instance groups specify the number of copies of a given model and where they are placed. They have the attributes `count`, `kind`, and `gpus` (used when `kind` is `KIND_GPU`). The `count` attribute is equivalent to the number of workers. For regular model serving, each worker has its own copy of the model. Similarly, in Triton, `count` specifies the number of model copies per device. For example, if the `instance_group` kind is `KIND_CPU`, then the CPU has `count` model copies.
**Note**  
On a GPU instance, the `instance_group` configuration applies per GPU device. For example, `count` number of model copies are placed on each GPU device unless you explicitly specify which GPU devices should load the model.
+ [Dynamic batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) and [sequence batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#stateful-models): Dynamic batching is used for stateless models, and sequence batching is used for stateful models (where you want to route a request to the same model instance every time). Batching schedulers enable a per-model queue, which helps increase throughput, depending on the batching configuration.
+ [Ensemble models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models): An ensemble model represents a *pipeline* of one or more models and the connection of input and output tensors between those models. It can be configured by specifying `platform` as `ensemble`. The ensemble configuration is just a representation of the model pipeline. On SageMaker AI, all the models under an ensemble are treated as dependents of the ensemble model and are counted as a single model for SageMaker AI metrics, such as `LoadedModelCount`.
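
As an illustration of the instance group and dynamic batching options above, a hypothetical `config.pbtxt` fragment might look like the following (all values are examples only):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

On a GPU instance, this fragment places two copies of the model on each GPU device, and requests queue for up to 100 microseconds so that Triton can form batches of the preferred sizes.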

## Publishing default Triton metrics to Amazon CloudWatch
<a name="deploy-models-frameworks-triton-metrics"></a>

The NVIDIA Triton Inference Container exposes metrics at port 8002 (configurable) for the different models and GPUs that are utilized in the Triton Inference Server. For full details of the default metrics that are available, see the GitHub page for the [Triton Inference Server metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md). These metrics are in Prometheus format and can be scraped using a Prometheus scraper configuration.

Starting with version v23.07, the SageMaker AI Triton container supports publishing these metrics to Amazon CloudWatch when you specify a few environment variables. To scrape the Prometheus metrics, the SageMaker AI Triton container uses the Amazon CloudWatch agent.

The required environment variables that you must specify to collect metrics are as follows:


| Environment variable | Description | Example value | 
| --- | --- | --- | 
|  `SAGEMAKER_TRITON_ALLOW_METRICS`  |  Specify this option to allow Triton to publish metrics to its Prometheus endpoint.  | "true" | 
|  `SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH`  |  Specify this option to start the pre-checks necessary to publish metrics to Amazon CloudWatch.  | "true" | 
|  `SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP`  |  Specify this option to point to the log group to which metrics are written.  | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesTest" | 
|  `SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE`  |  Specify this option to point to the metric namespace where you want to see and plot the metrics.  | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesPublicTest" | 
|  `SAGEMAKER_TRITON_METRICS_PORT`  |  Specify this as 8002, or any other port. If SageMaker AI has not blocked the specified port, it is used. Otherwise, another non-blocked port is chosen automatically.  | "8002" | 
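
For example, you might set these variables in the container definition that you pass to the `CreateModel` API. The image URI, S3 path, log group, and namespace below are placeholders, not values prescribed by this guide.

```python
def build_container_definition(image_uri, model_data_url):
    """Container definition that enables publishing Triton's default
    metrics to Amazon CloudWatch (all names are placeholders)."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            "SAGEMAKER_TRITON_ALLOW_METRICS": "true",
            "SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH": "true",
            "SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP":
                "/aws/SageMaker/Endpoints/TritonMetrics/MyEndpoint",
            "SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE":
                "TritonMetrics/MyEndpoint",
            "SAGEMAKER_TRITON_METRICS_PORT": "8002",
        },
    }

container = build_container_definition(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:23.07-py3",
    "s3://my-bucket/models/model.tar.gz",
)

# Pass this as PrimaryContainer (or in Containers) to
# sagemaker_client.create_model(ModelName=..., ExecutionRoleArn=...,
#                               PrimaryContainer=container)
```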

When publishing metrics with Triton on SageMaker AI, keep in mind the following limitations:
+ While you can generate custom metrics through the C-API and Python backend (v23.05 onwards), these are currently not supported for publishing to Amazon CloudWatch.
+ In SageMaker AI multi-model endpoints (MME) mode, Triton runs in an environment that requires model namespacing to be enabled because each model (except ensemble models) is treated as if it were in its own model repository. Currently, this creates a limitation for metrics. When model namespacing is enabled, Triton does not distinguish the metrics between two models with the same name that belong to different ensembles. As a workaround, make sure that every model you deploy has a unique name. This also makes it easier to look up your metrics in CloudWatch.

## Environment variables
<a name="deploy-models-frameworks-triton-variables"></a>

The following table lists the supported environment variables for Triton on SageMaker AI.


| Environment variable | Description | Type | Possible values | 
| --- | --- | --- | --- | 
| `SAGEMAKER_MULTI_MODEL` | Allows Triton to operate in SageMaker AI multi-model endpoints mode. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` | Specify the model to be loaded in the SageMaker AI single model (default) mode. For ensemble mode, specify the name of the top-level ensemble. | String | *<model\_name>* as specified in config.pbtxt | 
| `SAGEMAKER_TRITON_PING_MODE` | `'ready'` is the default mode in SageMaker AI's single model mode, and `'live'` is the default in SageMaker AI's multi-model endpoints mode. | String | `ready`, `live` | 
| `SAGEMAKER_TRITON_DISABLE_MODEL_NAMESPACING` | In the SageMaker AI Triton container, this is set to `true` by default. | Boolean | `true`, `false` | 
| `SAGEMAKER_BIND_TO_PORT` | On SageMaker AI, the default port is 8080. In multi-container scenarios, you can customize this to a different port. | String | *<port\_number>* | 
| `SAGEMAKER_SAFE_PORT_RANGE` | This is set by the SageMaker AI platform when using multi-container mode. | String | *<port\_1>*–*<port\_2>* | 
| `SAGEMAKER_TRITON_ALLOW_GRPC` | While SageMaker AI doesn't support GRPC currently, if you're using Triton in front of a custom reverse proxy, you may choose to enable GRPC. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_GRPC_PORT` | The default port for GRPC is 8001, but you can change it. | String | *<port\_number>* | 
| `SAGEMAKER_TRITON_THREAD_COUNT` | You can set the number of default HTTP request handler threads. | String | *<number>* | 
| `SAGEMAKER_TRITON_LOG_VERBOSE` | `true` by default on SageMaker AI, but you can selectively turn this option off. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_INFO` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_WARNING` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_LOG_ERROR` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE` | Specify the shared memory (shm) size for the Python backend, in bytes. The default value is 16 MB, but it can be increased. | String | *<number>* | 
| `SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE` | Specify the shared memory growth size for the Python backend, in bytes. The default value is 1 MB, but it can be increased to allow larger increments. | String | *<number>* | 
| `SAGEMAKER_TRITON_TENSORFLOW_VERSION` | The default value is `2`. Starting with Triton v23.04, Triton no longer supports TensorFlow 1. You can configure this variable only for earlier versions. | String | *<number>* | 
| `SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT` | Restrict the maximum percentage of GPU memory used for model loading, leaving the remainder for inference requests. | String | *<number>* | 
| `SAGEMAKER_TRITON_ALLOW_METRICS` | `false` by default on SageMaker AI. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_METRICS_PORT` | The default port is 8002. | String | *<number>* | 
| `SAGEMAKER_TRITON_PUBLISH_METRICS_TO_CLOUDWATCH` | `false` by default on SageMaker AI. Set this variable to `true` to allow pushing Triton default metrics to Amazon CloudWatch. If this option is enabled, you are responsible for CloudWatch costs when metrics are published to your account. | Boolean | `true`, `false` | 
| `SAGEMAKER_TRITON_CLOUDWATCH_LOG_GROUP` | Required if you've enabled metrics publishing to CloudWatch. | String | *<cloudwatch\_log\_group\_name>* | 
| `SAGEMAKER_TRITON_CLOUDWATCH_METRIC_NAMESPACE` | Required if you've enabled metrics publishing to CloudWatch. | String | *<cloudwatch\_metric\_namespace>* | 
| `SAGEMAKER_TRITON_ADDITIONAL_ARGS` | Appends any additional arguments when starting the Triton Server. | String | *<additional\_args>* | 