

# Amazon SageMaker Inference Recommender
<a name="inference-recommender"></a>

Amazon SageMaker Inference Recommender is a capability of Amazon SageMaker AI. It reduces the time required to get machine learning (ML) models in production by automating load testing and model tuning across SageMaker AI ML instances. You can use Inference Recommender to deploy your model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost. Inference Recommender helps you select the best instance type and configuration for your ML models and workloads. It considers factors like instance count, container parameters, model optimizations, max concurrency, and memory size.

Amazon SageMaker Inference Recommender charges you only for the instances used while your jobs are executing.

## How it Works
<a name="inference-recommender-how-it-works"></a>

To use Amazon SageMaker Inference Recommender, either [create a SageMaker AI model](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) or register a model with the SageMaker Model Registry using your model artifacts. Use the AWS SDK for Python (Boto3) or the SageMaker AI console to run benchmarking jobs for different SageMaker AI endpoint configurations. Inference Recommender jobs collect and visualize performance and resource utilization metrics to help you decide which endpoint type and configuration to choose.

## How to Get Started
<a name="inference-recommender-get-started"></a>

If you are a first-time user of Amazon SageMaker Inference Recommender, we recommend that you do the following:

1. Read through the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md) section to make sure you have satisfied the requirements to use Amazon SageMaker Inference Recommender.

1. Read through the [Recommendation jobs with Amazon SageMaker Inference Recommender](inference-recommender-recommendation-jobs.md) section to launch your first Inference Recommender recommendation jobs.

1. Explore the introductory Amazon SageMaker Inference Recommender [Jupyter notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-inference-recommender/inference-recommender.ipynb) example, or review the example notebooks in the following section.

## Example notebooks
<a name="inference-recommender-notebooks"></a>

The following example Jupyter notebooks can help you with the workflows for multiple use cases in Inference Recommender:
+ If you want an introductory notebook that benchmarks a TensorFlow model, see the [SageMaker Inference Recommender TensorFlow](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/inference-recommender.ipynb) notebook.
+ If you want to benchmark a HuggingFace model, see the [SageMaker Inference Recommender for HuggingFace](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/huggingface-inference-recommender/huggingface-inference-recommender.ipynb) notebook.
+ If you want to benchmark an XGBoost model, see the [SageMaker Inference Recommender XGBoost](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/xgboost/xgboost-inference-recommender.ipynb) notebook.
+ If you want to review CloudWatch metrics for your Inference Recommender jobs, see the [SageMaker Inference Recommender CloudWatch metrics](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/tensorflow-cloudwatch/tf-cloudwatch-inference-recommender.ipynb) notebook.

# Prerequisites for using Amazon SageMaker Inference Recommender
<a name="inference-recommender-prerequisites"></a>

Before you can use Amazon SageMaker Inference Recommender, you must complete the prerequisite steps. As an example, we show how to use a PyTorch (v1.7.1) ResNet-18 pre-trained model for both types of Amazon SageMaker Inference Recommender recommendation jobs. The examples shown use the AWS SDK for Python (Boto3).

**Note**  
The following code examples use Python. Remove the `!` prefix character if you run any of the following code samples in your terminal or AWS CLI.
You can run the following examples with the Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized) kernel in an Amazon SageMaker Studio notebook. For more information about Studio, see [Amazon SageMaker Studio](studio-updated.md).

1. **Create an IAM role for Amazon SageMaker AI.**

   Create an IAM role for Amazon SageMaker AI that has the `AmazonSageMakerFullAccess` IAM managed policy attached.

1. **Set up your environment.**

   Import dependencies and create variables for your AWS Region, your SageMaker AI IAM role (from Step 1), and the SageMaker AI client.

   ```
   !pip install --upgrade pip awscli botocore boto3  --quiet
   from sagemaker import get_execution_role, Session, image_uris
   import boto3
   
   region = boto3.Session().region_name
   role = get_execution_role()
   sagemaker_client = boto3.client("sagemaker", region_name=region)
   sagemaker_session = Session()
   ```

1. **(Optional) Review existing models benchmarked by Inference Recommender.**

   Inference Recommender benchmarks models from popular model zoos. Inference Recommender supports your model even if it is not already benchmarked.

   Use `ListModelMetaData` to get a response object that lists the domain, framework, task, and model name of machine learning models found in common model zoos.

   You use the domain, framework, framework version, task, and model name in later steps to both select an inference Docker image and register your model with SageMaker Model Registry. The following demonstrates how to list model metadata with SDK for Python (Boto3): 

   ```
   list_model_metadata_response = sagemaker_client.list_model_metadata()
   ```

   The output includes model summaries (`ModelMetadataSummaries`) and response metadata (`ResponseMetadata`) similar to the following example:

   ```
   {
       'ModelMetadataSummaries': [
           {
               'Domain': 'NATURAL_LANGUAGE_PROCESSING',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'bert-base-cased',
               'Task': 'FILL_MASK'
           },
           {
               'Domain': 'NATURAL_LANGUAGE_PROCESSING',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'bert-base-uncased',
               'Task': 'FILL_MASK'
           },
           {
               'Domain': 'COMPUTER_VISION',
               'Framework': 'MXNET:1.8.0',
               'Model': 'resnet18v2-gluon',
               'Task': 'IMAGE_CLASSIFICATION'
           },
           {
               'Domain': 'COMPUTER_VISION',
               'Framework': 'PYTORCH:1.6.0',
               'Model': 'resnet152',
               'Task': 'IMAGE_CLASSIFICATION'
           }
       ],
       'ResponseMetadata': {
           'HTTPHeaders': {
               'content-length': '2345',
               'content-type': 'application/x-amz-json-1.1',
               'date': 'Tue, 19 Oct 2021 20:52:03 GMT',
               'x-amzn-requestid': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
           },
           'HTTPStatusCode': 200,
           'RequestId': 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
           'RetryAttempts': 0
       }
   }
   ```
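
   Because the response is a plain Python dictionary, you can filter the summaries locally before deciding which metadata to reuse. The following sketch pulls out the model names for a given domain and task; the summaries are illustrative values that mirror the example response above:

   ```python
   # Illustrative ListModelMetadata summaries, mirroring the example response
   summaries = [
       {'Domain': 'NATURAL_LANGUAGE_PROCESSING', 'Framework': 'PYTORCH:1.6.0',
        'Model': 'bert-base-cased', 'Task': 'FILL_MASK'},
       {'Domain': 'COMPUTER_VISION', 'Framework': 'MXNET:1.8.0',
        'Model': 'resnet18v2-gluon', 'Task': 'IMAGE_CLASSIFICATION'},
       {'Domain': 'COMPUTER_VISION', 'Framework': 'PYTORCH:1.6.0',
        'Model': 'resnet152', 'Task': 'IMAGE_CLASSIFICATION'},
   ]

   def filter_models(summaries, domain, task):
       """Return the model names whose Domain and Task both match."""
       return [s['Model'] for s in summaries
               if s['Domain'] == domain and s['Task'] == task]

   cv_models = filter_models(summaries, 'COMPUTER_VISION', 'IMAGE_CLASSIFICATION')
   print(cv_models)  # ['resnet18v2-gluon', 'resnet152']
   ```

   In practice, you would pass `list_model_metadata_response['ModelMetadataSummaries']` instead of the sample list.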

   For this demo, we use a PyTorch (v1.7.1) ResNet-18 model to perform image classification. The following Python code sample stores the framework, framework version, domain, and task into variables for later use:

   ```
   # ML framework details
   framework = 'pytorch'
   framework_version = '1.7.1'
   
   # ML model details
   ml_domain = 'COMPUTER_VISION'
   ml_task = 'IMAGE_CLASSIFICATION'
   ```

1. **Upload your machine learning model to Amazon S3.**

   Use this PyTorch (v1.7.1) ResNet-18 model if you do not have a pre-trained machine learning model:

   ```
   # Optional: Download a sample PyTorch model
   import torch
   from torchvision import models, transforms, datasets
   
   # Create an example input for tracing
   image = torch.zeros([1, 3, 256, 256], dtype=torch.float32)
   
   # Load a pretrained resnet18 model from TorchHub
   model = models.resnet18(pretrained=True)
   
   # Tell the model we are using it for evaluation (not training). Note this is required for Inferentia compilation.
   model.eval()
   model_trace = torch.jit.trace(model, image)
   
   # Save your traced model
   model_trace.save('model.pth')
   ```

   Download a sample inference script `inference.py`. Create a `code` directory and move the inference script to the `code` directory.

   ```
   # Download the inference script
   !wget https://aws-ml-blog-artifacts.s3.us-east-2.amazonaws.com/inference.py
   
   # move it into a code/ directory
   !mkdir code
   !mv inference.py code/
   ```

   Amazon SageMaker AI requires pre-trained machine learning models to be packaged as a compressed TAR file (`*.tar.gz`). Compress your model and inference script to satisfy this requirement:

   ```
   !tar -czf test.tar.gz model.pth code/inference.py
   ```

   When your endpoint is provisioned, the files in the archive are extracted to `/opt/ml/model/` on the endpoint.
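
   If you prefer to build the archive from Python instead of the shell, the standard-library `tarfile` module produces the same layout. This is a sketch; the helper name `package_model` is our own, and it assumes `model.pth` and `code/inference.py` exist in the working directory when you call it:

   ```python
   import tarfile

   def package_model(archive_name, members):
       """Create the GZIP-compressed TAR archive (*.tar.gz) that
       SageMaker AI expects, preserving the relative paths in members."""
       with tarfile.open(archive_name, mode='w:gz') as tar:
           for path in members:
               tar.add(path)
       return archive_name

   # Example call (assumes the files exist):
   # package_model('test.tar.gz', ['model.pth', 'code/inference.py'])
   ```

   The relative paths you add become the paths under `/opt/ml/model/` on the endpoint, so keep the `code/` prefix on the inference script.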

   After you compress your model and model artifacts as a `.tar.gz` file, upload them to your Amazon S3 bucket. The following example demonstrates how to upload your model to Amazon S3 using the AWS CLI:

   ```
   !aws s3 cp test.tar.gz s3://{your-bucket}/models/
   ```

1. **Select a prebuilt Docker inference image or create your own Inference Docker Image.**

   SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker AI images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

   If none of the existing SageMaker AI containers meet your needs and you don't have an existing container of your own, create a new Docker image. See [Containers with custom inference code](your-algorithms-inference-main.md) for information about how to create your Docker image.

   The following demonstrates how to retrieve a PyTorch version 1.7.1 inference image using the SageMaker Python SDK:

   ```
   from sagemaker import image_uris
   
   ## Uncomment and replace with your own values if you did not define
   ## these variables in a previous step.
   #framework = 'pytorch'
   #framework_version = '1.7.1'
   
   # Note: you can use any CPU-based instance here;
   # this just sets the architecture to CPU for the Docker image.
   instance_type = 'ml.m5.2xlarge' 
   
   image_uri = image_uris.retrieve(framework, 
                                   region, 
                                   version=framework_version, 
                                   py_version='py3', 
                                   instance_type=instance_type, 
                                   image_scope='inference')
   ```

   For a list of available SageMaker AI Instances, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

1. **Create a sample payload archive.**

   Create an archive that contains individual files that the load testing tool can send to your SageMaker AI endpoints. Your inference code must be able to read the file formats from the sample payload.

   The following command downloads a .jpg image that this example uses as the sample payload for the ResNet-18 model.

   ```
   !wget https://cdn.pixabay.com/photo/2020/12/18/05/56/flowers-5841251_1280.jpg
   ```

   Compress the sample payload as a tarball:

   ```
   !tar -cvzf payload.tar.gz flowers-5841251_1280.jpg
   ```

   Upload the sample payload to Amazon S3 and note the Amazon S3 URI:

   ```
   !aws s3 cp payload.tar.gz s3://{bucket}/models/
   ```

   You need the Amazon S3 URI in a later step, so store it in a variable:

   ```
   bucket_prefix='models'
   bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
   payload_s3_key = f"{bucket_prefix}/payload.tar.gz"
   sample_payload_url= f"s3://{bucket}/{payload_s3_key}"
   ```

1. **Prepare your model input for the recommendations job**

   For the last prerequisite, you have two options to prepare your model input. You can either register your model with SageMaker Model Registry, which you can use to catalog models for production, or you can create a SageMaker AI model and specify it in the `ContainerConfig` field when creating a recommendations job. The first option is best if you want to take advantage of the features that [Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) provides, such as managing model versions and automating model deployment. The second option is ideal if you want to get started quickly. For the first option, go to step 7. For the second option, skip step 7 and go to step 8.

1. **Option 1: Register your model in the model registry**

   With SageMaker Model Registry, you can catalog models for production, manage model versions, associate metadata (such as training metrics) with a model, manage the approval status of a model, deploy models to production, and automate model deployment with CI/CD.

   When you use SageMaker Model Registry to track and manage your models, they are represented as a versioned model package within model package groups. Unversioned model packages are not part of a model group. Model package groups hold multiple versions or iterations of a model. Though it is not required to create them for every model in the registry, they help organize various models that all have the same purpose and provide automatic versioning.

   To use Amazon SageMaker Inference Recommender, you must have a versioned model package. You can create a versioned model package programmatically with the AWS SDK for Python (Boto3) or with Amazon SageMaker Studio Classic. To create a versioned model package programmatically, first create a model package group with the `CreateModelPackageGroup` API. Next, create a model package with the `CreateModelPackage` API; calling this API creates a versioned model package.

   See [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md) for detailed instructions on how to create a model package group and a versioned model package, respectively, with the AWS SDK for Python (Boto3) or Amazon SageMaker Studio Classic.

   The following code sample demonstrates how to create a versioned model package using the AWS SDK for Python (Boto3).
   **Note**  
   You do not need to approve the model package to create an Inference Recommender job.

   1. **Create a model package group**

      Create a model package group with the `CreateModelPackageGroup` API. Provide a name to the model package group for the `ModelPackageGroupName` and optionally provide a description of the model package in the `ModelPackageGroupDescription` field.

      ```
      model_package_group_name = '<INSERT>'
      model_package_group_description = '<INSERT>' 
      
      model_package_group_input_dict = {
       "ModelPackageGroupName" : model_package_group_name,
       "ModelPackageGroupDescription" : model_package_group_description,
      }
      
      model_package_group_response = sagemaker_client.create_model_package_group(**model_package_group_input_dict)
      ```

      See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [CreateModelPackageGroup](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackageGroup.html).

      Create a model package by specifying a Docker image that runs your inference code and the Amazon S3 location of your model artifacts, and by providing values for `InferenceSpecification`. `InferenceSpecification` should contain information about inference jobs that can be run with models based on this model package, including the following:
      + The Amazon ECR paths of images that run your inference code.
      + (Optional) The instance types that the model package supports for transform jobs and real-time endpoints used for inference.
      + The input and output content formats that the model package supports for inference.

      In addition, you must specify the following parameters when you create a model package:
      + [Domain](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Domain): The machine learning domain of your model package and its components. Common machine learning domains include computer vision and natural language processing.
      + [Task](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Task): The machine learning task your model package accomplishes. Common machine learning tasks include object detection and image classification. Specify "OTHER" if none of the tasks listed in the [API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) satisfy your use case. See the [Task](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-Task) API field descriptions for a list of supported machine learning tasks.
      + [SamplePayloadUrl](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-SamplePayloadUrl): The Amazon Simple Storage Service (Amazon S3) path where the sample payload is stored. This path must point to a single GZIP compressed TAR archive (.tar.gz suffix).
      + [Framework](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelPackageContainerDefinition.html#sagemaker-Type-ModelPackageContainerDefinition-Framework): The machine learning framework of the model package container image.
      + [FrameworkVersion](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelPackageContainerDefinition.html#sagemaker-Type-ModelPackageContainerDefinition-FrameworkVersion): The framework version of the model package container image.

      If you provide an allow list of instance types in the [SupportedRealtimeInferenceInstanceTypes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InferenceSpecification.html#sagemaker-Type-InferenceSpecification-SupportedRealtimeInferenceInstanceTypes) field, Inference Recommender limits the search space for instance types during a `Default` job. Use this parameter if you have budget constraints or know that a specific set of instance types can support your model and container image.

      In a previous step, we downloaded a pre-trained ResNet18 model and stored it in an Amazon S3 bucket in a directory called `models`. We retrieved a PyTorch (v1.7.1) Deep Learning Container inference image and stored the URI in a variable called `image_uri`. Use those variables in the following code sample to define a dictionary used as input to the [CreateModelPackage](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html) API.

      ```
      # Provide the Amazon S3 URI of your compressed tarfile
      # so that Model Registry knows where to find your model artifacts
      bucket_prefix='models'
      bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
      model_s3_key = f"{bucket_prefix}/test.tar.gz"
      model_url= f"s3://{bucket}/{model_s3_key}"
      
      # Similar open source model to the packaged model
      # The name of the ML model as standardized by common model zoos
      nearest_model_name = 'resnet18'
      
      # The supported MIME types for input and output data. In this example, 
      # we are using images as input.
      input_content_type='image/jpeg'
      
      
      # Optional - provide a description of your model.
      model_package_description = '<INSERT>'
      
      ## Uncomment if you did not store the domain and task in an earlier
      ## step 
      #ml_domain = 'COMPUTER_VISION'
      #ml_task = 'IMAGE_CLASSIFICATION'
      
      ## Uncomment if you did not store the framework and framework version
      ## in a previous step.
      #framework = 'PYTORCH'
      #framework_version = '1.7.1'
      
      # Optional: Used for optimizing your model using SageMaker Neo
      # PyTorch uses NCHW format for images
      data_input_configuration = "[[1,3,256,256]]"
      
      # Create a dictionary to use as input for creating a model package
      model_package_input_dict = {
              "ModelPackageGroupName" : model_package_group_name,
              "ModelPackageDescription" : model_package_description,
              "Domain": ml_domain,
              "Task": ml_task,
              "SamplePayloadUrl": sample_payload_url,
              "InferenceSpecification": {
                      "Containers": [
                          {
                              "Image": image_uri,
                              "ModelDataUrl": model_url,
                              "Framework": framework.upper(), 
                              "FrameworkVersion": framework_version,
                              "NearestModelName": nearest_model_name,
                              "ModelInput": {"DataInputConfig": data_input_configuration}
                          }
                          ],
                      "SupportedContentTypes": [input_content_type]
              }
          }
      ```
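
      Because `CreateModelPackage` rejects requests with missing fields, it can be convenient to sanity-check the dictionary locally before calling the API. The helper below is a hypothetical convenience of our own, not part of any SDK; it only confirms that the fields this walkthrough populates are present:

      ```python
      def validate_model_package_input(d):
          """Return a list of missing field names; an empty list means the
          input looks complete. The service performs full validation."""
          required = ['ModelPackageGroupName', 'Domain', 'Task',
                      'SamplePayloadUrl', 'InferenceSpecification']
          missing = [k for k in required if k not in d]
          for c in d.get('InferenceSpecification', {}).get('Containers', []):
              for k in ('Image', 'ModelDataUrl', 'Framework',
                        'FrameworkVersion', 'NearestModelName'):
                  if k not in c:
                      missing.append(f'Containers.{k}')
          return missing

      # An empty list means the input looks complete:
      # assert validate_model_package_input(model_package_input_dict) == []
      ```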

   1. **Create a model package**

      Use the `CreateModelPackage` API to create a model package. Pass the input dictionary defined in the previous step:

      ```
      model_package_response = sagemaker_client.create_model_package(**model_package_input_dict)
      ```

      You need the model package ARN to use Amazon SageMaker Inference Recommender. Note the ARN of the model package or store it in a variable:

      ```
      model_package_arn = model_package_response["ModelPackageArn"]
      
      print('ModelPackage Version ARN : {}'.format(model_package_arn))
      ```

1. **Option 2: Create a model and configure the `ContainerConfig` field**

   Use this option if you want to start an inference recommendations job and don't need to register your model in the Model Registry. In the following steps, you create a model in SageMaker AI and configure the `ContainerConfig` field as input for the recommendations job.

   1. **Create a model**

      Create a model with the `CreateModel` API. For an example that calls this method when deploying a model to SageMaker AI Hosting, see [Create a Model (AWS SDK for Python (Boto3))](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html#realtime-endpoints-deployment-create-model).

      In a previous step, we downloaded a pre-trained ResNet18 model and stored it in an Amazon S3 bucket in a directory called `models`. We retrieved a PyTorch (v1.7.1) Deep Learning Container inference image and stored the URI in a variable called `image_uri`. We use those variables in the following code example, where we define a dictionary used as input to the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API.

      ```
      model_name = '<name_of_the_model>'
       # Role that gives SageMaker AI permission to access AWS services
       sagemaker_role = "arn:aws:iam::<account-id>:role/<role-name>"
      
      # Provide the Amazon S3 URI of your compressed tarfile
      # so that Model Registry knows where to find your model artifacts
      bucket_prefix='models'
      bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
      model_s3_key = f"{bucket_prefix}/test.tar.gz"
      model_url= f"s3://{bucket}/{model_s3_key}"
      
      #Create model
      create_model_response = sagemaker_client.create_model(
          ModelName = model_name,
          ExecutionRoleArn = sagemaker_role, 
          PrimaryContainer = {
              'Image': image_uri,
              'ModelDataUrl': model_url,
          })
      ```

   1. **Configure the `ContainerConfig` field**

      Next, you must configure the [ContainerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RecommendationJobInputConfig.html#sagemaker-Type-RecommendationJobInputConfig-ContainerConfig) field with the model you just created and specify the following parameters in it:
      + `Domain`: The machine learning domain of the model and its components, such as computer vision or natural language processing.
      + `Task`: The machine learning task that the model accomplishes, such as image classification or object detection.
      + `PayloadConfig`: The configuration for the payload for a recommendation job. For more information about the subfields, see [RecommendationJobPayloadConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RecommendationJobPayloadConfig.html).
      + `Framework`: The machine learning framework of the container image, such as PyTorch.
      + `FrameworkVersion`: The framework version of the container image.
      + (Optional) `SupportedInstanceTypes`: A list of the instance types that are used to generate inferences in real-time.

      If you use the `SupportedInstanceTypes` parameter, Inference Recommender limits the search space for instance types during a `Default` job. Use this parameter if you have budget constraints or know that a specific set of instance types can support your model and container image.

      In the following code example, we use the previously defined parameters, along with `NearestModelName`, to define a dictionary used as input to the [CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html) API.

      ```
      ## Uncomment if you did not store the domain and task in a previous step
      #ml_domain = 'COMPUTER_VISION'
      #ml_task = 'IMAGE_CLASSIFICATION'
      
      ## Uncomment if you did not store the framework and framework version in a previous step
      #framework = 'PYTORCH'
      #framework_version = '1.7.1'
      
      # The name of the ML model as standardized by common model zoos
      nearest_model_name = 'resnet18'
      
      # The supported MIME types for input and output data. In this example, 
      # we are using images as input
      input_content_type='image/jpeg'
      
      # Optional: Used for optimizing your model using SageMaker Neo
      # PyTorch uses NCHW format for images
      data_input_configuration = "[[1,3,256,256]]"
      
      # Create a dictionary to use as input for creating an inference recommendation job
      container_config = {
              "Domain": ml_domain,
              "Framework": framework.upper(), 
              "FrameworkVersion": framework_version,
              "NearestModelName": nearest_model_name,
              "PayloadConfig": { 
                  "SamplePayloadUrl": sample_payload_url,
                  "SupportedContentTypes": [ input_content_type ]
               },
               "DataInputConfig": data_input_configuration,
               "Task": ml_task,
              }
      ```

# Recommendation jobs with Amazon SageMaker Inference Recommender
<a name="inference-recommender-recommendation-jobs"></a>

Amazon SageMaker Inference Recommender can make two types of recommendations:

1. Inference recommendations (`Default` job type) run a set of load tests on the recommended instance types. You can also run a load test for a serverless endpoint. To launch this type of recommendation job, you only need to provide a model package Amazon Resource Name (ARN). Inference recommendation jobs complete within 45 minutes.

1. Endpoint recommendations (`Advanced` job type) are based on a custom load test where you select your desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements. This job takes an average of 2 hours to complete depending on the job duration set and the total number of inference configurations tested.

Both types of recommendations use the same APIs to create, describe, and stop jobs. The output is a list of instance configuration recommendations with associated environment variables, cost, throughput, and latency metrics. Recommendation jobs also provide an initial instance count, which you can use to configure an autoscaling policy. To differentiate between the two types of jobs, when you’re creating a job through either the SageMaker AI console or the APIs, specify `Default` to create preliminary endpoint recommendations and `Advanced` for custom load testing and endpoint recommendations.

**Note**  
You do not need to do both types of recommendation jobs in your own workflow. You can do either independently of the other.
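
With the prerequisites in place, the request for a `Default` job is small. The following sketch builds the request dictionary; the job name, role ARN, and model package ARN are placeholder values, and the final API call is commented out because it requires valid AWS credentials:

```python
# Build the request for a Default Inference Recommender job.
# Replace the placeholder values with your own before calling the API.
default_job_request = {
    "JobName": "resnet18-default-job",   # must be unique in your account and Region
    "JobType": "Default",                # or "Advanced" for a custom load test
    "RoleArn": "arn:aws:iam::<account-id>:role/<role-name>",
    "InputConfig": {
        "ModelPackageVersionArn": "<your-model-package-arn>",
    },
}

# response = sagemaker_client.create_inference_recommendations_job(**default_job_request)
```

For an `Advanced` job, you would also populate fields such as the traffic pattern and stopping conditions in `InputConfig`, as described in [Run a custom load test](inference-recommender-load-test.md).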

Inference Recommender can also provide you with a list of prospective instances, or the top five instance types that are optimized for cost, throughput and latency for model deployment, along with a confidence score. You can choose these instances when deploying your model. Inference Recommender automatically performs benchmarking against your model for you to provide the prospective instances. Since these are preliminary recommendations, we recommend that you run further instance recommendation jobs to get more accurate results. To view the prospective instances, go to your SageMaker AI model details page. For more information, see [Get instant prospective instances](inference-recommender-prospective.md).

**Topics**
+ [Get instant prospective instances](inference-recommender-prospective.md)
+ [Inference recommendations](inference-recommender-instance-recommendation.md)
+ [Get an inference recommendation for an existing endpoint](inference-recommender-existing-endpoint.md)
+ [Stop your inference recommendation](instance-recommendation-stop.md)
+ [Compiled recommendations with Neo](inference-recommender-neo-compilation.md)
+ [Recommendation results](inference-recommender-interpret-results.md)
+ [Get autoscaling policy recommendations](inference-recommender-autoscaling.md)
+ [Run a custom load test](inference-recommender-load-test.md)
+ [Stop your load test](load-test-stop.md)
+ [Troubleshoot Inference Recommender errors](inference-recommender-troubleshooting.md)

# Get instant prospective instances
<a name="inference-recommender-prospective"></a>

Inference Recommender can also provide you with a list of *prospective instances*, or instance types that might be suitable for your model, on your SageMaker AI model details page. Inference Recommender automatically performs preliminary benchmarking against your model for you to provide the top five prospective instances. Since these are preliminary recommendations, we recommend that you run further instance recommendation jobs to get more accurate results.

You can view a list of prospective instances for your model programmatically by using the [DescribeModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeModel.html) API or the SageMaker Python SDK, or interactively through the SageMaker AI console.

**Note**  
You won’t get prospective instances for models that you created in SageMaker AI before this feature became available.
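
If you use the API, the prospective instances appear in the response's `DeploymentRecommendation` field. The following sketch extracts the recommended instance types from a `DescribeModel` response; the field names follow the API reference, and the model name in the usage comment is a placeholder:

```python
# Sketch: list prospective instance types from a DescribeModel response.
# Models created before this feature became available may not include
# a DeploymentRecommendation field.
def prospective_instance_types(describe_model_response):
    """Extract recommended instance types from the DeploymentRecommendation field."""
    recommendation = describe_model_response.get("DeploymentRecommendation", {})
    return [
        rec["InstanceType"]
        for rec in recommendation.get("RealTimeInferenceRecommendations", [])
    ]

# Example usage (replace the model name with your own):
# import boto3
# sagemaker_client = boto3.client("sagemaker")
# response = sagemaker_client.describe_model(ModelName="my-model")
# print(prospective_instance_types(response))
```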

To view the prospective instances for your model through the console, do the following:

1. Go to the SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Models**.

1. From the list of models, choose your model.

On the details page for your model, go to the **Prospective instances to deploy model** section. The following screenshot shows this section.

![\[Screenshot of the list of prospective instances on the model details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-prospective.png)


In this section, you can view the prospective instances that are optimized for cost, throughput, and latency for model deployment, along with additional information for each instance type such as the memory size, CPU and GPU count, and cost per hour.

If you decide that you want to benchmark a sample payload and run a full inference recommendation job for your model, you can start a default inference recommendation job from this page. To start a default job through the console, do the following:

1. On your model details page, in the **Prospective instances to deploy model** section, choose **Run Inference recommender job**.

1. In the dialog box that pops up, for **S3 bucket for benchmarking payload**, enter the Amazon S3 location where you’ve stored a sample payload for your model.

1. For **Payload content type**, enter the MIME types for your payload data.

1. (Optional) In the **Model compilation using SageMaker Neo** section, for the **Data input configuration**, enter a data shape in dictionary format.

1. Choose **Run job**.

Inference Recommender starts the job, and you can view the job and its results from the **Inference recommender** list page in the SageMaker AI console.

If you want to run an advanced job and perform custom load tests, or if you want to configure additional settings and parameters for your job, see [Run a custom load test](inference-recommender-load-test.md).

# Inference recommendations
<a name="inference-recommender-instance-recommendation"></a>

Inference recommendation jobs run a set of load tests on recommended instance types or a serverless endpoint. Inference recommendation jobs use performance metrics that are based on load tests using the sample data you provided during model version registration.

**Note**  
Before you create an Inference Recommender recommendation job, make sure you have satisfied the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md).

The following sections demonstrate how to use Amazon SageMaker Inference Recommender to create an inference recommendation for your model by using the AWS SDK for Python (Boto3), the AWS CLI, Amazon SageMaker Studio Classic, or the SageMaker AI console.

**Topics**
+ [Create an inference recommendation](instance-recommendation-create.md)
+ [Get your inference recommendation job results](instance-recommendation-results.md)

# Create an inference recommendation
<a name="instance-recommendation-create"></a>

Create an inference recommendation programmatically by using the AWS SDK for Python (Boto3) or the AWS CLI, or interactively by using Studio Classic or the SageMaker AI console. Specify a job name for your inference recommendation, an AWS IAM role ARN, and an input configuration. Also specify either the model package ARN from when you registered your model with the model registry, or your model name and a `ContainerConfig` dictionary from when you created your model in the **Prerequisites** section.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html) API to start an inference recommendation job. Set the `JobType` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ A model package ARN or model name. Inference Recommender supports either one model package ARN or a model name as input. Specify one of the following:
  + The ARN of the versioned model package you created when you registered your model with SageMaker AI model registry. Define this for `ModelPackageVersionArn` in the `InputConfig` field.
  + The name of the model you created. Define this for `ModelName` in the `InputConfig` field. Also, provide the `ContainerConfig` dictionary, which includes the required fields that need to be provided with the model name. Define this for `ContainerConfig` in the `InputConfig` field. In the `ContainerConfig`, you can also optionally specify the `SupportedEndpointType` field as either `RealTime` or `Serverless`. If you specify this field, Inference Recommender returns recommendations for only that endpoint type. If you don't specify this field, Inference Recommender returns recommendations for both endpoint types.
+ A name for your Inference Recommender recommendation job for the `JobName` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.

Import the AWS SDK for Python (Boto3) package and create a SageMaker AI client object using the client class. If you followed the steps in the **Prerequisites** section, specify only one of the following:
+ Option 1: If you would like to create an inference recommendations job with a model package ARN, store the model package ARN in a variable named `model_package_arn`.
+ Option 2: If you would like to create an inference recommendations job with a model name and `ContainerConfig`, store the model name in a variable named `model_name` and the `ContainerConfig` dictionary in a variable named `container_config`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<INSERT>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide only one of model package ARN or model name, not both.
# Provide your model package ARN that was created when you registered your 
# model with Model Registry 
model_package_arn = '<INSERT>'
## Uncomment if you would like to create an inference recommendations job with a
## model name instead of a model package ARN, and comment out model_package_arn above
## Provide your model name
# model_name = '<INSERT>'
## Provide your container config 
# container_config = '<INSERT>'

# Provide a unique job name for SageMaker Inference Recommender job
job_name = '<INSERT>'

# Inference Recommender job type. Set to Default to get an initial recommendation
job_type = 'Default'

# Provide an IAM Role that gives SageMaker Inference Recommender permission to 
# access AWS services
role_arn = 'arn:aws:iam::<account>:role/*'

sagemaker_client.create_inference_recommendations_job(
    JobName = job_name,
    JobType = job_type,
    RoleArn = role_arn,
    # Provide only one of model package ARN or model name, not both. 
    # If you would like to create an inference recommendations job with a model name,
    # uncomment ModelName and ContainerConfig, and comment out ModelPackageVersionArn.
    InputConfig = {
        'ModelPackageVersionArn': model_package_arn
        # 'ModelName': model_name,
        # 'ContainerConfig': container_config
    }
)
```
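
If you chose Option 2, the `container_config` dictionary might look like the following sketch. The field values are illustrative assumptions for an image-classification PyTorch model; substitute the framework, nearest model, sample payload location, and data shape for your own model:

```python
# Illustrative ContainerConfig for Option 2. Replace every value, including
# the S3 sample payload location, with details that match your own model.
container_config = {
    "Domain": "COMPUTER_VISION",
    "Framework": "PYTORCH",
    "FrameworkVersion": "1.7.1",
    "NearestModelName": "resnet18",
    "PayloadConfig": {
        "SamplePayloadUrl": "s3://your-bucket/payload.tar.gz",
        "SupportedContentTypes": ["image/jpeg"],
    },
    # Optional: restrict recommendations to one endpoint type
    # ("RealTime" or "Serverless"). Omit to get both.
    "SupportedEndpointType": "RealTime",
    "DataInputConfig": "[[1,3,256,256]]",
    "Task": "IMAGE_CLASSIFICATION",
}
```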

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [CreateInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html).
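
After you start the job, you might want to wait for it to finish before retrieving results. A minimal polling sketch follows; the terminal status values are assumptions based on common SageMaker job states:

```python
# Sketch: poll an Inference Recommender job until it reaches a terminal status.
import time

def wait_for_recommendations_job(sagemaker_client, job_name, poll_seconds=60):
    """Poll DescribeInferenceRecommendationsJob until the job finishes."""
    while True:
        status = sagemaker_client.describe_inference_recommendations_job(
            JobName=job_name
        )["Status"]
        if status in ("COMPLETED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_seconds)

# Example usage:
# final_status = wait_for_recommendations_job(sagemaker_client, job_name)
```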

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` API to start an inference recommendation job. Set the `job-type` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ The Amazon Resource Name (ARN) of an IAM role that enables Amazon SageMaker Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ A model package ARN or model name. Inference Recommender supports either one model package ARN or a model name as input. Specify one of the following:
  + The ARN of the versioned model package you created when you registered your model with Model Registry. Define this for `ModelPackageVersionArn` in the `input-config` field.
  + The name of the model you created. Define this for `ModelName` in the `input-config` field. Also, provide the `ContainerConfig` dictionary, which includes the required fields that need to be provided with the model name. Define this for `ContainerConfig` in the `input-config` field. In the `ContainerConfig`, you can also optionally specify the `SupportedEndpointType` field as either `RealTime` or `Serverless`. If you specify this field, Inference Recommender returns recommendations for only that endpoint type. If you don't specify this field, Inference Recommender returns recommendations for both endpoint types.
+ A name for your Inference Recommender recommendation job for the `job-name` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.

To create an inference recommendation job with a model package ARN, use the following example:

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job_name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/<role> \
    --input-config "{
        \"ModelPackageVersionArn\": \"arn:aws:sagemaker:<region>:<account>:model-package/<model-package-version>\"
        }"
```

To create an inference recommendation job with a model name and `ContainerConfig`, use the following example. The example uses the `SupportedEndpointType` field to specify that we only want real-time inference recommendations:

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job_name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/<role> \
    --input-config "{
        \"ModelName\": \"model-name\",
        \"ContainerConfig\": {
                \"Domain\": \"COMPUTER_VISION\",
                \"Framework\": \"PYTORCH\",
                \"FrameworkVersion\": \"1.7.1\",
                \"NearestModelName\": \"resnet18\",
                \"PayloadConfig\": {
                        \"SamplePayloadUrl\": \"s3://{bucket}/{payload_s3_key}\",
                        \"SupportedContentTypes\": [\"image/jpeg\"]
                    },
                \"SupportedEndpointType\": \"RealTime\",
                \"DataInputConfig\": \"[[1,3,256,256]]\",
                \"Task\": \"IMAGE_CLASSIFICATION\"
            }
        }"
```

------
#### [ Amazon SageMaker Studio Classic ]

Create an inference recommendation job in Studio Classic.

1. In your Studio Classic application, choose the home icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)).

1. In the left sidebar of Studio Classic, choose **Models**.

1. Choose **Model Registry** from the dropdown list to display models you have registered with the model registry.

   The left panel displays a list of model groups. The list includes all the model groups registered with the model registry in your account, including models registered outside of Studio Classic.

1. Select the name of your model group. When you select your model group, the right pane of Studio Classic displays column heads such as **Versions** and **Setting**.

   If you have one or more model packages within your model group, you see a list of those model packages within the **Versions** column.

1. Choose the **Inference recommender** column.

1. Choose an IAM role that grants Inference Recommender permission to access AWS services. You can create a role and attach the `AmazonSageMakerFullAccess` IAM managed policy to accomplish this. Or you can let Studio Classic create a role for you.

1. Choose **Get recommendations**.

   The inference recommendation can take up to 45 minutes.
**Warning**  
Do not close this tab. If you close this tab, you cancel the instance recommendation job.

------
#### [ SageMaker AI console ]

Create an instance recommendation job through the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose **Create job**.

1. For **Step 1: Model configuration**, do the following:

   1. For **Job type**, choose **Default recommender job**.

   1. If you’re using a model registered in the SageMaker AI model registry, then turn on the **Choose a model from the model registry** toggle and do the following:

      1. From the **Model group** dropdown list, choose the model group in SageMaker AI model registry where your model is located.

      1. From the **Model version** dropdown list, choose the desired version of your model.

   1. If you’re using a model that you’ve created in SageMaker AI, then turn off the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model name** field, enter the name of your SageMaker AI model.

   1. From the **IAM role** dropdown list, you can select an existing AWS IAM role that has the necessary permissions to create an instance recommendation job. Alternatively, if you don’t have an existing role, you can choose **Create a new role** to open the role creation pop-up, and SageMaker AI adds the necessary permissions to the new role that you create.

   1. For **S3 bucket for benchmarking payload**, enter the Amazon S3 path to your sample payload archive, which should contain sample payload files that Inference Recommender uses to benchmark your model on different instance types.

   1. For **Payload content type**, enter the MIME types of your sample payload data.

   1. (Optional) If you turned off the **Choose a model from the model registry** toggle and specified a SageMaker AI model, then for **Container configuration**, do the following:

      1. For the **Domain** dropdown list, select the machine learning domain of the model, such as computer vision, natural language processing, or machine learning.

      1. For the **Framework** dropdown list, select the framework of your container, such as TensorFlow or XGBoost.

      1. For **Framework version**, enter the framework version of your container image.

      1. For the **Nearest model name** dropdown list, select the pre-trained model that most closely matches your own.

      1. For the **Task** dropdown list, select the machine learning task that the model accomplishes, such as image classification or regression.

   1. (Optional) For **Model compilation using SageMaker Neo**, you can configure the recommendation job for a model that you’ve compiled using SageMaker Neo. For **Data input configuration**, enter the correct input data shape for your model in a format similar to `{'input':[1,1024,1024,3]}`.

   1. Choose **Next**.

1. For **Step 2: Instances and environment parameters**, do the following:

   1. (Optional) For **Select instances for benchmarking**, you can select up to 8 instance types that you want to benchmark. If you don’t select any instances, Inference Recommender considers all instance types.

   1. Choose **Next**.

1. For **Step 3: Job parameters**, do the following:

   1. (Optional) For the **Job name** field, enter a name for your instance recommendation job. When you create the job, SageMaker AI appends a timestamp to the end of this name.

   1. (Optional) For the **Job description** field, enter a description for the job.

   1. (Optional) For the **Encryption key** dropdown list, choose an AWS KMS key by name or enter its ARN to encrypt your data.

   1. (Optional) For **Max test duration (s)**, enter the maximum number of seconds you want each test to run for.

   1. (Optional) For **Max invocations per minute**, enter the maximum number of requests per minute the endpoint can reach before stopping the recommendation job. After reaching this limit, SageMaker AI ends the job.

   1. (Optional) For **P99 Model latency threshold (ms)**, enter the model latency percentile in milliseconds.

   1. Choose **Next**.

1. For **Step 4: Review job**, review your configurations and then choose **Submit**.

------

# Get your inference recommendation job results
<a name="instance-recommendation-results"></a>

Collect the results of your inference recommendation job programmatically with AWS SDK for Python (Boto3), the AWS CLI, Studio Classic, or the SageMaker AI console.

------
#### [ AWS SDK for Python (Boto3) ]

Once an inference recommendation is complete, you can use `DescribeInferenceRecommendationsJob` to get the job details and recommendations. Provide the job name that you used when you created the inference recommendation job.

```
job_name='<INSERT>'
response = sagemaker_client.describe_inference_recommendations_job(
                    JobName=job_name)
```

Print the response object. The previous code sample stored the response in a variable named `response`.

```
print(response)
```

This returns a JSON response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Default', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 20, 4, 57, 627000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 20, 25, 1, 997000, tzinfo=tzlocal()), 
    'InputConfig': {
                'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
                'JobDurationInSeconds': 0
                }, 
    'InferenceRecommendations': [{
            'Metrics': {
                'CostPerHour': 0.20399999618530273, 
                'CostPerInference': 5.246913588052848e-06, 
                'MaximumInvocations': 648, 
                'ModelLatency': 263596
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5.xlarge', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
         }, 
         {
            'Metrics': {
                'CostPerHour': 0.11500000208616257, 
                'CostPerInference': 2.92620870823157e-06, 
                'MaximumInvocations': 655, 
                'ModelLatency': 826019
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5d.large', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
            }, 
            {
                'Metrics': {
                    'CostPerHour': 0.11500000208616257, 
                    'CostPerInference': 3.3625731248321244e-06, 
                    'MaximumInvocations': 570, 
                    'ModelLatency': 1085446
                    }, 
                'EndpointConfiguration': {
                    'EndpointName': 'endpoint-name', 
                    'VariantName': 'variant-name', 
                    'InstanceType': 'ml.m5.large', 
                    'InitialInstanceCount': 1
                    }, 
                'ModelConfiguration': {
                    'Compiled': False, 
                    'EnvironmentParameters': []
                    }
            }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1685', 
            'date': 'Tue, 26 Oct 2021 20:31:10 GMT'
            }, 
        'RetryAttempts': 0
        }
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, the job ARN, the job status, and the creation and last modified times. 

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch Events. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.
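
You could use these metrics to rank the recommendations programmatically. The following sketch picks the entry with the lowest estimated cost per inference from a response shaped like the previous example:

```python
# Sketch: choose the recommendation with the lowest estimated cost per inference.
def cheapest_recommendation(inference_recommendations):
    """Return the entry whose Metrics.CostPerInference is smallest."""
    return min(
        inference_recommendations,
        key=lambda rec: rec["Metrics"]["CostPerInference"],
    )

# Example usage with a DescribeInferenceRecommendationsJob response:
# best = cheapest_recommendation(response["InferenceRecommendations"])
# print(best["EndpointConfiguration"]["InstanceType"])
```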

The following example shows the `InferenceRecommendations` part of the response for an inference recommendations job configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the metrics returned for a serverless endpoint with the given `MemorySizeInMB` and when `MaxConcurrency = 1`. To increase the throughput possible on the endpoint, increase the value of `MaxConcurrency` linearly. For example, if the inference recommendation shows `MaxInvocations` as `1000`, then increasing `MaxConcurrency` to `2` would support 2000 `MaxInvocations`. Note that this is true only up to a certain point, which can vary based on your model and code. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
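
The linear scaling described above can be sketched as a quick estimate. This is an approximation that holds only up to a model-dependent point, not an API call:

```python
# Sketch: estimate serverless throughput when raising MaxConcurrency,
# assuming the linear scaling described above (valid only up to a point
# that varies with your model and code).
def estimated_max_invocations(max_invocations_at_concurrency_1, max_concurrency):
    return max_invocations_at_concurrency_1 * max_concurrency

# With the example values above: 1000 invocations per minute at
# MaxConcurrency = 1 scales to an estimated 2000 at MaxConcurrency = 2.
```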

------
#### [ AWS CLI ]

Once an inference recommendation is complete, you can use `describe-inference-recommendations-job` to get the job details and recommended instance types. Provide the job name that you used when you created the inference recommendation job.

```
aws sagemaker describe-inference-recommendations-job\
    --job-name <job-name>\
    --region <aws-region>
```

The JSON response should resemble the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Default', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 20, 4, 57, 627000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 20, 25, 1, 997000, tzinfo=tzlocal()), 
    'InputConfig': {
                'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
                'JobDurationInSeconds': 0
                }, 
    'InferenceRecommendations': [{
            'Metrics': {
                'CostPerHour': 0.20399999618530273, 
                'CostPerInference': 5.246913588052848e-06, 
                'MaximumInvocations': 648, 
                'ModelLatency': 263596
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5.xlarge', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
         }, 
         {
            'Metrics': {
                'CostPerHour': 0.11500000208616257, 
                'CostPerInference': 2.92620870823157e-06, 
                'MaximumInvocations': 655, 
                'ModelLatency': 826019
                }, 
            'EndpointConfiguration': {
                'EndpointName': 'endpoint-name', 
                'VariantName': 'variant-name', 
                'InstanceType': 'ml.c5d.large', 
                'InitialInstanceCount': 1
                }, 
            'ModelConfiguration': {
                'Compiled': False, 
                'EnvironmentParameters': []
                }
            }, 
            {
                'Metrics': {
                    'CostPerHour': 0.11500000208616257, 
                    'CostPerInference': 3.3625731248321244e-06, 
                    'MaximumInvocations': 570, 
                    'ModelLatency': 1085446
                    }, 
                'EndpointConfiguration': {
                    'EndpointName': 'endpoint-name', 
                    'VariantName': 'variant-name', 
                    'InstanceType': 'ml.m5.large', 
                    'InitialInstanceCount': 1
                    }, 
                'ModelConfiguration': {
                    'Compiled': False, 
                    'EnvironmentParameters': []
                    }
            }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1685', 
            'date': 'Tue, 26 Oct 2021 20:31:10 GMT'
            }, 
        'RetryAttempts': 0
        }
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, the job ARN, the job status, and the creation and last modified times. 

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch Events. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.

The following example shows the `InferenceRecommendations` part of the response for an inference recommendations job configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the metrics returned for a serverless endpoint with the given `MemorySizeInMB` and when `MaxConcurrency = 1`. To increase the throughput possible on the endpoint, increase the value of `MaxConcurrency` linearly. For example, if the inference recommendation shows `MaxInvocations` as `1000`, then increasing `MaxConcurrency` to `2` would support 2000 `MaxInvocations`. Note that this is true only up to a certain point, which can vary based on your model and code. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).

------
#### [ Amazon SageMaker Studio Classic ]

The inference recommendations populate in a new **Inference recommendations** tab within Studio Classic. It can take up to 45 minutes for the results to show up. This tab contains **Results** and **Details** column headings.

The **Details** column provides information about the inference recommendation job, such as the name of the inference recommendation, when the job was created (**Creation time**), and more. It also provides **Settings** information, such as the maximum number of invocations that occurred per minute and information about the Amazon Resource Names used.

The **Results** column provides a **Deployment goals** and **SageMaker AI recommendations** window in which you can adjust the order that the results are displayed based on deployment importance. There are three dropdown menus that you can use to set the level of importance of **Cost**, **Latency**, and **Throughput** for your use case. For each goal (cost, latency, and throughput), you can set the level of importance: **Lowest Importance**, **Low Importance**, **Moderate importance**, **High importance**, or **Highest importance**.

Based on your selections of importance for each goal, Inference Recommender displays its top recommendation in the **SageMaker recommendation** field on the right of the panel, along with the estimated cost per hour and inference request. It also provides information about the expected model latency, maximum number of invocations, and the number of instances. For serverless recommendations, you can see the ideal values for the maximum concurrency and endpoint memory size.

In addition to the top recommendation displayed, you can also see the same information displayed for all instances that Inference Recommender tested in the **All runs** section.

------
#### [ SageMaker AI console ]

You can view your instance recommendation jobs in the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose the name of your inference recommendation job.

On the details page for your job, you can view the **Inference recommendations**, which are the instance types SageMaker AI recommends for your model, as shown in the following screenshot.

![\[Screenshot of the inference recommendations list on the job details page in the SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-instant-recs.png)


In this section, you can compare the instance types by various factors such as **Model latency**, **Cost per hour**, **Cost per inference**, and **Invocations per minute**.

On this page, you can also view the configurations you specified for your job. In the **Monitor** section, you can view the Amazon CloudWatch metrics that were logged for each instance type. To learn more about interpreting these metrics, see [Interpret results](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html).

------

For more information about interpreting the results of your recommendation job, see [Recommendation results](inference-recommender-interpret-results.md).

# Get an inference recommendation for an existing endpoint
<a name="inference-recommender-existing-endpoint"></a>

Inference recommendation jobs run a set of load tests on recommended instance types and an existing endpoint. Inference recommendation jobs use performance metrics that are based on load tests using the sample data you provided during model version registration.

You can benchmark and get inference recommendations for an existing SageMaker AI Inference endpoint to help you improve the performance of your endpoint. The procedure of getting recommendations for an existing SageMaker AI Inference endpoint is similar to the procedure for [getting inference recommendations](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-instance-recommendation.html) without an endpoint. There are several feature exclusions to take note of when benchmarking an existing endpoint:
+ You can only use one existing endpoint per Inference Recommender job.
+ You can only have one variant on your endpoint.
+ You can’t use an endpoint that enables autoscaling.
+ This functionality is only supported for [Real-Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).
+ This functionality doesn’t support [Real-Time Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html).

**Warning**  
We strongly recommend that you don't run an Inference Recommender job on a production endpoint that handles live traffic. The synthetic load during benchmarking can affect your production endpoint and cause throttling or provide inaccurate benchmark results. We recommend that you use a non-production or developer endpoint for comparison purposes. 

The following sections demonstrate how to use Amazon SageMaker Inference Recommender to create an inference recommendation for an existing endpoint based on your model type using the AWS SDK for Python (Boto3) and the AWS CLI.

**Note**  
Before you create an Inference Recommender recommendation job, make sure you have satisfied the [Prerequisites for using Amazon SageMaker Inference Recommender](inference-recommender-prerequisites.md).

## Prerequisites
<a name="inference-recommender-existing-endpoint-prerequisites"></a>

If you don’t already have a SageMaker AI Inference endpoint, you can either [get an inference recommendation](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-instance-recommendation.html) without an endpoint, or you can create a Real-Time Inference endpoint by following the instructions in [Create your endpoint and deploy your model](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html).

## Create an inference recommendation job for an existing endpoint
<a name="inference-recommender-existing-endpoint-create"></a>

Create an inference recommendation programmatically using the AWS SDK for Python (Boto3) or the AWS CLI. Specify a job name for your inference recommendation, the name of an existing SageMaker AI Inference endpoint, an AWS IAM role ARN, an input configuration, and the ARN of the model package from when you registered your model with the model registry.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html) API to get an inference recommendation. Set the `JobType` field to `'Default'` for inference recommendation jobs. In addition, provide the following:
+ Provide a name for your Inference Recommender recommendation job for the `JobName` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ The ARN of the versioned model package you created when you registered your model with the model registry. Define this for `ModelPackageVersionArn` in the `InputConfig` field.
+ Provide the name of an existing SageMaker AI Inference endpoint that you want to benchmark in Inference Recommender for `Endpoints` in the `InputConfig` field.

Import the AWS SDK for Python (Boto3) package and create a SageMaker AI client object using the client class. If you followed the steps in the **Prerequisites** section, the model package ARN was stored in a variable named `model_package_arn`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<region>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide your model package ARN that was created when you registered your 
# model with Model Registry 
model_package_arn = '<model-package-arn>'

# Provide a unique job name for SageMaker Inference Recommender job
job_name = '<job-name>'

# Inference Recommender job type. Set to Default to get an initial recommendation
job_type = 'Default'

# Provide an IAM Role that gives SageMaker Inference Recommender permission to 
# access AWS services
role_arn = '<arn:aws:iam::<account>:role/*>'
                                    
# Provide the name of the existing endpoint that you want to benchmark in Inference Recommender
endpoint_name = '<existing-endpoint-name>'

sagemaker_client.create_inference_recommendations_job(
    JobName = job_name,
    JobType = job_type,
    RoleArn = role_arn,
    InputConfig = {
        'ModelPackageVersionArn': model_package_arn,
        'Endpoints': [{'EndpointName': endpoint_name}]
    }
)
```

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceRecommendationsJob.html).

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` command to get an inference recommendation. Set the `job-type` field to `Default` for inference recommendation jobs. In addition, provide the following:
+ Provide a name for your Inference Recommender recommendation job for the `job-name` field. The Inference Recommender job name must be unique within the AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Amazon SageMaker Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ The ARN of the versioned model package you created when you registered your model with Model Registry. Define this for `ModelPackageVersionArn` in the `input-config` field.
+ Provide the name of an existing SageMaker AI Inference endpoint that you want to benchmark in Inference Recommender for `Endpoints` in the `input-config` field.

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job_name> \
    --job-type Default \
    --role-arn arn:aws:iam::<account>:role/* \
    --input-config "{
        \"ModelPackageVersionArn\": \"<model-package-arn>\",
        \"Endpoints\": [{\"EndpointName\": \"<endpoint_name>\"}]
        }"
```

------

## Get your inference recommendation job results
<a name="inference-recommender-existing-endpoint-results"></a>

You can collect the results of your inference recommendation job programmatically with the same procedure for standard inference recommendation jobs. For more information, see [Get your inference recommendation job results](instance-recommendation-results.md).

When you get inference recommendation job results for an existing endpoint, you should receive a JSON response similar to the following:

```
{
    "JobName": "job-name",
    "JobType": "Default",
    "JobArn": "arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id",
    "RoleArn": "iam-role-arn",
    "Status": "COMPLETED",
    "CreationTime": 1664922919.2,
    "LastModifiedTime": 1664924208.291,
    "InputConfig": {
        "ModelPackageVersionArn": "arn:aws:sagemaker:region:account-id:model-package/resource-id",
        "Endpoints": [
            {
                "EndpointName": "endpoint-name"
            }
        ]
    },
    "InferenceRecommendations": [
        {
            "Metrics": {
                "CostPerHour": 0.7360000014305115,
                "CostPerInference": 7.456940238625975e-06,
                "MaxInvocations": 1645,
                "ModelLatency": 171
            },
            "EndpointConfiguration": {
                "EndpointName": "sm-endpoint-name",
                "VariantName": "variant-name",
                "InstanceType": "ml.g4dn.xlarge",
                "InitialInstanceCount": 1
            },
            "ModelConfiguration": {
                "EnvironmentParameters": [
                    {
                        "Key": "TS_DEFAULT_WORKERS_PER_MODEL",
                        "ValueType": "string",
                        "Value": "4"
                    }
                ]
            }
        }
    ],
    "EndpointPerformances": [
        {
            "Metrics": {
                "MaxInvocations": 184,
                "ModelLatency": 1312
            },
            "EndpointConfiguration": {
                "EndpointName": "endpoint-name"
            }
        }
    ]
}
```

The first few lines provide information about the inference recommendation job itself. This includes the job name, role ARN, and creation and latest modification times.

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) in US dollars for your real-time endpoint, the expected maximum number of `InvokeEndpoint` requests per minute sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in milliseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container.

The `EndpointPerformances` nested dictionary contains the name of your existing endpoint on which the recommendation job was run (`EndpointName`) and the performance metrics for your endpoint (`MaxInvocations` and `ModelLatency`).
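
To see how a recommendation stacks up against your existing endpoint, you can read both sets of metrics out of the job results. The following is a sketch that assumes a `response` dictionary shaped like the example above (the values are the illustrative ones from that example):

```python
# Compare the benchmarked performance of the existing endpoint against the
# top recommendation, using a response shaped like the example above.
response = {
    "InferenceRecommendations": [{
        "Metrics": {"MaxInvocations": 1645, "ModelLatency": 171},
        "EndpointConfiguration": {"InstanceType": "ml.g4dn.xlarge"},
    }],
    "EndpointPerformances": [{
        "Metrics": {"MaxInvocations": 184, "ModelLatency": 1312},
        "EndpointConfiguration": {"EndpointName": "endpoint-name"},
    }],
}

recommended = response["InferenceRecommendations"][0]["Metrics"]
existing = response["EndpointPerformances"][0]["Metrics"]

# Higher MaxInvocations and lower ModelLatency are better.
throughput_gain = recommended["MaxInvocations"] / existing["MaxInvocations"]
latency_ratio = recommended["ModelLatency"] / existing["ModelLatency"]

print(f"Throughput gain: {throughput_gain:.1f}x")  # 8.9x
print(f"Latency ratio: {latency_ratio:.2f}")       # 0.13
```

A ratio like this can help you decide whether migrating to the recommended configuration is worth the change.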

# Stop your inference recommendation
<a name="instance-recommendation-stop"></a>

You might want to stop a job that is currently running if you began a job by mistake or no longer need to run the job. Stop your Inference Recommender inference recommendation jobs programmatically with the `StopInferenceRecommendationsJob` API or with Studio Classic.

------
#### [ AWS SDK for Python (Boto3) ]

Specify the name of the inference recommendation job for the `JobName` field:

```
sagemaker_client.stop_inference_recommendations_job(
    JobName='<INSERT>'
)
```

------
#### [ AWS CLI ]

Specify the job name of the inference recommendation job for the `job-name` flag:

```
aws sagemaker stop-inference-recommendations-job --job-name <job-name>
```

------
#### [ Amazon SageMaker Studio Classic ]

Close the tab in which you initiated the inference recommendation to stop your Inference Recommender inference recommendation.

------
#### [ SageMaker AI console ]

To stop your instance recommendation job through the SageMaker AI console, do the following:



1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, select your instance recommendation job.

1. Choose **Stop job**.

1. In the dialog box that pops up, choose **Confirm**.

After stopping your job, the job’s **Status** should change to **Stopping**.

------

# Compiled recommendations with Neo
<a name="inference-recommender-neo-compilation"></a>

In Inference Recommender, you can compile your model with Neo and get endpoint recommendations for your compiled model. [SageMaker Neo](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html) is a service that can optimize your model for a target hardware platform (that is, a specific instance type or environment). Optimizing a model with Neo might improve the performance of your hosted model.

For Neo-supported frameworks and containers, Inference Recommender automatically suggests Neo-optimized recommendations. To be eligible for Neo compilation, your input must meet the following prerequisites:
+ You are using a SageMaker AI owned [DLC](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html) or XGBoost container.
+ You are using a framework version supported by Neo. For the framework versions supported by Neo, see [Cloud Instances](neo-supported-cloud.md#neo-supported-cloud-instances) in the SageMaker Neo documentation.
+ Neo requires that you provide a correct input data shape for your model. You can specify this data shape as the [`DataInputConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelInput.html#sagemaker-Type-ModelInput-DataInputConfig) in the [`InferenceSpecification`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html#sagemaker-CreateModelPackage-request-InferenceSpecification) when you create a model package. For information about the correct data shapes for each framework, see [Prepare Model for Compilation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-compilation-preparing-model.html) in the SageMaker Neo documentation.

  The following example shows how to specify the `DataInputConfig` field in the `InferenceSpecification`, where `data_input_configuration` is a variable that contains the data shape in dictionary format (for example, `{'input':[1,1024,1024,3]}`).

  ```
  "InferenceSpecification": {
          "Containers": [
              {
                  "Image": dlc_uri,
                  "Framework": framework.upper(),
                  "FrameworkVersion": framework_version,
                  "NearestModelName": model_name,
                  "ModelInput": {"DataInputConfig": data_input_configuration},
              }
          ],
          "SupportedContentTypes": input_mime_types,  # required, must be non-null
          "SupportedResponseMIMETypes": [],
          "SupportedRealtimeInferenceInstanceTypes": supported_realtime_inference_types,  # optional
      }
  ```
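
As a sketch of how you might build the `data_input_configuration` variable from the example above, you can serialize the shape dictionary to a JSON-formatted string (the shape values here are illustrative, taken from the example):

```python
import json

# Serialize the input data shape to a JSON-formatted string for use as the
# DataInputConfig value (shape values are illustrative).
input_shape = {"input": [1, 1024, 1024, 3]}
data_input_configuration = json.dumps(input_shape)

print(data_input_configuration)  # {"input": [1, 1024, 1024, 3]}
```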

If these conditions are met in your request, then Inference Recommender runs scenarios for both compiled and uncompiled versions of your model, giving you multiple recommendation combinations to choose from. You can compare the configurations for compiled and uncompiled versions of the same inference recommendation and determine which one best suits your use case. The recommendations are ranked by cost per inference.

To get the Neo compilation recommendations, you don’t have to do any additional configuration besides making sure that your input meets the preceding requirements. Inference Recommender automatically runs Neo compilation on your model if your input meets the requirements, and you receive a response that includes Neo recommendations.

If you run into errors during your Neo compilation, see [Troubleshoot Neo Compilation Errors](neo-troubleshooting-compilation.md).

The following table is an example of a response you might get from an Inference Recommender job that includes recommendations for compiled models. If the `InferenceSpecificationName` field is `None`, then the recommendation is an uncompiled model. The last row, in which the value for the **InferenceSpecificationName** field is `neo-00011122-2333-4445-5566-677788899900`, is for a model compiled with Neo. The value in the field is the name of the Neo job used to compile and optimize your model.


| EndpointName | InstanceType | InitialInstanceCount | EnvironmentParameters | CostPerHour | CostPerInference | MaxInvocations | ModelLatency | InferenceSpecificationName | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| sm-epc-example-000111222 | ml.c5.9xlarge | 1 | [] | 1.836 | 9.15E-07 | 33456 | 7 | None | 
| sm-epc-example-111222333 | ml.c5.2xlarge | 1 | [] | 0.408 | 2.11E-07 | 32211 | 21 | None | 
| sm-epc-example-222333444 | ml.c5.xlarge | 1 | [] | 0.204 | 1.86E-07 | 18276 | 92 | None | 
| sm-epc-example-333444555 | ml.c5.xlarge | 1 | [] | 0.204 | 1.60E-07 | 21286 | 42 | neo-00011122-2333-4445-5566-677788899900 | 

## Get started
<a name="inference-recommender-neo-compilation-get-started"></a>

The general steps for creating an Inference Recommender job that includes Neo-optimized recommendations are as follows:
+ Prepare your ML model for compilation. For more information, see [Prepare Model for Compilation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-compilation-preparing-model.html) in the Neo documentation.
+ Package your model in a model archive (`.tar.gz` file).
+ Create a sample payload archive.
+ Register your model in SageMaker Model Registry.
+ Create an Inference Recommender job.
+ View the results of the Inference Recommender job and choose a configuration.
+ Debug compilation failures, if any. For more information, see [Troubleshoot Neo Compilation Errors](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html).

For an example that demonstrates the previous workflow and how to get Neo-optimized recommendations using XGBoost, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/xgboost/xgboost-inference-recommender.ipynb). For an example that shows how to get Neo-optimized recommendations using TensorFlow, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/inference-recommender.ipynb).

# Recommendation results
<a name="inference-recommender-interpret-results"></a>

Each Inference Recommender job result includes `InstanceType`, `InitialInstanceCount`, and `EnvironmentParameters`, which are tuned environment variable parameters for your container to improve its latency and throughput. The results also include performance and cost metrics such as `MaxInvocations`, `ModelLatency`, `CostPerHour`, `CostPerInference`, `CpuUtilization`, and `MemoryUtilization`.

The following table describes these metrics. They can help you narrow down your search for the endpoint configuration that best suits your use case. For example, if your motivation is overall price performance with an emphasis on throughput, then you should focus on `CostPerInference`.


| Metric | Description | Use case | 
| --- | --- | --- | 
|  `ModelLatency`  |  The interval of time taken by a model to respond as viewed from SageMaker AI. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. Units: Milliseconds  | Latency sensitive workloads such as ad serving and medical diagnosis | 
|  `MaxInvocations`  |  The maximum number of `InvokeEndpoint` requests sent to a model endpoint in a minute. Units: None  | Throughput-focused workloads such as video processing or batch inference | 
|  `CostPerHour`  |  The estimated cost per hour for your real-time endpoint. Units: US Dollars  | Cost sensitive workloads with no latency deadlines | 
|  `CostPerInference`  |  The estimated cost per inference call for your real-time endpoint. Units: US Dollars  | Maximize overall price performance with a focus on throughput | 
|  `CpuUtilization`  |  The expected CPU utilization at maximum invocations per minute for the endpoint instance. Units: Percent  | Understand instance health during benchmarking by having visibility into core CPU utilization of the instance | 
|  `MemoryUtilization`  |  The expected memory utilization at maximum invocations per minute for the endpoint instance. Units: Percent  | Understand instance health during benchmarking by having visibility into core memory utilization of the instance | 
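
As an example of narrowing down results with these metrics, the following sketch filters recommendations to those that meet a latency budget and then picks the cheapest per inference. The field names mirror the results format above; the data and the latency threshold are illustrative:

```python
# Pick the cheapest recommendation that satisfies a latency budget.
# Field names mirror Inference Recommender results; values are illustrative.
recommendations = [
    {"InstanceType": "ml.c5.9xlarge", "ModelLatency": 7, "CostPerInference": 9.15e-07},
    {"InstanceType": "ml.c5.2xlarge", "ModelLatency": 21, "CostPerInference": 2.11e-07},
    {"InstanceType": "ml.c5.xlarge", "ModelLatency": 92, "CostPerInference": 1.86e-07},
]

latency_budget_ms = 50  # illustrative SLA in milliseconds

eligible = [r for r in recommendations if r["ModelLatency"] <= latency_budget_ms]
best = min(eligible, key=lambda r: r["CostPerInference"])

print(best["InstanceType"])  # ml.c5.2xlarge
```

Note that the cheapest instance overall (`ml.c5.xlarge`) is excluded here because it misses the latency budget, which is exactly the kind of trade-off these metrics are meant to surface.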

In some cases you might want to explore other [SageMaker AI Endpoint Invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation) such as `CPUUtilization`. Every Inference Recommender job result includes the names of endpoints spun up during the load test. You can use CloudWatch to review the logs for these endpoints even after they’ve been deleted.

The following image is an example of CloudWatch metrics and charts you can review for a single endpoint from your recommendation result. This recommendation result is from a Default job. The scalar values in the recommendation results are taken from the point in time when the Invocations graph first begins to level out. For example, the `ModelLatency` value reported is from the beginning of the plateau, around `03:00:31`.

![\[Charts for CloudWatch metrics.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference-recommender-cw-metrics.png)


For full descriptions of the CloudWatch metrics used in the preceding charts, see [SageMaker AI Endpoint Invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation).

You can also see performance metrics like `ClientInvocations` and `NumberOfUsers` published by Inference Recommender in the `/aws/sagemaker/InferenceRecommendationsJobs` namespace. For a full list of metrics and descriptions published by Inference Recommender, see [SageMaker Inference Recommender jobs metrics](monitoring-cloudwatch.md#cloudwatch-metrics-inference-recommender).

See the [Amazon SageMaker Inference Recommender - CloudWatch Metrics](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-inference-recommender/tensorflow-cloudwatch/tf-cloudwatch-inference-recommender.ipynb) Jupyter notebook in the [amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) Github repository for an example of how to use the AWS SDK for Python (Boto3) to explore CloudWatch metrics for your endpoints.

# Get autoscaling policy recommendations
<a name="inference-recommender-autoscaling"></a>

With Amazon SageMaker Inference Recommender, you can get recommendations for autoscaling policies for your SageMaker AI endpoint based on your anticipated traffic pattern. If you’ve already completed an inference recommendation job, you can provide the details of the job to get a recommendation for an autoscaling policy that you can apply to your endpoint.

Inference Recommender benchmarks different values for each metric to determine the ideal autoscaling configuration for your endpoint. The autoscaling recommendation returns a recommended autoscaling policy for each metric that was defined in your inference recommendation job. You can save the policies and apply them to your endpoint with the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API.

To get started, review the following prerequisites.

## Prerequisites
<a name="inference-recommender-autoscaling-prereqs"></a>

Before you begin, you must have completed a successful inference recommendation job. In the following section, you can provide either an inference recommendation ID or the name of a SageMaker AI endpoint that was benchmarked during an inference recommendation job.

To retrieve your recommendation job ID or endpoint name, you can either view the details of your inference recommendation job in the SageMaker AI console, or you can use the `RecommendationId` or `EndpointName` fields returned by the [DescribeInferenceRecommendationsJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeInferenceRecommendationsJob.html) API.

## Create an autoscaling configuration recommendation
<a name="inference-recommender-autoscaling-create"></a>

To create an autoscaling recommendation policy, you can use the AWS SDK for Python (Boto3).

The following example shows the fields for the [GetScalingConfigurationRecommendation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_GetScalingConfigurationRecommendation.html) API. Use the following fields when you call the API:
+ `InferenceRecommendationsJobName` – Enter the name of your inference recommendation job.
+ `RecommendationId` – Enter the ID of an inference recommendation from a recommendation job. This is optional if you’ve specified the `EndpointName` field.
+ `EndpointName` – Enter the name of an endpoint that was benchmarked during an inference recommendation job. This is optional if you’ve specified the `RecommendationId` field.
+ `TargetCpuUtilizationPerCore` – (Optional) Enter a percentage value of how much utilization you want an instance on your endpoint to use before autoscaling. The default value if you don’t specify this field is 50%.
+ `ScalingPolicyObjective` – (Optional) An object where you specify your anticipated traffic pattern.
  + `MinInvocationsPerMinute` – (Optional) The minimum number of expected requests to your endpoint per minute.
  + `MaxInvocationsPerMinute` – (Optional) The maximum number of expected requests to your endpoint per minute.

```
{
    "InferenceRecommendationsJobName": "string", // Required
    "RecommendationId": "string", // Optional, provide one of RecommendationId or EndpointName
    "EndpointName": "string", // Optional, provide one of RecommendationId or EndpointName
    "TargetCpuUtilizationPerCore": number, // Optional
    "ScalingPolicyObjective": { // Optional
        "MinInvocationsPerMinute": number,
        "MaxInvocationsPerMinute": number
    }
}
```

After submitting your request, you’ll receive a response with autoscaling policies defined for each metric. See the following section for information about interpreting the response.

## Review your autoscaling configuration recommendation results
<a name="inference-recommender-autoscaling-review"></a>

The following example shows the response from the [GetScalingConfigurationRecommendation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_GetScalingConfigurationRecommendation.html) API:

```
{
    "InferenceRecommendationsJobName": "string", 
    "RecommendationId": "string", // One of RecommendationId or EndpointName is shown
    "EndpointName": "string", 
    "TargetUtilizationPercentage": Integer,
    "ScalingPolicyObjective": { 
        "MinInvocationsPerMinute": Integer, 
        "MaxInvocationsPerMinute": Integer
    },
    "Metric": {
        "ModelLatency": Integer,
        "InvocationsPerInstance": Integer
    },
    "DynamicScalingConfiguration": {
        "MinCapacity": number,
        "MaxCapacity": number, 
        "ScaleInCooldown": number,
        "ScaleOutCooldown": number,
        "ScalingPolicies": [
            {
                "TargetTracking": {
                    "MetricSpecification": {
                        "Predefined": {
                            "PredefinedMetricType": "string"
                        },
                        "Customized": {
                            "MetricName": "string",
                            "Namespace": "string",
                            "Statistic": "string"
                        }
                    },
                    "TargetValue": Double
                } 
            }
        ]
    }
}
```

The `InferenceRecommendationsJobName`, `RecommendationId` or `EndpointName`, `TargetCpuUtilizationPerCore`, and `ScalingPolicyObjective` fields are copied from your initial request.

The `Metric` object lists the metrics that were benchmarked in your inference recommendation job, along with the values each metric is anticipated to have when the instance utilization matches the `TargetCpuUtilizationPerCore` value. This is useful for anticipating the performance metrics on your endpoint when it scales in and out with the recommended autoscaling policy. For example, suppose your instance utilization was 50% in your inference recommendation job and your `InvocationsPerInstance` value was originally `4`. If you specify a `TargetCpuUtilizationPerCore` value of 100% in your autoscaling recommendation request, then the `InvocationsPerInstance` metric value returned in the response is `8`, because you anticipate allocating twice as much instance utilization, so each instance can handle twice as many invocations.

The `DynamicScalingConfiguration` object returns the values that you should specify for the [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html#autoscaling-PutScalingPolicy-request-TargetTrackingScalingPolicyConfiguration) when you call the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API. This includes the recommended minimum and maximum capacity values, the recommended scale in and scale out cooldown times, and the `ScalingPolicies` object, which contains the recommended `TargetValue` you should specify for each metric.
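The response fields map directly onto the Application Auto Scaling APIs. As a minimal sketch, assuming a response whose `MetricSpecification` contains a `Predefined` metric (the helper and policy names below are our own, not part of the API), you could translate and apply the recommendation like this:

```python
def to_target_tracking_config(dynamic_config):
    """Map the DynamicScalingConfiguration object from a
    GetScalingConfigurationRecommendation response onto the
    TargetTrackingScalingPolicyConfiguration shape that
    PutScalingPolicy expects."""
    target = dynamic_config["ScalingPolicies"][0]["TargetTracking"]
    return {
        "TargetValue": target["TargetValue"],
        "PredefinedMetricSpecification": target["MetricSpecification"]["Predefined"],
        "ScaleInCooldown": dynamic_config["ScaleInCooldown"],
        "ScaleOutCooldown": dynamic_config["ScaleOutCooldown"],
    }

def apply_recommendation(endpoint_name, variant_name, dynamic_config):
    """Register the endpoint variant as a scalable target with the
    recommended capacity bounds, then attach the recommended
    target-tracking policy. Requires boto3 and AWS credentials."""
    import boto3
    autoscaling = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=dynamic_config["MinCapacity"],
        MaxCapacity=dynamic_config["MaxCapacity"],
    )
    autoscaling.put_scaling_policy(
        PolicyName=f"{endpoint_name}-recommended-policy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=to_target_tracking_config(dynamic_config),
    )
```

If the response instead returns a `Customized` metric, you would pass a `CustomizedMetricSpecification` in place of the predefined one.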

# Run a custom load test
<a name="inference-recommender-load-test"></a>

Amazon SageMaker Inference Recommender load tests conduct extensive benchmarks based on production requirements for latency and throughput, custom traffic patterns, and either serverless endpoints or real-time instances (up to 10) that you select.

The following sections demonstrate how to create, describe, and stop a load test programmatically using the AWS SDK for Python (Boto3) and the AWS CLI, or interactively using Amazon SageMaker Studio Classic or the SageMaker AI console.

## Create a load test job
<a name="load-test-create"></a>

Create a load test programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using Studio Classic or the SageMaker AI console. As with Inference Recommender inference recommendations, specify a job name for your load test, an AWS IAM role ARN, an input configuration, and your model package ARN from when you registered your model with the model registry. Load tests require that you also specify a traffic pattern and stopping conditions.

------
#### [ AWS SDK for Python (Boto3) ]

Use the `CreateInferenceRecommendationsJob` API to create an Inference Recommender load test. Specify `Advanced` for the `JobType` field and provide: 
+ A job name for your load test (`JobName`). The job name must be unique within your AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `RoleArn` field.
+ An endpoint configuration dictionary (`InputConfig`) where you specify the following:
  + For `TrafficPattern`, specify either the phases or stairs traffic pattern. With the phases traffic pattern, new users spawn every minute at the rate you specify. With the stairs traffic pattern, new users spawn at timed intervals (or *steps*) at a rate you specify. Choose one of the following:
    + For `TrafficType`, specify `PHASES`. Then, for the `Phases` array, specify the `InitialNumberOfUsers` (how many concurrent users to start with, with a minimum of 1 and a maximum of 3), `SpawnRate` (the number of users to be spawned in a minute for a specific phase of load testing, with a minimum of 0 and maximum of 3), and `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600).
    + For `TrafficType`, specify `STAIRS`. Then, for the `Stairs` array, specify the `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600), `NumberOfSteps` (how many intervals are used during the phase), and `UsersPerStep` (how many users are added during each interval). Note that the length of each step is the value of `DurationInSeconds / NumberOfSteps`. For example, if your `DurationInSeconds` is `600` and you specify `5` steps, then each step is 120 seconds long.
**Note**  
A user is defined as a system-generated actor that runs in a loop and invokes requests to an endpoint as part of Inference Recommender. For a typical XGBoost container running on an `ml.c5.large` instance, endpoints can reach 30,000 invocations per minute (500 tps) with just 15-20 users.
  + For `ResourceLimit`, specify `MaxNumberOfTests` (the maximum number of benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10) and `MaxParallelOfTests` (the maximum number of parallel benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10).
  + For `EndpointConfigurations`, you can specify one of the following:
    + The `InstanceType` field, where you specify the instance type on which you want to run your load tests.
    + The `ServerlessConfig`, in which you specify your ideal values for `MaxConcurrency` and `MemorySizeInMB` for a serverless endpoint. For more information, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
+ A stopping conditions dictionary (`StoppingConditions`), where if any of the conditions are met, the Inference Recommender job stops. For this example, specify the following fields in the dictionary:
  + For `MaxInvocations`, specify the maximum number of requests per minute expected for the endpoint, with a minimum of 1 and a maximum of 30,000.
  + For `ModelLatencyThresholds`, specify `Percentile` (the model latency percentile threshold) and `ValueInMilliseconds` (the model latency percentile value in milliseconds).
  + (Optional) For `FlatInvocations`, you can specify whether to continue the load test when the throughput (invocations per minute) flattens. A flattened invocation rate usually means that the endpoint has reached capacity. However, you might want to continue monitoring the endpoint under full capacity conditions. To continue the load test when this happens, specify this value as `Continue`. Otherwise, the default value is `Stop`.

```
# Create a low-level SageMaker service client.
import boto3
aws_region="<INSERT>"
sagemaker_client=boto3.client('sagemaker', region_name=aws_region)
                
# Provide a name to your recommendation based on load testing
load_test_job_name="<INSERT>"

# Provide the name of the sagemaker instance type
instance_type="<INSERT>"

# Provide the IAM Role that gives SageMaker permission to access AWS services 
role_arn='arn:aws:iam::<account>:role/*'

# Provide your model package ARN that was created when you registered your 
# model with Model Registry
model_package_arn='arn:aws:sagemaker:<region>:<account>:model-package/<model-package-name>'

sagemaker_client.create_inference_recommendations_job(
                        JobName=load_test_job_name,
                        JobType="Advanced",
                        RoleArn=role_arn,
                        InputConfig={
                            'ModelPackageVersionArn': model_package_arn,
                            "JobDurationInSeconds": 7200,
                            'TrafficPattern' : {
                                # Replace PHASES with STAIRS to use the stairs traffic pattern
                                'TrafficType': 'PHASES',
                                'Phases': [
                                    {
                                        'InitialNumberOfUsers': 1,
                                        'SpawnRate': 1,
                                        'DurationInSeconds': 120
                                    },
                                    {
                                        'InitialNumberOfUsers': 1,
                                        'SpawnRate': 1,
                                        'DurationInSeconds': 120
                                    }
                                ]
                                # Uncomment this section and comment out the Phases object above to use the stairs traffic pattern
                                # 'Stairs' : {
                                #   'DurationInSeconds': 240,
                                #   'NumberOfSteps': 2,
                                #   'UsersPerStep': 2
                                # }
                            },
                            'ResourceLimit': {
                                        'MaxNumberOfTests': 10,
                                        'MaxParallelOfTests': 3
                                },
                            "EndpointConfigurations" : [{
                                        'InstanceType': 'ml.c5.xlarge'
                                    },
                                    {
                                        'InstanceType': 'ml.m5.xlarge'
                                    },
                                    {
                                        'InstanceType': 'ml.r5.xlarge'
                                    }]
                                    # Uncomment the ServerlessConfig and comment out the InstanceType field if you want recommendations for a serverless endpoint
                                    # "ServerlessConfig": {
                                    #     "MaxConcurrency": value, 
                                    #     "MemorySizeInMB": value 
                                    # }
                        },
                        StoppingConditions={
                            'MaxInvocations': 1000,
                            'ModelLatencyThresholds':[{
                                'Percentile': 'P95', 
                                'ValueInMilliseconds': 100
                            }],
                            # Change 'Stop' to 'Continue' to let the load test continue if invocations flatten 
                            'FlatInvocations': 'Stop'
                        }
                )
```

See the [Amazon SageMaker API Reference Guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html) for a full list of optional and required arguments you can pass to `CreateInferenceRecommendationsJob`.
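Because a load test can run for up to the `JobDurationInSeconds` you specified, it can help to poll the job until it finishes before retrieving results. The following sketch (the helper name is our own) polls `DescribeInferenceRecommendationsJob` until the job reaches a terminal status:

```python
import time

def wait_for_load_test(sagemaker_client, job_name, poll_seconds=60):
    """Poll DescribeInferenceRecommendationsJob until the load test
    reaches a terminal status, then return that status."""
    terminal = {"COMPLETED", "FAILED", "STOPPED"}
    while True:
        status = sagemaker_client.describe_inference_recommendations_job(
            JobName=job_name
        )["Status"]
        if status in terminal:
            return status
        time.sleep(poll_seconds)
```

For example, `wait_for_load_test(sagemaker_client, load_test_job_name)` blocks until the job completes, fails, or is stopped.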

------
#### [ AWS CLI ]

Use the `create-inference-recommendations-job` API to create an Inference Recommender load test. Specify `Advanced` for the `JobType` field and provide: 
+ A job name for your load test (`job-name`). The job name must be unique within your AWS Region and within your AWS account.
+ The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf. Define this for the `role-arn` field.
+ An endpoint configuration dictionary (`input-config`) where you specify the following:
  + For `TrafficPattern`, specify either the phases or stairs traffic pattern. With the phases traffic pattern, new users spawn every minute at the rate you specify. With the stairs traffic pattern, new users spawn at timed intervals (or *steps*) at a rate you specify. Choose one of the following:
    + For `TrafficType`, specify `PHASES`. Then, for the `Phases` array, specify the `InitialNumberOfUsers` (how many concurrent users to start with, with a minimum of 1 and a maximum of 3), `SpawnRate` (the number of users to be spawned in a minute for a specific phase of load testing, with a minimum of 0 and maximum of 3), and `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600).
    + For `TrafficType`, specify `STAIRS`. Then, for the `Stairs` array, specify the `DurationInSeconds` (how long the traffic phase should be, with a minimum of 120 and maximum of 3600), `NumberOfSteps` (how many intervals are used during the phase), and `UsersPerStep` (how many users are added during each interval). Note that the length of each step is the value of `DurationInSeconds / NumberOfSteps`. For example, if your `DurationInSeconds` is `600` and you specify `5` steps, then each step is 120 seconds long.
**Note**  
A user is defined as a system-generated actor that runs in a loop and invokes requests to an endpoint as part of Inference Recommender. For a typical XGBoost container running on an `ml.c5.large` instance, endpoints can reach 30,000 invocations per minute (500 tps) with just 15-20 users.
  + For `ResourceLimit`, specify `MaxNumberOfTests` (the maximum number of benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10) and `MaxParallelOfTests` (the maximum number of parallel benchmarking load tests for an Inference Recommender job, with a minimum of 1 and a maximum of 10).
  + For `EndpointConfigurations`, you can specify one of the following:
    + The `InstanceType` field, where you specify the instance type on which you want to run your load tests.
    + The `ServerlessConfig`, in which you specify your ideal values for `MaxConcurrency` and `MemorySizeInMB` for a serverless endpoint.
+ A stopping conditions dictionary (`stopping-conditions`), where if any of the conditions are met, the Inference Recommender job stops. For this example, specify the following fields in the dictionary:
  + For `MaxInvocations`, specify the maximum number of requests per minute expected for the endpoint, with a minimum of 1 and a maximum of 30,000.
  + For `ModelLatencyThresholds`, specify `Percentile` (the model latency percentile threshold) and `ValueInMilliseconds` (the model latency percentile value in milliseconds).
  + (Optional) For `FlatInvocations`, you can specify whether to continue the load test when the throughput (invocations per minute) flattens. A flattened invocation rate usually means that the endpoint has reached capacity. However, you might want to continue monitoring the endpoint under full capacity conditions. To continue the load test when this happens, specify this value as `Continue`. Otherwise, the default value is `Stop`.

```
aws sagemaker create-inference-recommendations-job \
    --region <region> \
    --job-name <job-name> \
    --job-type ADVANCED \
    --role-arn arn:aws:iam::<account>:role/* \
    --input-config '{
        "ModelPackageVersionArn": "arn:aws:sagemaker:<region>:<account>:model-package/<model-package-name>",
        "JobDurationInSeconds": 7200,
        "TrafficPattern": {
            "TrafficType": "PHASES",
            "Phases": [
                {
                    "InitialNumberOfUsers": 1,
                    "SpawnRate": 1,
                    "DurationInSeconds": 300
                }
            ]
        },
        "ResourceLimit": {
            "MaxNumberOfTests": 10,
            "MaxParallelOfTests": 3
        },
        "EndpointConfigurations": [
            {
                "InstanceType": "ml.c5.xlarge"
            },
            {
                "InstanceType": "ml.m5.xlarge"
            },
            {
                "InstanceType": "ml.r5.xlarge"
            }
        ]
    }' \
    --stopping-conditions '{
        "MaxInvocations": 1000,
        "ModelLatencyThresholds": [
            {
                "Percentile": "P95",
                "ValueInMilliseconds": 100
            }
        ],
        "FlatInvocations": "Stop"
    }'
```

To use the stairs traffic pattern instead, specify `STAIRS` for `TrafficType` and replace the `Phases` array with a `Stairs` object, such as `{"DurationInSeconds": 240, "NumberOfSteps": 2, "UsersPerStep": 2}`. For recommendations for a serverless endpoint, replace the `InstanceType` entries in `EndpointConfigurations` with a `ServerlessConfig` object that specifies your `MaxConcurrency` and `MemorySizeInMB` values. To let the load test continue when invocations flatten, change `FlatInvocations` to `Continue`.

------
#### [ Amazon SageMaker Studio Classic ]

Create a load test with Studio Classic.

1. In your Studio Classic application, choose the home icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/house.png)).

1. In the left sidebar of Studio Classic, choose **Deployments**.

1. Choose **Inference recommender** from the dropdown list.

1. Choose **Create inference recommender job**. A new tab titled **Create inference recommender job** opens.

1. Select the name of your model group from the dropdown **Model group** field. The list includes all the model groups registered with the model registry in your account, including models registered outside of Studio Classic.

1. Select a model version from the dropdown **Model version** field.

1. Choose **Continue**.

1. Provide a name for the job in the **Name** field.

1. (Optional) Provide a description of your job in the **Description** field.

1. Choose an IAM role that grants Inference Recommender permission to access AWS services. You can create a role and attach the `AmazonSageMakerFullAccess` IAM managed policy to accomplish this, or you can let Studio Classic create a role for you.

1. Choose **Stopping Conditions** to expand the available input fields. Provide a set of conditions for stopping a deployment recommendation. 

   1. Specify the maximum number of requests per minute expected for the endpoint in the **Max Invocations Per Minute** field.

   1. Specify the model latency threshold in microseconds in the **Model Latency Threshold** field. The **Model Latency Threshold** depicts the interval of time taken by a model to respond as viewed from Inference Recommender. The interval includes the local communication time taken to send the request and to fetch the response from the model container and the time taken to complete the inference in the container.

1. Choose **Traffic Pattern** to expand the available input fields.

   1. Set the initial number of virtual users by specifying an integer in the **Initial Number of Users** field.

   1. Provide an integer number for the **Spawn Rate** field. The spawn rate sets the number of users created per second.

   1. Set the duration for the phase in seconds by specifying an integer in the **Duration** field.

   1. (Optional) Add additional traffic patterns. To do so, choose **Add**.

1. Choose the **Additional** setting to reveal the **Max test duration** field. Specify, in seconds, the maximum time a test can take during a job. New jobs are not scheduled after the defined duration. This helps ensure jobs that are in progress are not stopped and that you only view completed jobs.

1. Choose **Continue**.

1. Choose **Selected Instances**.

1. In the **Instances for benchmarking** field, choose **Add instances to test**. Select up to 10 instances for Inference Recommender to use for load testing.

1. Choose **Additional settings**.

   1. Provide an integer that sets an upper limit on the number of tests a job can make in the **Max number of tests** field. Note that each endpoint configuration results in a new load test.

   1. Provide an integer for the **Max parallel tests** field. This setting defines an upper limit on the number of load tests that can run in parallel.

1. Choose **Submit**.

   The load test can take up to 2 hours.
**Warning**  
Do not close this tab. If you close this tab, you cancel the Inference Recommender load test job.

------
#### [ SageMaker AI console ]

Create a custom load test through the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose **Create job**.

1. For **Step 1: Model configuration**, do the following:

   1. For **Job type**, choose **Advanced recommender job**.

   1. If you’re using a model registered in the SageMaker AI model registry, then turn on the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model group** dropdown list, choose the model group in SageMaker AI model registry where your model is.

      1. For the **Model version** dropdown list, choose the desired version of your model.

   1. If you’re using a model that you’ve created in SageMaker AI, then turn off the **Choose a model from the model registry** toggle and do the following:

      1. For the **Model name** field, enter the name of your SageMaker AI model.

   1. For **IAM role**, you can select an existing AWS IAM role that has the necessary permissions to create an instance recommendation job. Alternatively, if you don’t have an existing role, you can choose **Create a new role** to open the role creation pop-up, and SageMaker AI adds the necessary permissions to the new role that you create.

   1. For **S3 bucket for benchmarking payload**, enter the Amazon S3 path to your sample payload archive, which should contain sample payload files that Inference Recommender uses to benchmark your model on different instance types.

   1. For **Payload content type**, enter the MIME types of your sample payload data.

   1. For **Traffic pattern**, configure phases for the load test by doing the following:

      1. For **Initial number of users**, specify how many concurrent users you want to start with (with a minimum of 1 and a maximum of 3).

      1. For **Spawn rate**, specify the number of users to be spawned in a minute for the phase (with a minimum of 0 and a maximum of 3).

      1. For **Duration (seconds)**, specify how long the traffic phase should be in seconds (with a minimum of 120 and a maximum of 3600).

   1. (Optional) If you turned off the **Choose a model from the model registry toggle** and specified a SageMaker AI model, then for **Container configuration**, do the following:

      1. For the **Domain** dropdown list, select the machine learning domain of the model, such as computer vision, natural language processing, or machine learning.

      1. For the **Framework** dropdown list, select the framework of your container, such as TensorFlow or XGBoost.

      1. For **Framework version**, enter the framework version of your container image.

      1. For the **Nearest model name** dropdown list, select the pre-trained model that most closely matches your own.

      1. For the **Task** dropdown list, select the machine learning task that the model accomplishes, such as image classification or regression.

   1. (Optional) For **Model compilation using SageMaker Neo**, you can configure the recommendation job for a model that you’ve compiled using SageMaker Neo. For **Data input configuration**, enter the correct input data shape for your model in a format similar to `{'input':[1,1024,1024,3]}`.

   1. Choose **Next**.

1. For **Step 2: Instances and environment parameters**, do the following:

   1. For **Select instances for benchmarking**, select up to 8 instance types that you want to benchmark against.

   1. (Optional) For **Environment parameter ranges**, you can specify environment parameters that help optimize your model. Specify the parameters as **Key** and **Value** pairs.

   1. Choose **Next**.

1. For **Step 3: Job parameters**, do the following:

   1. (Optional) For the **Job name** field, enter a name for your instance recommendation job. When you create the job, SageMaker AI appends a timestamp to the end of this name.

   1. (Optional) For the **Job description** field, enter a description for the job.

   1. (Optional) For the **Encryption key** dropdown list, choose an AWS KMS key by name or enter its ARN to encrypt your data.

   1. (Optional) For **Max number of tests**, enter the number of tests that you want to run during the recommendation job.

   1. (Optional) For **Max parallel tests**, enter the maximum number of parallel tests that you want to run during the recommendation job.

   1. For **Max test duration (s)**, enter the maximum number of seconds you want each test to run for.

   1. For **Max invocations per minute**, enter the maximum number of requests per minute the endpoint can reach before stopping the recommendation job. After reaching this limit, SageMaker AI ends the job.

   1. For **P99 Model latency threshold (ms)**, enter the model latency threshold at the 99th percentile, in milliseconds.

   1. Choose **Next**.

1. For **Step 4: Review job**, review your configurations and then choose **Submit**.

------

## Get your load test results
<a name="load-test-describe"></a>

You can programmatically collect metrics across all load tests once the load tests are done with AWS SDK for Python (Boto3), the AWS CLI, Studio Classic, or the SageMaker AI console.

------
#### [ AWS SDK for Python (Boto3) ]

Collect metrics with the `DescribeInferenceRecommendationsJob` API. Specify the job name of the load test for the `JobName` field:

```
load_test_response = sagemaker_client.describe_inference_recommendations_job(
                                                        JobName=load_test_job_name
                                                        )
```

Print the response object.

```
print(load_test_response)
```

This returns a JSON response similar to the following example. This example shows the recommended instance types for real-time inference; for serverless inference recommendations, see the example that follows it.

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Advanced', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 19, 38, 30, 957000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 19, 46, 31, 399000, tzinfo=tzlocal()), 
    'InputConfig': {
        'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
        'JobDurationInSeconds': 7200, 
        'TrafficPattern': {
            'TrafficType': 'PHASES'
            }, 
        'ResourceLimit': {
            'MaxNumberOfTests': 100, 
            'MaxParallelOfTests': 100
            }, 
        'EndpointConfigurations': [{
            'InstanceType': 'ml.c5d.xlarge'
            }]
        }, 
    'StoppingConditions': {
        'MaxInvocations': 1000, 
        'ModelLatencyThresholds': [{
            'Percentile': 'P95', 
            'ValueInMilliseconds': 100}
            ]}, 
    'InferenceRecommendations': [{
        'Metrics': {
            'CostPerHour': 0.6899999976158142, 
            'CostPerInference': 1.0332434612791985e-05, 
            'MaximumInvocations': 1113, 
            'ModelLatency': 100000
            }, 
    'EndpointConfiguration': {
        'EndpointName': 'endpoint-name', 
        'VariantName': 'variant-name', 
        'InstanceType': 'ml.c5d.xlarge', 
        'InitialInstanceCount': 3
        }, 
    'ModelConfiguration': {
        'Compiled': False, 
        'EnvironmentParameters': []
        }
    }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1199', 
            'date': 'Tue, 26 Oct 2021 19:57:42 GMT'
            }, 
        'RetryAttempts': 0}
    }
```

The first few lines provide information about the load test job itself. This includes the job name, job ARN, status, and the creation and last modified times.

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `EndpointConfiguration` nested dictionary also contains the instance count (`InitialInstanceCount`) recommendation. This is the number of instances that you should provision in the endpoint to meet the `MaxInvocations` specified in the `StoppingConditions`. For example, if the `InstanceType` is `ml.m5.large` and the `InitialInstanceCount` is `2`, then you should provision 2 `ml.m5.large` instances for your endpoint so that it can handle the invocations per minute specified in the `MaxInvocations` stopping condition.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) for your real-time endpoint, the maximum number of `InvokeEndpoint` requests sent to the endpoint, and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication times taken to send the request and to fetch the response from the model container and the time taken to complete the inference in the container.
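When a job benchmarks several instance types, it can help to flatten and rank the recommendations. The following sketch (the helper name is our own) sorts the `InferenceRecommendations` list by cost per inference and converts `ModelLatency` from microseconds to milliseconds:

```python
def rank_recommendations(response):
    """Flatten the InferenceRecommendations list from a
    DescribeInferenceRecommendationsJob response into
    (instance type, instance count, cost per hour, latency in ms)
    tuples, sorted by cost per inference (cheapest first).
    ModelLatency is reported in microseconds."""
    recs = sorted(
        response["InferenceRecommendations"],
        key=lambda r: r["Metrics"]["CostPerInference"],
    )
    return [
        (
            r["EndpointConfiguration"]["InstanceType"],
            r["EndpointConfiguration"]["InitialInstanceCount"],
            r["Metrics"]["CostPerHour"],
            r["Metrics"]["ModelLatency"] / 1000.0,  # microseconds -> ms
        )
        for r in recs
    ]
```

For example, calling `rank_recommendations(load_test_response)` on the response above yields one tuple per benchmarked configuration, with the most cost-effective option first.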

The following example shows the `InferenceRecommendations` part of the response for a load test job that was configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMb": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the values you specified for `MaxConcurrency` and `MemorySizeInMB` when setting up the load test. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
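As a small illustration (the helper name is our own), the serverless-specific fields can be pulled out of a single recommendation entry. Note that `ModelSetupTime` is reported in microseconds, and that the response example spells the memory field `MemorySizeInMb`:

```python
def serverless_summary(recommendation):
    """Extract the serverless-specific fields from one entry of the
    InferenceRecommendations list. ModelSetupTime is reported in
    microseconds, so convert it to milliseconds for readability."""
    cfg = recommendation["EndpointConfiguration"]["ServerlessConfig"]
    return {
        "MaxConcurrency": cfg["MaxConcurrency"],
        "MemorySizeInMB": cfg["MemorySizeInMb"],  # note the field's casing in the response
        "ModelSetupTimeMs": recommendation["Metrics"]["ModelSetupTime"] / 1000.0,
    }
```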

------
#### [ AWS CLI ]

Collect metrics with the `describe-inference-recommendations-job` API. Specify the job name of the load test for the `job-name` flag:

```
aws sagemaker describe-inference-recommendations-job --job-name <job-name>
```

This returns a response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing Serverless Inference recommendations, see the example after this one).

```
{
    'JobName': 'job-name', 
    'JobDescription': 'job-description', 
    'JobType': 'Advanced', 
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id', 
    'Status': 'COMPLETED', 
    'CreationTime': datetime.datetime(2021, 10, 26, 19, 38, 30, 957000, tzinfo=tzlocal()), 
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 19, 46, 31, 399000, tzinfo=tzlocal()), 
    'InputConfig': {
        'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id', 
        'JobDurationInSeconds': 7200, 
        'TrafficPattern': {
            'TrafficType': 'PHASES'
            }, 
        'ResourceLimit': {
            'MaxNumberOfTests': 100, 
            'MaxParallelOfTests': 100
            }, 
        'EndpointConfigurations': [{
            'InstanceType': 'ml.c5d.xlarge'
            }]
        }, 
    'StoppingConditions': {
        'MaxInvocations': 1000, 
        'ModelLatencyThresholds': [{
            'Percentile': 'P95', 
            'ValueInMilliseconds': 100
            }]
        }, 
    'InferenceRecommendations': [{
        'Metrics': {
        'CostPerHour': 0.6899999976158142, 
        'CostPerInference': 1.0332434612791985e-05, 
        'MaxInvocations': 1113, 
        'ModelLatency': 100000
        }, 
        'EndpointConfiguration': {
            'EndpointName': 'endpoint-name', 
            'VariantName': 'variant-name', 
            'InstanceType': 'ml.c5d.xlarge', 
            'InitialInstanceCount': 3
            }, 
        'ModelConfiguration': {
            'Compiled': False, 
            'EnvironmentParameters': []
            }
        }], 
    'ResponseMetadata': {
        'RequestId': 'request-id', 
        'HTTPStatusCode': 200, 
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid', 
            'content-type': 'content-type', 
            'content-length': '1199', 
            'date': 'Tue, 26 Oct 2021 19:57:42 GMT'
            }, 
        'RetryAttempts': 0
        }
    }
```

The first few lines provide information about the load test job itself, such as the job name, job ARN, and the creation and last modified times. 

The `InferenceRecommendations` dictionary contains a list of Inference Recommender inference recommendations.

The `EndpointConfiguration` nested dictionary contains the instance type (`InstanceType`) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch Events. See [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) for more information.

The `Metrics` nested dictionary contains information about the estimated cost per hour (`CostPerHour`) for your real-time endpoint in US dollars, the estimated cost per inference (`CostPerInference`) for your real-time endpoint, the maximum number of `InvokeEndpoint` requests sent to the endpoint (`MaxInvocations`), and the model latency (`ModelLatency`), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication time taken to send the request and fetch the response from the model container and the time taken to complete the inference in the container.

The following example shows the `InferenceRecommendations` part of the response for a load test job that was configured to return serverless inference recommendations:

```
"InferenceRecommendations": [ 
      { 
         "EndpointConfiguration": { 
            "EndpointName": "value",
            "InitialInstanceCount": value,
            "InstanceType": "value",
            "VariantName": "value",
            "ServerlessConfig": {
                "MaxConcurrency": value,
                "MemorySizeInMB": value
            }
         },
         "InvocationEndTime": value,
         "InvocationStartTime": value,
         "Metrics": { 
            "CostPerHour": value,
            "CostPerInference": value,
            "CpuUtilization": value,
            "MaxInvocations": value,
            "MemoryUtilization": value,
            "ModelLatency": value,
            "ModelSetupTime": value
         },
         "ModelConfiguration": { 
            "Compiled": "False",
            "EnvironmentParameters": [],
            "InferenceSpecificationName": "value"
         },
         "RecommendationId": "value"
      }
   ]
```

You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the `ServerlessConfig`, which tells you the values you specified for `MaxConcurrency` and `MemorySizeInMB` when setting up the load test. Serverless recommendations also measure the metric `ModelSetupTime`, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the [Serverless Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).

------
#### [ Amazon SageMaker Studio Classic ]

The recommendations populate in a new tab called **Inference recommendations** within Studio Classic. It can take up to 2 hours for the results to show up. This tab contains **Results** and **Details** columns.

The **Details** column provides information about the load test job, such as the name given to the load test job, when the job was created (**Creation time**), and more. It also contains **Settings** information, such as the maximum number of invocations that occurred per minute and information about the Amazon Resource Names used.

The **Results** column provides **Deployment goals** and **SageMaker AI recommendations** sections in which you can adjust the order in which results are displayed based on deployment importance. There are three dropdown menus in which you can set the importance of **Cost**, **Latency**, and **Throughput** for your use case. For each goal (cost, latency, and throughput), you can set the level of importance: **Lowest importance**, **Low importance**, **Moderate importance**, **High importance**, or **Highest importance**. 

Based on your selections of importance for each goal, Inference Recommender displays its top recommendation in the **SageMaker recommendation** field on the right of the panel, along with the estimated cost per hour and per inference request. It also provides information about the expected model latency, maximum number of invocations, and the number of instances.

In addition to the top recommendation displayed, you can also see the same information displayed for all instances that Inference Recommender tested in the **All runs** section.

------
#### [ SageMaker AI console ]

You can view your custom load test job results in the SageMaker AI console by doing the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, choose the name of your inference recommendation job.

On the details page for your job, you can view the **Inference recommendations**, which are the instance types SageMaker AI recommends for your model, as shown in the following screenshot.

![\[Screenshot of the inference recommendations list on the job details page in the SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inf-rec-instant-recs.png)


In this section, you can compare the instance types by various factors such as **Model latency**, **Cost per hour**, **Cost per inference**, and **Invocations per minute**.

On this page, you can also view the configurations you specified for your job. In the **Monitor** section, you can view the Amazon CloudWatch metrics that were logged for each instance type. To learn more about interpreting these metrics, see [Interpret results](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html).

------

# Stop your load test
<a name="load-test-stop"></a>

You might want to stop a job that is currently running if you began a job by mistake or no longer need to run the job. Stop your load test jobs programmatically with the `StopInferenceRecommendationsJob` API, or through Studio Classic or the SageMaker AI console.

------
#### [ AWS SDK for Python (Boto3) ]

Specify the job name of the load test for the `JobName` field:

```
sagemaker_client.stop_inference_recommendations_job(
                                    JobName='<INSERT>'
                                    )
```
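To confirm that the job actually reaches a stopped state, you can poll `DescribeInferenceRecommendationsJob` after requesting the stop. The following is a minimal sketch; the helper name, the set of terminal statuses, and the polling interval are assumptions for illustration:

```python
import time

# Statuses after which polling can stop (assumed terminal set, for illustration).
TERMINAL_STATUSES = {"STOPPED", "COMPLETED", "FAILED"}

def wait_until_stopped(sagemaker_client, job_name, poll_seconds=30):
    """Poll the job status until it reaches a terminal state."""
    while True:
        status = sagemaker_client.describe_inference_recommendations_job(
            JobName=job_name
        )["Status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)

# Example usage (requires an existing Boto3 SageMaker client):
# final_status = wait_until_stopped(sagemaker_client, '<INSERT>')
```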

------
#### [ AWS CLI ]

Specify the job name of the load test for the `job-name` flag:

```
aws sagemaker stop-inference-recommendations-job --job-name <job-name>
```

------
#### [ Amazon SageMaker Studio Classic ]

To stop your Inference Recommender load test, close the tab where you initiated your custom load test job.

------
#### [ SageMaker AI console ]

To stop your load test job through the SageMaker AI console, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**, and then choose **Inference recommender**.

1. On the **Inference recommender jobs** page, select your load test job.

1. Choose **Stop job**.

1. In the dialog box that pops up, choose **Confirm**.

After you stop your job, its **Status** changes to **Stopping**.

------

# Troubleshoot Inference Recommender errors
<a name="inference-recommender-troubleshooting"></a>

This section contains information about how to understand and prevent common errors, the error messages they generate, and guidance on how to resolve these errors.

## How to troubleshoot
<a name="inference-recommender-troubleshooting-how-to"></a>

You can attempt to resolve your error by going through the following steps:
+ Check if you've covered all the prerequisites to use Inference Recommender. See the [Inference Recommender Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-prerequisites.html).
+ Check that you are able to deploy your model from Model Registry to an endpoint and that it can process your payloads without errors. See [Deploy a Model from the Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-deploy.html).
+ When you kick off an Inference Recommender job, you should see endpoints being created in the console and you can review the CloudWatch logs.

## Common errors
<a name="inference-recommender-troubleshooting-common"></a>

Review the following table for common Inference Recommender errors and their solutions.


| Error | Solution | 
| --- | --- | 
|  Specify `Domain` in the Model Package version 1. `Domain` is a mandatory parameter for the job.  |  Make sure you provide the ML domain or `OTHER` if unknown.  | 
|  Provided role ARN cannot be assumed and an `AWSSecurityTokenServiceException` error occurred.  |  Make sure the execution role provided has the necessary permissions specified in the prerequisites.  | 
|  Specify `Framework` in the Model Package version 1. `Framework` is a mandatory parameter for the job.  |  Make sure you provide the ML framework or `OTHER` if unknown.  | 
|  Users at the end of prev phase is 0 while initial users of current phase is 1.  |  Users here refers to virtual users or threads used to send requests. Each phase starts with A users and ends with B users such that B > A. Between sequential phases x1 and x2, we require that abs(x2.A - x1.B) is at least 0 and at most 3.  | 
|  Total Traffic duration (across all phases) should not be more than Job duration.  |  The total duration of all your Phases cannot exceed the Job duration.  | 
|  Burstable instance type ml.t2.medium is not allowed.  |  Inference Recommender doesn't support load testing on t2 instance family because burstable instances do not provide consistent performance.  | 
|  ResourceLimitExceeded when calling CreateEndpoint operation  |  You have exceeded a SageMaker AI resource limit. For example, Inference Recommender might be unable to provision endpoints for benchmarking if the account has reached the endpoint quota. For more information about SageMaker AI limits and quotas, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).  | 
|  ModelError when calling InvokeEndpoint operation  |  A model error can happen for the following reasons: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-troubleshooting.html)  | 
|  PayloadError when calling InvokeEndpoint operation  |  A payload error can happen for following reasons: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-troubleshooting.html)  | 

## Check CloudWatch
<a name="inference-recommender-troubleshooting-check-cw"></a>

When you kick off an Inference Recommender job, you should see endpoints being created in the console. Select one of the endpoints and view the CloudWatch logs to monitor for any 4xx/5xx errors. If you have a successful Inference Recommender job, you will be able to see the endpoint names as part of the results. Even if your Inference Recommender job is unsuccessful, you can still check the CloudWatch logs for the deleted endpoints by following the steps below:

1. Open the Amazon CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Select the Region in which you created the Inference Recommender job from the **Region** dropdown list in the top right.

1. In the navigation pane of CloudWatch, choose **Logs**, and then select **Log groups**.

1. Search for the log group called `/aws/sagemaker/Endpoints/sm-epc-*`. Select the log group based on your most recent Inference Recommender job.

You can also troubleshoot your job by checking the Inference Recommender CloudWatch logs. The Inference Recommender logs, which are published in the `/aws/sagemaker/InferenceRecommendationsJobs` CloudWatch log group, give a high level view on the progress of the job in the `<jobName>/execution` log stream. You can find detailed information on each of the endpoint configurations being tested in the `<jobName>/Endpoint/<endpointName>` log stream.

**Overview of the Inference Recommender log streams**
+ `<jobName>/execution` contains overall job information such as endpoint configurations scheduled for benchmarking, compilation job skip reason, and validation failure reason.
+ `<jobName>/Endpoint/<endpointName>` contains information such as resource creation progress, test configuration, load test stop reason, and resource cleanup status.
+ `<jobName>/CompilationJob/<compilationJobName>` contains information on compilation jobs created by Inference Recommender, such as the compilation job configuration and compilation job status.
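These log streams can also be read programmatically with the CloudWatch Logs `FilterLogEvents` API. The following is a minimal sketch, assuming the log group and stream naming conventions described above; the function names and the commented-out region are illustrative:

```python
LOG_GROUP = "/aws/sagemaker/InferenceRecommendationsJobs"

def execution_stream_prefix(job_name):
    # Stream naming follows the <jobName>/execution convention described above.
    return f"{job_name}/execution"

def fetch_execution_events(logs_client, job_name):
    """Return the messages from the job's high-level execution log stream."""
    response = logs_client.filter_log_events(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=execution_stream_prefix(job_name),
    )
    return [event["message"] for event in response["events"]]

# Example usage (requires AWS credentials):
# import boto3
# logs_client = boto3.client("logs", region_name="us-east-1")
# for message in fetch_execution_events(logs_client, "my-recommender-job"):
#     print(message)
```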

**Create an alarm for Inference Recommender error messages**

Inference Recommender outputs log statements for errors that might be helpful while troubleshooting. With a CloudWatch log group and a metric filter, you can look for terms and patterns in this log data as the data is sent to CloudWatch. Then, you can create a CloudWatch alarm based on the log group-metric filter. For more information, see [Create a CloudWatch alarm based on a log group-metric filter](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_alarm_log_group_metric_filter.html).
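As a concrete starting point, the metric filter might look like the following sketch. The filter name, filter pattern, and metric namespace are assumptions you should adapt to your own logging conventions:

```python
# Illustrative metric filter: count log lines containing "ERROR" in the
# Inference Recommender log group so a CloudWatch alarm can watch the count.
# The filter name, pattern, and namespace below are assumptions.
metric_filter = {
    "logGroupName": "/aws/sagemaker/InferenceRecommendationsJobs",
    "filterName": "InferenceRecommenderErrors",
    "filterPattern": "ERROR",
    "metricTransformations": [
        {
            "metricName": "InferenceRecommenderErrorCount",
            "metricNamespace": "Custom/InferenceRecommender",
            "metricValue": "1",
        }
    ],
}

# Example usage (requires AWS credentials):
# import boto3
# logs_client = boto3.client("logs")
# logs_client.put_metric_filter(**metric_filter)
```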

## Check benchmarks
<a name="inference-recommender-troubleshooting-check-benchmarks"></a>

When you kick off an Inference Recommender job, Inference Recommender creates several benchmarks to evaluate the performance of your model on different instance types. You can use the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API to view the details for all the benchmarks. If you have a failed benchmark, you can see the failure reasons as part of the results.

To use the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API, provide the following values:
+ For `JobName`, provide the name of the Inference Recommender job.
+ For `StepType`, use `BENCHMARK` to return details about the job's benchmarks.
+ For `Status`, use `FAILED` to return details about only the failed benchmarks. For a list of the other status types, see the `Status` field in the [ListInferenceRecommendationsJobSteps](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListInferenceRecommendationsJobSteps.html) API.

```
# Create a low-level SageMaker service client.
import boto3
aws_region = '<region>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# Provide the job name for the SageMaker Inference Recommender job
job_name = '<job-name>'

# Filter for benchmarks
step_type = 'BENCHMARK' 

# Filter for benchmarks that have a FAILED status
status = 'FAILED'

response = sagemaker_client.list_inference_recommendations_job_steps(
    JobName = job_name,
    StepType = step_type,
    Status = status
)
```

You can print the response object to view the results. The preceding code example stores the response in a variable called `response`:

```
print(response)
```
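Once you have the response, you can pull the failure reasons out of each benchmark step. The following is a minimal sketch, assuming each failed step carries an `InferenceBenchmark` entry with a `FailureReason` field; the helper name and the sample response values are illustrative:

```python
# Illustrative helper: collect (instance type, failure reason) pairs from a
# list_inference_recommendations_job_steps response.
def failed_benchmark_reasons(response):
    reasons = []
    for step in response.get("Steps", []):
        benchmark = step.get("InferenceBenchmark", {})
        reasons.append(
            (
                benchmark.get("EndpointConfiguration", {}).get("InstanceType"),
                benchmark.get("FailureReason"),
            )
        )
    return reasons

# Sample response with made-up values.
sample_response = {
    "Steps": [
        {
            "StepType": "BENCHMARK",
            "Status": "FAILED",
            "InferenceBenchmark": {
                "EndpointConfiguration": {"InstanceType": "ml.c5.xlarge"},
                "FailureReason": "ModelError when calling InvokeEndpoint",
            },
        }
    ]
}

for instance_type, reason in failed_benchmark_reasons(sample_response):
    print(instance_type, reason)
```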