Evaluate Models Hosted on SageMaker Inference
This guide explains how to evaluate your customized Amazon Nova models deployed on SageMaker inference endpoints using Inspect AI.
Note
For a hands-on walkthrough, see the SageMaker Inspect AI quickstart notebook.
Overview
You can evaluate your customized Amazon Nova models deployed on SageMaker endpoints using standardized benchmarks from the AI research community. This approach enables you to:
- Evaluate customized Amazon Nova models (fine-tuned, distilled, or otherwise adapted) at scale
- Run evaluations with parallel inference across multiple endpoint instances
- Compare model performance using benchmarks like MMLU, TruthfulQA, and HumanEval
- Integrate with your existing SageMaker infrastructure
Supported models
The SageMaker inference provider works with:
- Amazon Nova models (Nova Micro, Nova Lite, Nova Lite 2)
- Models deployed via vLLM or OpenAI-compatible inference servers
- Any endpoint that supports the OpenAI Chat Completions API format
Prerequisites
Before you begin, ensure you have:
- An AWS account with permissions to create and invoke SageMaker endpoints
- AWS credentials configured via AWS CLI, environment variables, or IAM role
- Python 3.9 or higher
Required IAM permissions
Your IAM user or role needs the following permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "sagemaker:InvokeEndpoint", "sagemaker:DescribeEndpoint" ], "Resource": "arn:aws:sagemaker:*:*:endpoint/*" } ] }
Step 1: Deploy a SageMaker endpoint
Before running evaluations, you need a SageMaker inference endpoint running your model.
For instructions on creating a SageMaker inference endpoint with Amazon Nova models, see Getting Started.
Once your endpoint is in InService status, note the endpoint name for use in the evaluation commands.
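If you want to confirm the status programmatically rather than in the console, the following minimal sketch uses boto3's `describe_endpoint` call. The endpoint name and region are illustrative placeholders; replace them with your own values.

```python
import boto3

ENDPOINT_NAME = "my-nova-endpoint"  # hypothetical endpoint name
REGION = "us-west-2"                # region where the endpoint is deployed

sagemaker = boto3.client("sagemaker", region_name=REGION)

# DescribeEndpoint reports the current status (e.g., Creating, InService, Failed)
response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
print(f"Endpoint status: {response['EndpointStatus']}")
```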
Step 2: Install evaluation dependencies
Create a Python virtual environment and install the required packages.
```bash
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install uv for faster package installation
pip install uv

# Install Inspect AI and evaluation benchmarks
uv pip install inspect-ai inspect-evals

# Install AWS dependencies
uv pip install aioboto3 boto3 botocore openai
```
Step 3: Configure AWS credentials
Choose one of the following authentication methods:
Option 1: AWS CLI (Recommended)
```bash
aws configure
```
Enter your AWS Access Key ID, Secret Access Key, and default region when prompted.
Option 2: Environment variables
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```
Option 3: IAM role
If running on Amazon EC2 or SageMaker notebooks, the instance's IAM role is used automatically.
Verify credentials
```python
import boto3

sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"User/Role: {identity['Arn']}")
```
Step 4: Install the SageMaker provider
The SageMaker provider enables Inspect AI to communicate with your SageMaker endpoints. The provider installation process is streamlined in the quickstart notebook.
Step 5: Download evaluation benchmarks
Clone the Inspect Evals repository to access standard benchmarks:
```bash
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
```
This repository includes benchmarks such as:
- MMLU and MMLU-Pro (knowledge and reasoning)
- TruthfulQA (truthfulness)
- HumanEval (code generation)
- GSM8K (mathematical reasoning)
Step 6: Run evaluations
Run an evaluation using your SageMaker endpoint:
```bash
cd inspect_evals/src/inspect_evals/

inspect eval mmlu_pro/mmlu_pro.py \
  --model sagemaker/my-nova-endpoint \
  -M region_name=us-west-2 \
  --max-connections 256 \
  --max-retries 100 \
  --display plain
```
Key parameters
| Parameter | Default | Description |
|---|---|---|
| `--max-connections` | 10 | Number of parallel requests to the endpoint. Scale with instance count (e.g., 10 instances × 25 = 250). |
| `--max-retries` | 3 | Retry attempts for failed requests. Use 50-100 for large evaluations. |
| `-M region_name` | us-east-1 | AWS region where your endpoint is deployed. |
| `-M read_timeout` | 600 | Request timeout in seconds. |
| `-M connect_timeout` | 60 | Connection timeout in seconds. |
Tuning recommendations
For a multi-instance endpoint:
```bash
# 10-instance endpoint example
--max-connections 250   # ~25 connections per instance
--max-retries 100       # Handle transient errors
```
Setting `--max-connections` too high may overwhelm the endpoint and cause throttling. Setting it too low underutilizes capacity.
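As a starting point, you can derive a value from the endpoint's current instance count using the ~25-connections-per-instance heuristic above. The sketch below is illustrative; the endpoint name, region, and multiplier are assumptions to adapt to your setup.

```python
import boto3

ENDPOINT_NAME = "my-nova-endpoint"   # hypothetical endpoint name
CONNECTIONS_PER_INSTANCE = 25        # heuristic from the guidance above

sagemaker = boto3.client("sagemaker", region_name="us-west-2")
endpoint = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

# Sum instance counts across the endpoint's production variants
instance_count = sum(
    variant.get("CurrentInstanceCount", 0)
    for variant in endpoint["ProductionVariants"]
)
print(f"Suggested --max-connections: {instance_count * CONNECTIONS_PER_INSTANCE}")
```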
Step 7: View results
Launch the Inspect AI viewer to analyze evaluation results:
```bash
inspect view
```
The viewer displays:
- Overall scores and metrics
- Per-sample results with model responses
- Error analysis and failure patterns
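If you prefer to read results programmatically instead of (or in addition to) the viewer, Inspect AI writes evaluation logs that can be loaded from Python. A minimal sketch, assuming logs were written to the default `./logs` directory; the exact log fields may vary by Inspect AI version.

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# Pick an evaluation log from the default log directory
logs = list_eval_logs("./logs")
log = read_eval_log(logs[0])

print(f"Task:  {log.eval.task}")
print(f"Model: {log.eval.model}")

# Print scorer names and their aggregate metrics, if the run completed
if log.results is not None:
    for score in log.results.scores:
        print(score.name, score.metrics)
```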
Managing endpoints
Update an endpoint
To update an existing endpoint with a new model or configuration:
```python
import boto3

sagemaker = boto3.client('sagemaker', region_name=REGION)

# Create new model and endpoint configuration
# Then update the endpoint
sagemaker.update_endpoint(
    EndpointName=EXISTING_ENDPOINT_NAME,
    EndpointConfigName=NEW_ENDPOINT_CONFIG_NAME
)
```
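Updating an endpoint takes several minutes. If you want to block until the update completes, boto3 provides an `endpoint_in_service` waiter; the sketch below reuses the `sagemaker` client and endpoint name placeholder from the snippet above.

```python
# Block until the updated endpoint returns to InService
waiter = sagemaker.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=EXISTING_ENDPOINT_NAME)
```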
Delete an endpoint
```python
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```
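Deleting the endpoint does not remove its associated endpoint configuration and model. A cleanup sketch, assuming placeholder names for the configuration and model you created alongside the endpoint:

```python
# Remove the endpoint configuration and model that backed the endpoint
sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
sagemaker.delete_model(ModelName=MODEL_NAME)
```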
Onboarding custom benchmarks
You can add new benchmarks to Inspect AI using the following workflow:
- Study the benchmark's dataset format and evaluation metrics
- Review similar implementations in `inspect_evals/`
- Create a task file that converts dataset records to Inspect AI samples
- Implement appropriate solvers and scorers
- Validate with a small test run
Example task structure:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def my_benchmark():
    return Task(
        dataset=hf_dataset("dataset_name", split="test"),
        solver=multiple_choice(),
        scorer=choice()
    )
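```

For the small validation run mentioned above, you can invoke the task through Inspect AI's Python `eval()` function instead of the CLI. This is a sketch: the endpoint name and the 10-sample limit are illustrative, and it assumes `my_benchmark` is defined in (or imported into) the same file.

```python
from inspect_ai import eval

# Quick validation pass against a handful of samples
eval(
    my_benchmark(),
    model="sagemaker/my-nova-endpoint",      # hypothetical endpoint name
    model_args={"region_name": "us-west-2"},
    limit=10,                                # small test run
)
```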
Troubleshooting
Common issues
Endpoint throttling or timeouts
- Reduce `--max-connections`
- Increase `--max-retries`
- Check endpoint CloudWatch metrics for capacity issues (see the sketch below)
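The sketch below pulls a few SageMaker endpoint metrics from CloudWatch so you can look for error spikes or elevated latency. The endpoint name, region, and variant name (`AllTraffic` is the common default) are assumptions to adjust for your deployment.

```python
import boto3
from datetime import datetime, timedelta, timezone

ENDPOINT_NAME = "my-nova-endpoint"  # hypothetical endpoint name
VARIANT_NAME = "AllTraffic"         # default variant name; adjust if yours differs

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
now = datetime.now(timezone.utc)

for metric in ["Invocations", "Invocation4XXErrors", "Invocation5XXErrors", "ModelLatency"]:
    # Latency is best viewed as an average; the other metrics as sums
    stat = "Average" if metric == "ModelLatency" else "Sum"
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": ENDPOINT_NAME},
            {"Name": "VariantName", "Value": VARIANT_NAME},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=[stat],
    )
    print(metric, stats["Datapoints"])
```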
Authentication errors
- Verify AWS credentials are configured correctly
- Check IAM permissions include `sagemaker:InvokeEndpoint`
Model errors
- Verify the endpoint is in `InService` status
- Check that the model supports the OpenAI Chat Completions API format (see the sketch below)
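As a quick compatibility check, you can send a single OpenAI-style chat request directly to the endpoint with the SageMaker runtime client. This is a sketch assuming a vLLM or other OpenAI-compatible serving container and a hypothetical endpoint name; the exact payload fields your container accepts may differ.

```python
import json
import boto3

ENDPOINT_NAME = "my-nova-endpoint"  # hypothetical endpoint name

runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Minimal OpenAI Chat Completions-style payload; field support varies by container
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```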