Evaluate the performance of RAG sources using Amazon Bedrock evaluations

You can use computed metrics to evaluate how effectively a Retrieval Augmented Generation (RAG) system retrieves relevant information from your data sources, and how well the generated responses answer questions. The results of a RAG evaluation let you compare different Amazon Bedrock Knowledge Bases and other RAG sources, and then choose the best Knowledge Base or RAG system for your application.

You can set up two different types of RAG evaluation jobs.

  • Retrieve only – In a retrieve-only RAG evaluation job, the report is based on the data retrieved from your RAG source. You can either evaluate an Amazon Bedrock Knowledge Base, or you can bring your own inference response data from an external RAG source. A minimal API sketch for this job type follows this list.

  • Retrieve and generate – In a retrieve-and-generate RAG evaluation job, the report is based on the data retrieved from your knowledge base and the responses generated by the response generator model. You can either use an Amazon Bedrock Knowledge Base and a response generator model, or you can bring your own inference response data from an external RAG source.
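For example, a retrieve-only job can be created with the CreateEvaluationJob API. The following Python (boto3) sketch is illustrative only: the job name, role ARN, S3 URIs, knowledge base ID, task type, metric names, and retrieval settings are placeholder assumptions, so confirm the exact request shape and supported metric names in the CreateEvaluationJob API reference.

```python
import boto3

bedrock = boto3.client("bedrock")

# Sketch of a retrieve-only RAG evaluation job against a knowledge base.
# All identifiers (job name, role ARN, S3 URIs, knowledge base ID) are
# placeholders, and the task type, metric names, and retrieval settings are
# examples only; confirm them against the CreateEvaluationJob API reference.
response = bedrock.create_evaluation_job(
    jobName="my-retrieve-only-rag-eval",
    roleArn="arn:aws:iam::111122223333:role/MyEvaluationRole",
    applicationType="RagEvaluation",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "rag-eval-prompts",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/input/prompts.jsonl"
                        },
                    },
                    # Example built-in retrieval metrics.
                    "metricNames": [
                        "Builtin.ContextRelevance",
                        "Builtin.ContextCoverage",
                    ],
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    inferenceConfig={
        "ragConfigs": [
            {
                "knowledgeBaseConfig": {
                    "retrieveConfig": {
                        "knowledgeBaseId": "KB12345678",
                        "knowledgeBaseRetrievalConfiguration": {
                            "vectorSearchConfiguration": {"numberOfResults": 5}
                        },
                    }
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/output/"},
)
print(response["jobArn"])
```

In a retrieve-and-generate job, the knowledge base configuration instead carries a retrieve-and-generate configuration that also names the response generator model; again, see the API reference for the exact field names.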

Supported models

To create a RAG evaluation job, you need access to at least one of the evaluator models in the following lists. To create a retrieve-and-generate job that uses an Amazon Bedrock model to generate the responses, you also need access to at least one of the listed response generator models.

To learn more about gaining access to models and Region availability, see Access Amazon Bedrock foundation models.

Supported evaluator models (built-in metrics)

  • Amazon Nova Pro – amazon.nova-pro-v1:0

  • Anthropic Claude 3.5 Sonnet v1 – anthropic.claude-3-5-sonnet-20240620-v1:0

  • Anthropic Claude 3.5 Sonnet v2 – anthropic.claude-3-5-sonnet-20241022-v2:0

  • Anthropic Claude 3.7 Sonnet – anthropic.claude-3-7-sonnet-20250219-v1:0

  • Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0

  • Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0

  • Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0

  • Mistral Large – mistral.mistral-large-2402-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
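When you use a cross-Region inference profile, you pass the inference profile ID (the model ID with a geography prefix such as us. or eu.) wherever the evaluator model identifier is expected, for example in the evaluator model configuration shown in the earlier sketch. A minimal illustration, assuming the US profile for Claude 3.5 Sonnet is available to you:

```python
# Evaluator model referenced through a cross-Region inference profile ID
# (the base model ID with a geography prefix such as "us." or "eu.").
# The profile shown is an example; use one available in your Region.
evaluator_model_config = {
    "bedrockEvaluatorModels": [
        {"modelIdentifier": "us.anthropic.claude-3-5-sonnet-20240620-v1:0"}
    ]
}
```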

Supported evaluator models (custom metrics)

  • Mistral Large 24.02 – mistral.mistral-large-2402-v1:0

  • Mistral Large 24.07 – mistral.mistral-large-2407-v1:0

  • Anthropic Claude 3.5 Sonnet v1 – anthropic.claude-3-5-sonnet-20240620-v1:0

  • Anthropic Claude 3.5 Sonnet v2 – anthropic.claude-3-5-sonnet-20241022-v2:0

  • Anthropic Claude 3.7 Sonnet – anthropic.claude-3-7-sonnet-20250219-v1:0

  • Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0

  • Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0

  • Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0

  • Meta Llama 3.3 70B Instruct – meta.llama3-3-70b-instruct-v1:0

  • Amazon Nova Pro – amazon.nova-pro-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
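For custom metrics, the evaluation job pairs your metric definitions with one of the evaluator models listed above. The following sketch shows one plausible shape of that configuration; the field names (customMetricConfig, customMetricDefinition, ratingScale), the prompt variables, and the sample metric are assumptions for illustration, so verify the exact request structure in the CreateEvaluationJob API reference.

```python
# Illustrative custom-metric configuration for a RAG evaluation job.
# Field names, the prompt variables, and the sample metric are assumptions;
# confirm the exact request shape in the CreateEvaluationJob API reference.
custom_metric_config = {
    "customMetrics": [
        {
            "customMetricDefinition": {
                "name": "Groundedness",
                "instructions": (
                    "Given the question {{prompt}} and the retrieved passages "
                    "{{context}}, rate how well the response {{prediction}} is "
                    "supported by the passages."
                ),
                "ratingScale": [
                    {"definition": "Fully supported", "value": {"floatValue": 1.0}},
                    {"definition": "Partially supported", "value": {"floatValue": 0.5}},
                    {"definition": "Not supported", "value": {"floatValue": 0.0}},
                ],
            }
        }
    ],
    "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
            {"modelIdentifier": "mistral.mistral-large-2407-v1:0"}
        ]
    },
}
```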

Supported response generator models

You can use the following model types in Amazon Bedrock as the response generator model in an evaluation job. You can also bring your own inference response data from models outside of Amazon Bedrock.