Evaluate the performance of RAG sources using Amazon Bedrock evaluations

You can use computed metrics to evaluate how effectively a Retrieval Augmented Generation (RAG) system retrieves relevant information from your data sources, and how well the generated responses answer questions. The results of a RAG evaluation let you compare different Amazon Bedrock Knowledge Bases and other RAG sources, and then choose the best Knowledge Base or RAG system for your application.

You can set up two different types of RAG evaluation jobs.

  • Retrieve only – In a retrieve-only RAG evaluation job, the report is based on the data retrieved from your RAG source. You can either evaluate an Amazon Bedrock Knowledge Base, or you can bring your own inference response data from an external RAG source. A minimal API sketch for this job type follows this list.

  • Retrieve and generate – In a retrieve-and-generate RAG evaluation job, the report is based on the data retrieved from your knowledge base and the responses generated by the response generator model. You can either use an Amazon Bedrock Knowledge Base and a response generator model, or you can bring your own inference response data from an external RAG source.
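For example, a retrieve-only job can be created with the CreateEvaluationJob API. The following Python (boto3) sketch is illustrative only: the job name, role ARN, S3 URIs, knowledge base ID, task type, metric names, and retrieval settings are placeholder assumptions, so confirm the exact request shape and supported metric names in the CreateEvaluationJob API reference.

```python
import boto3

bedrock = boto3.client("bedrock")

# Sketch of a retrieve-only RAG evaluation job against a knowledge base.
# All identifiers (job name, role ARN, S3 URIs, knowledge base ID) are
# placeholders, and the task type, metric names, and retrieval settings are
# examples only; confirm them against the CreateEvaluationJob API reference.
response = bedrock.create_evaluation_job(
    jobName="my-retrieve-only-rag-eval",
    roleArn="arn:aws:iam::111122223333:role/MyEvaluationRole",
    applicationType="RagEvaluation",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "rag-eval-prompts",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/input/prompts.jsonl"
                        },
                    },
                    # Example built-in retrieval metrics.
                    "metricNames": [
                        "Builtin.ContextRelevance",
                        "Builtin.ContextCoverage",
                    ],
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    inferenceConfig={
        "ragConfigs": [
            {
                "knowledgeBaseConfig": {
                    "retrieveConfig": {
                        "knowledgeBaseId": "KB12345678",
                        "knowledgeBaseRetrievalConfiguration": {
                            "vectorSearchConfiguration": {"numberOfResults": 5}
                        },
                    }
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/output/"},
)
print(response["jobArn"])
```

In a retrieve-and-generate job, the knowledge base configuration instead carries a retrieve-and-generate configuration that also names the response generator model; again, see the API reference for the exact field names.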

Supported models

To create a RAG evaluation job, you need access to at least one of the evaluator models in the following lists. To create a retrieve-and-generate job that uses an Amazon Bedrock model to generate the responses, you also need access to at least one of the listed response generator models.

To learn more about gaining access to models and Region availability, see Access Amazon Bedrock foundation models.

Supported evaluator models (built-in metrics)

  • Amazon Nova Pro – amazon.nova-pro-v1:0

  • Anthropic Claude 3.5 Sonnet v1 – anthropic.claude-3-5-sonnet-20240620-v1:0

  • Anthropic Claude 3.5 Sonnet v2 – anthropic.claude-3-5-sonnet-20241022-v2:0

  • Anthropic Claude 3.7 Sonnet – anthropic.claude-3-7-sonnet-20250219-v1:0

  • Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0

  • Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0

  • Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0

  • Mistral Large – mistral.mistral-large-2402-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
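When you use a cross-Region inference profile, you pass the inference profile ID (the model ID with a geography prefix such as us. or eu.) wherever the evaluator model identifier is expected, for example in the evaluator model configuration shown in the earlier sketch. A minimal illustration, assuming the US profile for Claude 3.5 Sonnet is available to you:

```python
# Evaluator model referenced through a cross-Region inference profile ID
# (the base model ID with a geography prefix such as "us." or "eu.").
# The profile shown is an example; use one available in your Region.
evaluator_model_config = {
    "bedrockEvaluatorModels": [
        {"modelIdentifier": "us.anthropic.claude-3-5-sonnet-20240620-v1:0"}
    ]
}
```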

Supported evaluator models (custom metrics)

  • Mistral Large 24.02 – mistral.mistral-large-2402-v1:0

  • Mistral Large 24.07 – mistral.mistral-large-2407-v1:0

  • Anthropic Claude 3.5 Sonnet v1 – anthropic.claude-3-5-sonnet-20240620-v1:0

  • Anthropic Claude 3.5 Sonnet v2 – anthropic.claude-3-5-sonnet-20241022-v2:0

  • Anthropic Claude 3.7 Sonnet – anthropic.claude-3-7-sonnet-20250219-v1:0

  • Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0

  • Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0

  • Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0

  • Meta Llama 3.3 70B Instruct – meta.llama3-3-70b-instruct-v1:0

  • Amazon Nova Pro – amazon.nova-pro-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
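For custom metrics, the evaluation job pairs your metric definitions with one of the evaluator models listed above. The following sketch shows one plausible shape of that configuration; the field names (customMetricConfig, customMetricDefinition, ratingScale), the prompt variables, and the sample metric are assumptions for illustration, so verify the exact request structure in the CreateEvaluationJob API reference.

```python
# Illustrative custom-metric configuration for a RAG evaluation job.
# Field names, the prompt variables, and the sample metric are assumptions;
# confirm the exact request shape in the CreateEvaluationJob API reference.
custom_metric_config = {
    "customMetrics": [
        {
            "customMetricDefinition": {
                "name": "Groundedness",
                "instructions": (
                    "Given the question {{prompt}} and the retrieved passages "
                    "{{context}}, rate how well the response {{prediction}} is "
                    "supported by the passages."
                ),
                "ratingScale": [
                    {"definition": "Fully supported", "value": {"floatValue": 1.0}},
                    {"definition": "Partially supported", "value": {"floatValue": 0.5}},
                    {"definition": "Not supported", "value": {"floatValue": 0.0}},
                ],
            }
        }
    ],
    "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
            {"modelIdentifier": "mistral.mistral-large-2407-v1:0"}
        ]
    },
}
```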

Supported response generator models

You can use the following model types in Amazon Bedrock as the response generator model in an evaluation job. You can also bring your own inference response data from models outside of Amazon Bedrock.