

# How it works

Amazon Bedrock AgentCore Evaluations provides capabilities to assess the performance of AI agents. It can compute metrics such as an agent's end-to-end task completion (goal attainment), the accuracy of the tools the agent invokes while handling a user request, and any custom metric defined to evaluate specific dimensions of an agent's behavior. AgentCore Evaluations can evaluate AI agents hosted on AgentCore Runtime as well as AI agents hosted outside of AgentCore.

You can create and manage evaluations and related resources using the AgentCore CLI, the AgentCore Python SDK, the AWS Management Console, or the AWS SDKs directly.

**Topics**
+ [Evaluation terminology](evaluations-terminology.md)
+ [Evaluators](evaluators.md)
+ [Evaluation types](evaluations-types.md)

# Evaluation terminology

Understanding key concepts and terminology is essential for effectively using AgentCore Evaluations. The following terms define the core components and processes involved in agent evaluation.

**Topics**
+ [Agent Framework](#agent-framework)
+ [Instrumentation Library](#instrumentation-library)
+ [Instrumentation Agent](#instrumentation-agent)
+ [Session](#session)
+ [Trace](#trace)
+ [Tool Call](#tool-call)
+ [Reference Free Large Language Models (LLMs) as judges](#llms-as-judges)

## Agent Framework

An agent framework provides the foundational components for building, orchestrating, and running agent-based applications. Frameworks define structures such as steps, tools, control flow, and memory management. Common industry frameworks include [Strands Agents](https://strandsagents.com/latest/) and **LangGraph**. These frameworks help standardize how agents are constructed and make them easier to instrument and evaluate.

AgentCore Evaluations currently supports Strands Agents and **LangGraph** agent frameworks.

## Instrumentation Library


An instrumentation library records telemetry generated by your agent during execution. This telemetry can include traces, spans, tool calls, model invocations, and intermediate steps. Libraries such as **OpenTelemetry** and **OpenInference** offer standardized APIs and semantic conventions that allow you to capture agent behavior with minimal code changes. Instrumentation is required for trace collection and evaluation.

AgentCore Evaluations currently supports **OpenTelemetry** and **OpenInference** as instrumentation libraries.

## Instrumentation Agent


An instrumentation agent automatically captures telemetry from application code, processes it, and exports it to a backend service for storage or evaluation. Tools such as **ADOT (AWS Distro for OpenTelemetry)** provide a vendor-neutral, production-ready auto-instrumentation agent that dynamically injects bytecode to capture traces without code changes. The agent is a key component in enabling automated evaluation.

AgentCore Evaluations currently supports **ADOT (AWS Distro for OpenTelemetry)** as the instrumentation agent.

## Session


A session represents a logical grouping of related interactions from a single user or workflow. A session may contain one or more traces. Sessions help you view and evaluate agent behavior across multi-step interactions, rather than focusing on individual requests.

## Trace


A trace is a complete record of a single agent execution or request. A trace contains one or more spans, which represent the individual operations performed during that execution. Traces provide end-to-end visibility into agent decisions and tool usage.

## Tool Call


A tool call is a span that represents an agent’s invocation of an external function, API, or capability. Tool call spans typically capture information such as the tool name, input parameters, execution time, and output. Tool call details are used to evaluate whether the agent selected and used tools correctly and efficiently.

## Reference Free Large Language Models (LLMs) as judges


Large Language Models (LLMs) as judges refers to an evaluation method that uses a large language model (LLM) to automatically assess the quality, correctness, or effectiveness of an agent or another model’s output. Instead of relying on manual review or rule-based checks, the LLM is prompted with evaluation criteria and produces a score, label, or explanation based on the input and output being evaluated. Unlike traditional evaluations that rely on ground-truth data, LLM-as-a-judge methods rely on the model’s internal knowledge to make judgments. This approach enables scalable, consistent, and customizable qualitative assessments, such as correctness, reasoning quality, or instruction adherence, across large numbers of agent interactions or model responses.
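The judging flow above can be illustrated with a minimal sketch: build a prompt that carries the evaluation criteria, then parse a structured verdict out of the judge model's free-form reply. The prompt template and JSON verdict shape are assumptions for illustration, and the model call itself is simulated:

```python
import json
import re

def build_judge_prompt(criteria: str, user_input: str, agent_output: str) -> str:
    """Assemble a reference-free judging prompt (illustrative template)."""
    return (
        f"You are an impartial evaluator. Criterion: {criteria}\n"
        f"User input:\n{user_input}\n\nAgent output:\n{agent_output}\n\n"
        'Respond with JSON like {"score": <1-5>, "explanation": "..."}.'
    )

def parse_judgment(model_response: str) -> dict:
    """Extract and validate the JSON verdict from the judge's raw text."""
    match = re.search(r"\{.*\}", model_response, re.DOTALL)
    if not match:
        raise ValueError("no JSON verdict found in judge response")
    verdict = json.loads(match.group(0))
    if not 1 <= verdict["score"] <= 5:
        raise ValueError("score outside the 1-5 range")
    return verdict

# Simulated judge response; a real system would call an LLM here
raw = 'Sure. {"score": 4, "explanation": "Answer is correct but verbose."}'
print(parse_judgment(raw)["score"])  # → 4
```

Parsing defensively matters in practice: judge models often wrap their verdict in conversational text, so the score must be extracted and range-checked rather than trusted as-is.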

# Evaluators

Evaluators are the core components that assess your agent’s performance across different dimensions. They analyze agent traces and provide quantitative scores based on specific criteria such as helpfulness, accuracy, or custom business metrics. AgentCore Evaluations offers both built-in evaluators for common use cases and the flexibility to create custom evaluators tailored to your specific requirements.

**Topics**
+ [Built-in evaluators](#built-in-evaluators)
+ [Custom evaluators](#custom-evaluators-hiw)

## Built-in evaluators


Built-in evaluators are pre-configured solutions that use Large Language Models (LLMs) as judges to evaluate agent performance. These evaluators come with predefined configurations, including carefully crafted prompt templates, selected evaluator models, and standardized scoring criteria.

Built-in evaluators are designed to address common evaluation needs while ensuring consistency and reliability across assessments. Because they are part of our fully managed offering, you can use them immediately without any additional configuration, and we will continue improving their quality and adding new evaluators over time. To preserve consistency and reliability, the configurations of built-in evaluators cannot be modified.

## Custom evaluators


Custom evaluators offer more flexibility by allowing you to define all aspects of your evaluation process. AgentCore Evaluations supports two types of custom evaluators:
+  **LLM-as-a-judge evaluators** – Define your own evaluator model, evaluation instructions, and scoring schemas. You can tailor the evaluation to your specific needs by selecting the evaluator model, crafting custom evaluation instructions, defining specific evaluation criteria, and designing your own scoring schema. For more information, see [Custom evaluators](custom-evaluators.md).
+  **Code-based evaluators** – Use your own AWS Lambda function to programmatically evaluate agent performance. This approach gives you full control over the evaluation logic, enabling deterministic checks, external API calls, regex matching, custom metrics, or any business-specific rules without relying on an LLM judge. For more information, see [Custom code-based evaluator](code-based-evaluators.md).

This level of customization is particularly valuable when you need to evaluate domain-specific agents, apply unique quality standards, or implement specialized scoring systems. For example, you might create custom evaluation criteria for specific industries like healthcare or finance, or design scoring schemas that align with your organization’s quality metrics.
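As a sketch of the code-based style, a Lambda handler might apply a deterministic rule over the spans it receives, for example checking that every tool call succeeded within a latency budget. The event shape and field names below are hypothetical, not the service's actual contract:

```python
def lambda_handler(event, context):
    """Hypothetical code-based evaluator: scores a trace by the fraction of
    tool-call spans that succeeded within a 2-second latency budget.
    The event structure here is illustrative only."""
    spans = event.get("spans", [])
    tool_calls = [s for s in spans if s.get("tool_name")]
    if not tool_calls:
        return {"score": 0.0, "label": "no_tool_calls"}
    ok = [s for s in tool_calls
          if s.get("status") == "OK" and s.get("duration_ms", 0) <= 2000]
    score = len(ok) / len(tool_calls)
    return {"score": score, "label": "pass" if score == 1.0 else "fail"}

# Local smoke test with a fabricated event
event = {"spans": [
    {"tool_name": "search", "status": "OK", "duration_ms": 120},
    {"tool_name": "lookup", "status": "ERROR", "duration_ms": 90},
]}
print(lambda_handler(event, None))  # → {'score': 0.5, 'label': 'fail'}
```

Because the logic is ordinary code, it is fully deterministic and auditable, which is the main trade-off against LLM-as-a-judge evaluators.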

# Evaluation types

AgentCore Evaluations provides two evaluation types, which differ in when and how the evaluation is performed:

**Topics**
+ [Online evaluation](#online-evaluation-type)
+ [On-demand evaluation](#on-demand-evaluation-type)

## Online evaluation


Online evaluation continuously monitors the quality of deployed agents using live production traffic. Unlike one-off evaluation in development environments, it provides continuous performance assessment across multiple criteria, enabling persistent monitoring in production.

Online evaluation consists of three main components. First, session sampling and filtering lets you configure specific rules for which agent interactions to evaluate. You can set percentage-based sampling to evaluate a portion of all sessions (for example, 10%) or define conditional filters for more targeted evaluation. Second, you can choose from multiple evaluation methods, including creating new [Custom evaluators](custom-evaluators.md), using existing custom evaluators, or selecting from [Built-in evaluators](evaluators.md#built-in-evaluators). Finally, the monitoring and analysis capabilities let you view aggregated scores in dashboards, track quality trends over time, investigate low-scoring sessions, and analyze complete interaction flows from input to output.
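One common way to implement percentage-based sampling is to hash the session ID into a bucket, so a given session is always consistently kept or dropped. This is an illustrative technique only; the document does not specify the service's actual sampling mechanism:

```python
import hashlib

def sample_session(session_id: str, rate_percent: float) -> bool:
    """Deterministic percentage sampling: hash the session ID into a value
    in [0, 100) and keep the session if it falls under the rate.
    Illustrative sketch, not the service's actual algorithm."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100  # 0.00-99.99
    return bucket < rate_percent

# At a 10% rate, roughly one in ten sessions is selected for evaluation
kept = [sid for sid in (f"session-{i}" for i in range(1000))
        if sample_session(sid, 10.0)]
print(f"kept {len(kept)} of 1000 sessions at a 10% rate")
```

Hashing rather than random sampling means a re-run over the same traffic selects the same sessions, which keeps dashboards and trend lines stable.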

## On-demand evaluation


On-demand evaluation provides a flexible way to evaluate specific agent interactions by directly analyzing a chosen set of spans. Unlike online evaluation, which continuously monitors production traffic, on-demand evaluation lets you perform targeted assessments of selected interactions at any time.

With on-demand evaluation, you specify the exact spans or traces you want to evaluate by providing their span or trace IDs. You can then apply the same comprehensive evaluation methods available in online evaluation, including [Custom evaluators](custom-evaluators.md) or [Built-in evaluators](evaluators.md#built-in-evaluators). This evaluation type is particularly useful when you need to try out your own custom evaluator, investigate specific customer interactions, validate fixes for reported issues, or analyze historical data for quality improvements. Once you submit the evaluation request, the service processes only the spans and traces you specify and returns detailed results for your analysis.
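Selecting spans by ID for a targeted request can be sketched as a simple filter over stored telemetry. The span records below are illustrative dictionaries, not a service schema:

```python
def select_spans(all_spans: list[dict], requested_ids: set[str]) -> list[dict]:
    """Pick only the spans named in an on-demand evaluation request,
    failing fast on IDs that do not exist. Illustrative sketch only."""
    missing = requested_ids - {s["span_id"] for s in all_spans}
    if missing:
        raise KeyError(f"unknown span IDs: {sorted(missing)}")
    return [s for s in all_spans if s["span_id"] in requested_ids]

# Hypothetical stored telemetry and a two-span request
store = [{"span_id": "a1", "name": "model_invocation"},
         {"span_id": "b2", "name": "tool_call"},
         {"span_id": "c3", "name": "model_invocation"}]
picked = select_spans(store, {"a1", "c3"})
print([s["span_id"] for s in picked])  # → ['a1', 'c3']
```

Validating the requested IDs up front mirrors the targeted nature of on-demand evaluation: only the named interactions are processed, and a typo in an ID surfaces as an error rather than a silently smaller result set.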

This evaluation type complements online evaluation by offering precise control over which interactions to assess, making it an effective tool for focused quality analysis and issue investigation. It is also well suited for early stages of the agent development lifecycle, such as build-time testing.