Dataset evaluations
Dataset evaluations let you run your agent against a predefined set of scenarios and
automatically evaluate the results. Instead of manually invoking your agent and collecting
spans, the OnDemandEvaluationDatasetRunner from the AgentCore SDK orchestrates the
entire lifecycle — invoke the agent, wait for telemetry ingestion, collect spans, and call
the Evaluate API — in a single run() call.
This is useful for regression testing, benchmark datasets, and CI/CD pipelines where you want to evaluate agent quality across many scenarios automatically.
Note
Dataset evaluations support all AgentCore evaluators — all built-in evaluators across session, trace, and tool-call levels, as well as custom evaluators. The runner automatically handles level-aware request construction, batching, and ground truth mapping for whichever evaluators you configure.
How it works
The runner processes scenarios in three phases:

1. **Invoke** — All scenarios run concurrently using a thread pool. Each scenario gets a unique session ID, and turns within a scenario execute sequentially to maintain conversation context.
2. **Wait** — A configurable delay (default: 180 seconds) allows CloudWatch to ingest the telemetry data. This delay is paid once, not per scenario.
3. **Evaluate** — Spans are collected from CloudWatch and evaluation requests are built for each evaluator. Ground truth fields from the dataset (`expected_response`, `assertions`, `expected_trajectory`) are automatically mapped to the correct API reference inputs.
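The three phases above can be sketched as a plain-Python skeleton. This is an illustration of the control flow only, with hypothetical helper signatures — the real runner also handles batching, level-aware request construction, and error handling:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from uuid import uuid4

def run_dataset(scenarios, invoke_turn, collect_spans, evaluate,
                delay_seconds=180, max_workers=5):
    """Illustrative control flow: invoke concurrently, wait once, then evaluate."""
    # Phase 1: Invoke — scenarios in parallel, turns within a scenario in order.
    def run_scenario(scenario):
        session_id = str(uuid4())  # unique session per scenario
        for turn in scenario["turns"]:
            invoke_turn(turn["input"], session_id)
        return scenario["scenario_id"], session_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sessions = list(pool.map(run_scenario, scenarios))

    # Phase 2: Wait — a single delay for telemetry ingestion, not one per scenario.
    time.sleep(delay_seconds)

    # Phase 3: Evaluate — collect spans per session and evaluate them.
    return {
        scenario_id: evaluate(collect_spans(session_id))
        for scenario_id, session_id in sessions
    }
```

Note how the delay sits between the thread pool and the evaluation loop, which is why it is paid once for the whole dataset.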
Prerequisites

- Python 3.10+
- An agent deployed on AgentCore Runtime with observability enabled, or an agent built with a supported framework configured with AgentCore Observability. Supported frameworks:
  - Strands Agents
  - LangGraph with `opentelemetry-instrumentation-langchain` or `openinference-instrumentation-langchain`
- Transaction Search enabled in CloudWatch — see Enable Transaction Search
- The AgentCore SDK installed: `pip install bedrock-agentcore`
- AWS credentials configured with permissions for `bedrock-agentcore`, `bedrock-agentcore-control`, and `logs` (CloudWatch)

The following constants are used throughout the examples. Replace them with your own values:

```python
REGION = "<region-code>"
AGENT_ARN = "arn:aws:bedrock-agentcore:<region-code>:<account-id>:runtime/<agent-id>"
LOG_GROUP = "/aws/bedrock-agentcore/runtimes/<agent-id>-DEFAULT"
```
Dataset schema
A dataset contains one or more scenarios. Each scenario represents a conversation (session) with the agent. Scenarios can be single-turn or multi-turn.
```json
{
  "scenarios": [
    {
      "scenario_id": "math-question",
      "turns": [
        { "input": "What is 15 + 27?", "expected_response": "15 + 27 = 42" }
      ],
      "expected_trajectory": ["calculator"],
      "assertions": ["Agent used the calculator tool to compute the result"]
    }
  ]
}
```
Scenario fields:

| Field | Required | Scope | Description |
|---|---|---|---|
| `scenario_id` | Yes | — | Unique identifier for the scenario. |
| `turns` | Yes | — | List of turns in the conversation. Each turn has `input` (required) and `expected_response` (optional). |
| `expected_trajectory` | No | Session | Expected sequence of tool names. Used by trajectory evaluators. |
| `assertions` | No | Session | Natural language assertions about expected behavior. Used by `Builtin.GoalSuccessRate`. |
Turn fields:

| Field | Required | Description |
|---|---|---|
| `input` | Yes | The prompt sent to the agent for this turn. Can be a string or a dict. |
| `expected_response` | No | The expected agent response for this turn. Mapped positionally to the trace produced by this turn. |
The runner automatically maps dataset fields to the Evaluate API's `evaluationReferenceInputs`:

- `expected_response` on each turn maps positionally to traces — turn 0 → trace 0, turn 1 → trace 1, and so on.
- `assertions` and `expected_trajectory` are scoped to the session level.
- If no ground truth fields are present, `evaluationReferenceInputs` is omitted from the API request.
Single-turn example
A single-turn dataset has one turn per scenario. This is the simplest form — each scenario sends one prompt and checks the response.
Save the following as dataset.json:
```json
{
  "scenarios": [
    {
      "scenario_id": "math-question",
      "turns": [
        { "input": "What is 15 + 27?", "expected_response": "15 + 27 = 42" }
      ],
      "expected_trajectory": ["calculator"],
      "assertions": ["Agent used the calculator tool to compute the result"]
    },
    {
      "scenario_id": "weather-check",
      "turns": [
        { "input": "What's the weather?", "expected_response": "The weather is sunny" }
      ],
      "expected_trajectory": ["weather"],
      "assertions": ["Agent used the weather tool"]
    }
  ]
}
```
Run the evaluation:
```python
import json

import boto3

from bedrock_agentcore.evaluation import (
    OnDemandEvaluationDatasetRunner,
    EvaluationRunConfig,
    EvaluatorConfig,
    FileDatasetProvider,
    CloudWatchAgentSpanCollector,
    AgentInvokerInput,
    AgentInvokerOutput,
)

# Load dataset
dataset = FileDatasetProvider("dataset.json").get_dataset()

# Define the agent invoker
agentcore_client = boto3.client("bedrock-agentcore", region_name=REGION)

def agent_invoker(invoker_input: AgentInvokerInput) -> AgentInvokerOutput:
    payload = invoker_input.payload
    if isinstance(payload, str):
        payload = json.dumps({"prompt": payload}).encode()
    elif isinstance(payload, dict):
        payload = json.dumps(payload).encode()
    response = agentcore_client.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN,
        runtimeSessionId=invoker_input.session_id,
        payload=payload,
    )
    response_body = response["response"].read()
    return AgentInvokerOutput(agent_output=json.loads(response_body))

# Create span collector
span_collector = CloudWatchAgentSpanCollector(
    log_group_name=LOG_GROUP,
    region=REGION,
)

# Configure evaluators
config = EvaluationRunConfig(
    evaluator_config=EvaluatorConfig(
        evaluator_ids=[
            "Builtin.GoalSuccessRate",
            "Builtin.TrajectoryExactOrderMatch",
            "Builtin.TrajectoryInOrderMatch",
            "Builtin.TrajectoryAnyOrderMatch",
            "Builtin.Correctness",
            "Builtin.Helpfulness",
            "Builtin.ToolSelectionAccuracy",
        ],
    ),
    evaluation_delay_seconds=180,
    max_concurrent_scenarios=5,
)

# Run
runner = OnDemandEvaluationDatasetRunner(region=REGION)
result = runner.run(
    agent_invoker=agent_invoker,
    dataset=dataset,
    span_collector=span_collector,
    config=config,
)
print(f"Completed: {len(result.scenario_results)} scenario(s)")
```
Process results:
```python
for scenario in result.scenario_results:
    print(f"\nScenario: {scenario.scenario_id} ({scenario.status})")
    if scenario.error:
        print(f"  Error: {scenario.error}")
        continue
    for evaluator in scenario.evaluator_results:
        print(f"  {evaluator.evaluator_id}:")
        for r in evaluator.results:
            print(f"    Score: {r.get('value')}, Label: {r.get('label')}")
            ignored = r.get("ignoredReferenceInputFields", [])
            if ignored:
                print(f"    Ignored fields: {ignored}")
```
To save results to a file:
```python
with open("results.json", "w") as f:
    f.write(result.model_dump_json(indent=2))
```
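For regression testing you often want a single number per evaluator rather than per-scenario output. A minimal aggregation sketch, operating on plain dicts shaped like the result structure (swap in the real `ScenarioResult` objects when using the SDK):

```python
from collections import defaultdict

def average_scores(scenario_results: list[dict]) -> dict[str, float]:
    """Average numeric scores per evaluator across COMPLETED scenarios."""
    totals: dict[str, list[float]] = defaultdict(list)
    for scenario in scenario_results:
        if scenario.get("status") != "COMPLETED":
            continue  # skip scenarios that failed structurally
        for evaluator in scenario.get("evaluator_results", []):
            for r in evaluator.get("results", []):
                value = r.get("value")
                if isinstance(value, (int, float)):
                    totals[evaluator["evaluator_id"]].append(float(value))
    return {eid: sum(v) / len(v) for eid, v in totals.items()}
```

In a CI pipeline you could assert that each average stays above a chosen threshold and fail the build otherwise.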
Multi-turn example
Multi-turn scenarios have multiple turns per scenario. Turns execute sequentially within the same session, maintaining conversation context. Each turn can have its own `expected_response`, while `assertions` and `expected_trajectory` apply to the entire session.
Save the following as multi_turn_dataset.json:
```json
{
  "scenarios": [
    {
      "scenario_id": "math-then-weather",
      "turns": [
        { "input": "What is 15 + 27?", "expected_response": "15 + 27 = 42" },
        { "input": "What's the weather?", "expected_response": "The weather is sunny" }
      ],
      "expected_trajectory": ["calculator", "weather"],
      "assertions": [
        "Agent used the calculator tool for the math question",
        "Agent used the weather tool when asked about weather"
      ]
    }
  ]
}
```
Run the evaluation:
```python
dataset = FileDatasetProvider("multi_turn_dataset.json").get_dataset()

result = runner.run(
    agent_invoker=agent_invoker,
    dataset=dataset,
    span_collector=span_collector,
    config=config,
)

for scenario in result.scenario_results:
    print(f"Scenario: {scenario.scenario_id} ({scenario.status})")
    for evaluator in scenario.evaluator_results:
        for r in evaluator.results:
            trace = r.get("context", {}).get("spanContext", {}).get("traceId", "session")
            print(f"  {evaluator.evaluator_id} [{trace}]: {r.get('value')} ({r.get('label')})")
```
Inline dataset construction
Instead of loading from a JSON file, you can construct datasets directly in Python:
```python
from bedrock_agentcore.evaluation import Dataset, PredefinedScenario, Turn

dataset = Dataset(
    scenarios=[
        PredefinedScenario(
            scenario_id="math-question",
            turns=[
                Turn(
                    input="What is 15 + 27?",
                    expected_response="15 + 27 = 42",
                ),
            ],
            expected_trajectory=["calculator"],
            assertions=["Agent used the calculator tool"],
        ),
        PredefinedScenario(
            scenario_id="weather-check",
            turns=[
                Turn(input="What's the weather?"),
            ],
            expected_trajectory=["weather"],
        ),
    ]
)
```
Components reference
The runner requires four components:
Agent invoker
A `Callable[[AgentInvokerInput], AgentInvokerOutput]` that invokes your agent for a single turn. The runner calls this once per turn in each scenario.
| Field | Type | Description |
|---|---|---|
| `AgentInvokerInput.payload` | `str` or `dict` | The turn input from the dataset. |
| `AgentInvokerInput.session_id` | `str` | Stable across all turns in a scenario. Pass this to your agent to maintain conversation context. |
| `AgentInvokerOutput.agent_output` | `Any` | The agent's response. |
The invoker is framework-agnostic — you can call your agent via boto3 `invoke_agent_runtime`, a direct function call, HTTP request, or any other method.
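As an illustration of the direct-function-call case, here is a minimal invoker that wraps a local agent function. The `AgentInvokerInput`/`AgentInvokerOutput` dataclasses below are simplified stand-ins for the SDK's types (defined here only so the sketch runs without the SDK), and `local_agent` is a hypothetical placeholder for your own agent:

```python
from dataclasses import dataclass
from typing import Any

# Stand-ins mirroring the SDK's invoker input/output types.
@dataclass
class AgentInvokerInput:
    payload: Any
    session_id: str

@dataclass
class AgentInvokerOutput:
    agent_output: Any

# Hypothetical local agent: any callable taking a prompt and session ID.
def local_agent(prompt: str, session_id: str) -> str:
    return f"echo[{session_id}]: {prompt}"

def agent_invoker(invoker_input: AgentInvokerInput) -> AgentInvokerOutput:
    # Accept both string and dict payloads, as the runner may pass either.
    payload = invoker_input.payload
    prompt = payload if isinstance(payload, str) else payload.get("prompt", "")
    return AgentInvokerOutput(
        agent_output=local_agent(prompt, invoker_input.session_id)
    )
```

With a local invoker like this there is no CloudWatch round trip for the invocation itself, but span collection still depends on where your agent emits telemetry.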
Span collector
An `AgentSpanCollector` that retrieves telemetry spans after agent invocation. The SDK ships `CloudWatchAgentSpanCollector`:
```python
from bedrock_agentcore.evaluation import CloudWatchAgentSpanCollector

span_collector = CloudWatchAgentSpanCollector(
    log_group_name="/aws/bedrock-agentcore/runtimes/<agent-id>-DEFAULT",
    region=REGION,
)
```
The collector queries two CloudWatch log groups (`aws/spans` for structural spans and the agent's log group for conversation content), polls until spans appear, and returns them as a flat list.
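The poll-until-spans-appear behavior can be sketched generically. This is an illustration with a hypothetical helper, not the collector's actual implementation:

```python
import time

def poll_until(fetch, timeout_seconds=120, interval_seconds=5):
    """Call fetch() repeatedly until it returns a non-empty list or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        spans = fetch()
        if spans:
            return spans
        if time.monotonic() >= deadline:
            return []  # timed out with no spans
        time.sleep(interval_seconds)
```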
Evaluation config
```python
from bedrock_agentcore.evaluation import EvaluationRunConfig, EvaluatorConfig

config = EvaluationRunConfig(
    evaluator_config=EvaluatorConfig(
        evaluator_ids=["Builtin.Correctness", "Builtin.GoalSuccessRate"],
    ),
    evaluation_delay_seconds=180,  # Wait for CloudWatch ingestion (default: 180)
    max_concurrent_scenarios=5,    # Thread pool size (default: 5)
)
```
| Field | Default | Description |
|---|---|---|
| `evaluator_config.evaluator_ids` | — | List of evaluator IDs (built-in names or custom evaluator IDs). |
| `evaluation_delay_seconds` | 180 | Seconds to wait after invocation for CloudWatch to ingest spans. Set to 0 if using a non-CloudWatch collector. |
| `max_concurrent_scenarios` | 5 | Maximum number of scenarios to invoke and evaluate in parallel. |
Dataset
A `Dataset` loaded from a JSON file via `FileDatasetProvider` or constructed inline. See Dataset schema for the full field reference.
Result structure
The runner returns an `EvaluationResult` with the following structure:

```
EvaluationResult
└── scenario_results: List[ScenarioResult]
    ├── scenario_id: str
    ├── session_id: str
    ├── status: "COMPLETED" | "FAILED"
    ├── error: Optional[str]
    └── evaluator_results: List[EvaluatorResult]
        ├── evaluator_id: str
        └── results: List[Dict]  # Raw API responses
```
Each entry in `results` is a raw response dict from the Evaluate API, containing fields like `value`, `label`, `explanation`, `context`, `tokenUsage`, and `ignoredReferenceInputFields`. See Getting started with on-demand evaluation for the full response format.
A scenario with status `FAILED` means a structural problem occurred (agent invocation error, span collection failure). Individual evaluator errors within a `COMPLETED` scenario are recorded in the evaluator's results list with `errorCode` and `errorMessage` fields.
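A practical pattern is to triage both kinds of failure in one pass before looking at scores. The sketch below operates on plain dicts shaped like the result structure described above (swap in the real result objects when using the SDK):

```python
def collect_errors(scenario_results: list[dict]) -> list[str]:
    """Gather scenario-level and evaluator-level failures as readable messages."""
    errors = []
    for scenario in scenario_results:
        sid = scenario.get("scenario_id", "?")
        # Structural failure: agent invocation or span collection error.
        if scenario.get("status") == "FAILED":
            errors.append(f"{sid}: {scenario.get('error')}")
            continue
        # Evaluator-level errors inside a COMPLETED scenario.
        for evaluator in scenario.get("evaluator_results", []):
            for r in evaluator.get("results", []):
                if "errorCode" in r:
                    errors.append(
                        f"{sid}/{evaluator['evaluator_id']}: "
                        f"{r['errorCode']} - {r.get('errorMessage')}"
                    )
    return errors
```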