Ground truth evaluations
Ground truth is the known correct answer or expected behavior for a given input — the
"gold standard" you compare actual results against. For agent evaluation, ground truth
turns subjective quality assessment into objective measurement, enabling regression
detection, benchmark datasets, and domain-specific correctness checks that generic
evaluators cannot provide on their own.
With ground truth evaluations, you provide reference inputs alongside your session spans
when calling the Evaluate API. The service uses these reference inputs to score your
agent's actual behavior against the expected behavior. Evaluators that don't use a
particular ground truth field ignore it, and the response reports which fields were
not used.
Supported builtin evaluators and ground truth fields
The following table shows which built-in evaluators support ground truth and which fields they use.
Evaluators that support ground truth
| Evaluator | Level | Ground truth field | Description |
| --- | --- | --- | --- |
| Builtin.Correctness | Trace | expectedResponse | Measures how accurately the agent's response matches the expected answer. Uses LLM-as-a-Judge scoring. |
| Builtin.GoalSuccessRate | Session | assertions | Validates whether the agent's behavior satisfies natural language assertions across the entire session. Uses LLM-as-a-Judge scoring. |
| Builtin.TrajectoryExactOrderMatch | Session | expectedTrajectory | Checks that the actual tool call sequence matches the expected sequence exactly — same tools, same order, no extras. Programmatic scoring (no LLM calls). |
| Builtin.TrajectoryInOrderMatch | Session | expectedTrajectory | Checks that all expected tools appear in order within the actual sequence, but allows extra tools between them. Programmatic scoring. |
| Builtin.TrajectoryAnyOrderMatch | Session | expectedTrajectory | Checks that all expected tools are present in the actual sequence, regardless of order. Extra tools are allowed. Programmatic scoring. |
The following table describes the ground truth fields.
Ground truth fields
| Field | Type | Scope | Description |
| --- | --- | --- | --- |
| expectedResponse | String | Trace | The expected agent response for a specific turn. Scoped to a trace using traceId in the reference input context. |
| assertions | List of strings | Session | Natural language statements that should be true about the agent's behavior across the session. |
| expectedTrajectory | List of tool names | Session | The expected sequence of tool calls for the session. |
Ground truth fields are optional. If you omit them, evaluators fall back to their
ground truth-free mode (for example, Builtin.Correctness still works without
expectedResponse; it evaluates based on the conversation context alone).
You can provide all ground truth fields in a single request. The service picks the
relevant fields for each evaluator and reports ignoredReferenceInputFields in the
response for any fields that were not used.
You don't need to provide expectedResponse for every trace. Traces without ground
truth are evaluated using the ground truth-free variant of the evaluator.
Prerequisites
- Python 3.10+
- An agent deployed on AgentCore Runtime with observability enabled, or an agent built with a supported framework configured with AgentCore Observability
- Transaction Search enabled in CloudWatch — see Enable Transaction Search
- AWS credentials configured with permissions for bedrock-agentcore, bedrock-agentcore-control, and logs (CloudWatch)
For instructions on downloading session spans, see
Getting started with on-demand evaluation.
About the examples
The examples on this page use the sample agent from the
AgentCore Evaluations tutorials.
The agent has two tools — calculator and weather — and is deployed on AgentCore Runtime
with observability enabled.
The examples assume a two-turn session:
- Turn 1: "What is 15 + 27?" — agent uses the calculator tool and responds with the result.
- Turn 2: "What's the weather?" — agent uses the weather tool and responds with the current weather.
Before running evaluations, invoke your agent and wait 2–5 minutes for CloudWatch to ingest the telemetry data.
The following constants are used throughout the examples on this page. Replace them with your own values:
```python
REGION = "<region-code>"
AGENT_ID = "my-agent-id"
SESSION_ID = "my-session-id"
TRACE_ID_1 = "<trace-id-1>"  # Turn 1: "What is 15 + 27?"
TRACE_ID_2 = "<trace-id-2>"  # Turn 2: "What's the weather?"
```
Correctness with expected response
Builtin.Correctness is a trace-level evaluator that measures how accurately the agent's
response matches an expected answer. When you provide expectedResponse, the evaluator
compares the agent's actual response against your ground truth using LLM-as-a-Judge scoring.
- AgentCore SDK
```python
from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs

client = EvaluationClient(region_name=REGION)

# String form — matched against the last trace in the session
results = client.run(
    evaluator_ids=["Builtin.Correctness"],
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    reference_inputs=ReferenceInputs(
        expected_response="The weather is sunny",
    ),
)

for r in results:
    print(f"Trace: {r['context']['spanContext'].get('traceId', 'session')}")
    print(f"Score: {r['value']}, Label: {r['label']}")
```
To target a specific trace, pass expected_response as a dict mapping trace IDs to expected answers:
```python
results = client.run(
    evaluator_ids=["Builtin.Correctness"],
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    reference_inputs=ReferenceInputs(
        expected_response={
            TRACE_ID_1: "15 + 27 = 42",
            TRACE_ID_2: "The weather is sunny",
        },
    ),
)
```
- AgentCore CLI
```bash
# Expected response matched against the last trace
agentcore run eval \
  --agent AGENT_NAME \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.Correctness" \
  --expected-response "The weather is sunny"

# Target a specific trace
agentcore run eval \
  --agent AGENT_NAME \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.Correctness" \
  --trace-id TRACE_ID_1 \
  --expected-response "15 + 27 = 42"

# ARN mode — evaluate an agent outside the CLI project
agentcore run eval \
  --runtime-arn arn:aws:bedrock-agentcore:<region-code>:<account-id>:runtime/<agent-id> \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.Correctness" \
  --expected-response "The weather is sunny"
```
- Starter Toolkit SDK
```python
from bedrock_agentcore_starter_toolkit import Evaluation, ReferenceInputs

eval_client = Evaluation(region=REGION)

# String form — matched against the last trace
results = eval_client.run(
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    evaluators=["Builtin.Correctness"],
    reference_inputs=ReferenceInputs(
        expected_response="The weather is sunny",
    ),
)

for r in results.get_successful_results():
    print(f"Score: {r.value:.2f}, Label: {r.label}")
```
To target a specific trace, pass a tuple of (trace_id, expected_response):
```python
results = eval_client.run(
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    evaluators=["Builtin.Correctness"],
    reference_inputs=ReferenceInputs(
        expected_response=(TRACE_ID_1, "15 + 27 = 42"),
    ),
)
```
- Starter Toolkit CLI
```bash
# Expected response matched against the last trace
agentcore eval run \
  --agent-id AGENT_ID \
  --session-id SESSION_ID \
  --evaluator "Builtin.Correctness" \
  --expected-response "The weather is sunny"

# Target a specific trace
agentcore eval run \
  --agent-id AGENT_ID \
  --session-id SESSION_ID \
  --trace-id TRACE_ID_1 \
  --evaluator "Builtin.Correctness" \
  --expected-response "15 + 27 = 42"

# Save results to a file
agentcore eval run \
  --agent-id AGENT_ID \
  --session-id SESSION_ID \
  --evaluator "Builtin.Correctness" \
  --expected-response "The weather is sunny" \
  --output results.json
```
- AWS SDK (boto3)
```python
import boto3

client = boto3.client("bedrock-agentcore", region_name=REGION)

# session_spans_and_log_events holds the session spans downloaded from
# CloudWatch (see Getting started with on-demand evaluation)
response = client.evaluate(
    evaluatorId="Builtin.Correctness",
    evaluationInput={"sessionSpans": session_spans_and_log_events},
    evaluationReferenceInputs=[
        {
            "context": {
                "spanContext": {
                    "sessionId": SESSION_ID,
                    "traceId": TRACE_ID_1
                }
            },
            "expectedResponse": {"text": "15 + 27 = 42"}
        },
        {
            "context": {
                "spanContext": {
                    "sessionId": SESSION_ID,
                    "traceId": TRACE_ID_2
                }
            },
            "expectedResponse": {"text": "The weather is sunny"}
        }
    ]
)

for result in response["evaluationResults"]:
    print(f"Score: {result['value']}, Label: {result['label']}")
```
GoalSuccessRate with assertions
Builtin.GoalSuccessRate is a session-level evaluator that validates whether the agent's
behavior satisfies a set of natural language assertions. Assertions can check tool usage,
response content, ordering of actions, or any other observable behavior across the entire
conversation.
The examples below use assertions that validate tool usage, but assertions
are free-form natural language — you can use them to assert on any aspect of agent
behavior, such as response tone, factual accuracy, safety compliance, or business
logic.
- AgentCore SDK
```python
from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs

client = EvaluationClient(region_name=REGION)

results = client.run(
    evaluator_ids=["Builtin.GoalSuccessRate"],
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    reference_inputs=ReferenceInputs(
        assertions=[
            "Agent used the calculator tool to compute the result",
            "Agent returned the correct numerical answer of 42",
            "Agent used the weather tool when asked about weather",
        ],
    ),
)

for r in results:
    print(f"Score: {r['value']}, Label: {r['label']}")
    print(f"Explanation: {r['explanation'][:200]}")
```
- AgentCore CLI
```bash
agentcore run eval \
  --agent AGENT_NAME \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.GoalSuccessRate" \
  --assertion "Agent used the calculator tool to compute the result" \
  --assertion "Agent returned the correct numerical answer of 42" \
  --assertion "Agent used the weather tool when asked about weather"

# ARN mode — evaluate an agent outside the CLI project
agentcore run eval \
  --runtime-arn arn:aws:bedrock-agentcore:<region-code>:<account-id>:runtime/<agent-id> \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.GoalSuccessRate" \
  --assertion "Agent used the calculator tool to compute the result" \
  --assertion "Agent returned the correct numerical answer of 42"
```
- Starter Toolkit SDK
```python
from bedrock_agentcore_starter_toolkit import Evaluation, ReferenceInputs

eval_client = Evaluation(region=REGION)

results = eval_client.run(
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    evaluators=["Builtin.GoalSuccessRate"],
    reference_inputs=ReferenceInputs(
        assertions=[
            "Agent used the calculator tool to compute the result",
            "Agent returned the correct numerical answer of 42",
            "Agent used the weather tool when asked about weather",
        ],
    ),
)

for r in results.get_successful_results():
    print(f"Score: {r.value:.2f}, Label: {r.label}")
```
- Starter Toolkit CLI
```bash
agentcore eval run \
  --agent-id AGENT_ID \
  --session-id SESSION_ID \
  --evaluator "Builtin.GoalSuccessRate" \
  --assertion "Agent used the calculator tool to compute the result" \
  --assertion "Agent returned the correct numerical answer of 42" \
  --assertion "Agent used the weather tool when asked about weather"
```
- AWS SDK (boto3)
```python
import boto3

client = boto3.client("bedrock-agentcore", region_name=REGION)

response = client.evaluate(
    evaluatorId="Builtin.GoalSuccessRate",
    evaluationInput={"sessionSpans": session_spans_and_log_events},
    evaluationReferenceInputs=[
        {
            "context": {
                "spanContext": {
                    "sessionId": SESSION_ID
                }
            },
            "assertions": [
                {"text": "Agent used the calculator tool to compute the result"},
                {"text": "Agent returned the correct numerical answer of 42"},
                {"text": "Agent used the weather tool when asked about weather"}
            ]
        }
    ]
)

for result in response["evaluationResults"]:
    print(f"Score: {result['value']}, Label: {result['label']}")
```
Trajectory matching with expected trajectory
The trajectory evaluators compare the agent's actual tool call sequence against an
expected sequence of tool names. Three variants are available, each with different
matching strictness. All three are session-level evaluators and use programmatic scoring
(no LLM calls, so token usage is zero).
Trajectory matching variants
| Evaluator | Matching rule | Example |
| --- | --- | --- |
| Builtin.TrajectoryExactOrderMatch | Actual must match expected exactly — same tools, same order, no extras | Expected: [calculator, weather]. Actual: [calculator, weather] → Pass. Actual: [calculator, weather, calculator] → Fail. |
| Builtin.TrajectoryInOrderMatch | Expected tools must appear in order, but extra tools are allowed between them | Expected: [calculator, weather]. Actual: [calculator, some_tool, weather] → Pass. |
| Builtin.TrajectoryAnyOrderMatch | All expected tools must be present, order doesn't matter, extras allowed | Expected: [calculator, weather]. Actual: [weather, calculator] → Pass. |
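The three matching rules can be modeled in a few lines of plain Python. This is an illustrative sketch of the scoring semantics described above, not the service's implementation:

```python
from collections import Counter

def exact_order_match(expected, actual):
    # Same tools, same order, no extras.
    return list(expected) == list(actual)

def in_order_match(expected, actual):
    # Expected tools appear in order; extra tools may occur between them
    # (a subsequence check using a shared iterator).
    it = iter(actual)
    return all(tool in it for tool in expected)

def any_order_match(expected, actual):
    # Every expected tool is present; order and extras are ignored.
    return not (Counter(expected) - Counter(actual))

expected = ["calculator", "weather"]
print(exact_order_match(expected, ["calculator", "weather"]))                # True
print(exact_order_match(expected, ["calculator", "weather", "calculator"]))  # False
print(in_order_match(expected, ["calculator", "some_tool", "weather"]))      # True
print(any_order_match(expected, ["weather", "calculator"]))                  # True
```

Note how the rules are strictly nested: any trajectory that passes the exact-order check also passes the in-order and any-order checks.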
- AgentCore SDK
```python
from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs

client = EvaluationClient(region_name=REGION)

results = client.run(
    evaluator_ids=[
        "Builtin.TrajectoryExactOrderMatch",
        "Builtin.TrajectoryInOrderMatch",
        "Builtin.TrajectoryAnyOrderMatch",
    ],
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    reference_inputs=ReferenceInputs(
        expected_trajectory=["calculator", "weather"],
    ),
)

for r in results:
    print(f"{r['evaluatorId']}: {r['value']} ({r['label']})")
    print(f"  {r['explanation'][:150]}")
```
- AgentCore CLI
Tool names are passed as a comma-separated list:

```bash
agentcore run eval \
  --agent AGENT_NAME \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.TrajectoryExactOrderMatch" \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.TrajectoryInOrderMatch" \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.TrajectoryAnyOrderMatch" \
  --expected-trajectory "calculator,weather"

# ARN mode — evaluate an agent outside the CLI project
agentcore run eval \
  --runtime-arn arn:aws:bedrock-agentcore:<region-code>:<account-id>:runtime/<agent-id> \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.TrajectoryExactOrderMatch" \
  --expected-trajectory "calculator,weather"
```
- Starter Toolkit SDK
```python
from bedrock_agentcore_starter_toolkit import Evaluation, ReferenceInputs

eval_client = Evaluation(region=REGION)

results = eval_client.run(
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    evaluators=[
        "Builtin.TrajectoryExactOrderMatch",
        "Builtin.TrajectoryInOrderMatch",
        "Builtin.TrajectoryAnyOrderMatch",
    ],
    reference_inputs=ReferenceInputs(
        expected_trajectory=["calculator", "weather"],
    ),
)

for r in results.get_successful_results():
    print(f"{r.evaluator_name}: {r.value:.2f} ({r.label})")
```
- Starter Toolkit CLI
Tool names are passed as a comma-separated list:

```bash
agentcore eval run \
  --agent-id AGENT_ID \
  --session-id SESSION_ID \
  --evaluator "Builtin.TrajectoryExactOrderMatch" \
  --evaluator "Builtin.TrajectoryInOrderMatch" \
  --evaluator "Builtin.TrajectoryAnyOrderMatch" \
  --expected-trajectory "calculator,weather"
```
- AWS SDK (boto3)
```python
import boto3

client = boto3.client("bedrock-agentcore", region_name=REGION)

for evaluator in [
    "Builtin.TrajectoryExactOrderMatch",
    "Builtin.TrajectoryInOrderMatch",
    "Builtin.TrajectoryAnyOrderMatch",
]:
    response = client.evaluate(
        evaluatorId=evaluator,
        evaluationInput={"sessionSpans": session_spans_and_log_events},
        evaluationReferenceInputs=[
            {
                "context": {
                    "spanContext": {
                        "sessionId": SESSION_ID
                    }
                },
                "expectedTrajectory": {
                    "toolNames": ["calculator", "weather"]
                }
            }
        ]
    )
    for result in response["evaluationResults"]:
        print(f"{result['evaluatorId']}: {result['value']} ({result['label']})")
```
Combining all ground truth fields in one request
You can pass all ground truth fields together in a single evaluation call. The service
routes each field to the appropriate evaluator and ignores fields that a given evaluator
doesn't use. This means you can construct your reference inputs once and reuse them across
different evaluators without modifying the payload.
- AgentCore SDK
```python
from bedrock_agentcore.evaluation import EvaluationClient, ReferenceInputs

client = EvaluationClient(region_name=REGION)

results = client.run(
    evaluator_ids=[
        "Builtin.Correctness",
        "Builtin.GoalSuccessRate",
        "Builtin.TrajectoryExactOrderMatch",
        "Builtin.TrajectoryInOrderMatch",
        "Builtin.TrajectoryAnyOrderMatch",
    ],
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    reference_inputs=ReferenceInputs(
        expected_response="The weather is sunny",
        assertions=[
            "Agent used the calculator tool for math",
            "Agent used the weather tool when asked about weather",
        ],
        expected_trajectory=["calculator", "weather"],
    ),
)

for r in results:
    ignored = r.get("ignoredReferenceInputFields", [])
    print(f"{r['evaluatorId']}: {r['value']} ({r['label']})")
    if ignored:
        print(f"  Ignored fields: {ignored}")
```
- AgentCore CLI
```bash
agentcore run eval \
  --agent AGENT_NAME \
  --session-id SESSION_ID \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.Correctness" \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.GoalSuccessRate" \
  --evaluator-arn "arn:aws:bedrock-agentcore:::evaluator/Builtin.TrajectoryExactOrderMatch" \
  --assertion "Agent used the calculator tool for math" \
  --assertion "Agent used the weather tool when asked about weather" \
  --expected-trajectory "calculator,weather" \
  --expected-response "The weather is sunny" \
  --output results.json
```
- Starter Toolkit SDK
```python
from bedrock_agentcore_starter_toolkit import Evaluation, ReferenceInputs

eval_client = Evaluation(region=REGION)

results = eval_client.run(
    agent_id=AGENT_ID,
    session_id=SESSION_ID,
    evaluators=[
        "Builtin.Correctness",
        "Builtin.GoalSuccessRate",
        "Builtin.TrajectoryExactOrderMatch",
        "Builtin.TrajectoryInOrderMatch",
        "Builtin.TrajectoryAnyOrderMatch",
    ],
    reference_inputs=ReferenceInputs(
        expected_response="The weather is sunny",
        assertions=[
            "Agent used the calculator tool for math",
            "Agent used the weather tool when asked about weather",
        ],
        expected_trajectory=["calculator", "weather"],
    ),
)

for r in results.get_successful_results():
    print(f"{r.evaluator_name}: {r.value:.2f} ({r.label})")
```
- AWS SDK (boto3)
```python
import boto3

client = boto3.client("bedrock-agentcore", region_name=REGION)

reference_inputs = [
    {
        "context": {
            "spanContext": {"sessionId": SESSION_ID}
        },
        "assertions": [
            {"text": "Agent used the calculator tool for math"},
            {"text": "Agent used the weather tool when asked about weather"}
        ],
        "expectedTrajectory": {
            "toolNames": ["calculator", "weather"]
        }
    },
    {
        "context": {
            "spanContext": {
                "sessionId": SESSION_ID,
                "traceId": TRACE_ID_2
            }
        },
        "expectedResponse": {"text": "The weather is sunny"}
    }
]

for evaluator in ["Builtin.Correctness", "Builtin.GoalSuccessRate",
                  "Builtin.TrajectoryExactOrderMatch"]:
    response = client.evaluate(
        evaluatorId=evaluator,
        evaluationInput={"sessionSpans": session_spans_and_log_events},
        evaluationReferenceInputs=reference_inputs
    )
    for result in response["evaluationResults"]:
        ignored = result.get("ignoredReferenceInputFields", [])
        print(f"{result['evaluatorId']}: {result['value']} ({result['label']})")
        if ignored:
            print(f"  Ignored fields: {ignored}")
```
Understanding ignored reference input fields
When you provide ground truth fields that an evaluator doesn't use, the response includes
an ignoredReferenceInputFields array listing the unused fields. This is informational,
not an error — the evaluation still completes successfully.
For example, if you call Builtin.Helpfulness with expectedResponse provided, the
evaluator ignores the ground truth (Helpfulness doesn't use it) and returns:
```json
{
  "evaluatorId": "Builtin.Helpfulness",
  "value": 0.83,
  "label": "Very Helpful",
  "explanation": "...",
  "ignoredReferenceInputFields": ["expectedResponse"]
}
```
This behavior is by design — it allows you to construct a single set of reference inputs
and use them across multiple evaluators without adjusting the payload for each one.
Ground truth in custom evaluators
Custom evaluators can use ground truth fields through placeholders in their evaluation
instructions. When you create a custom evaluator, you can reference the following placeholders:
- Session-level custom evaluators: {context}, {available_tools}, {actual_tool_trajectory}, {expected_tool_trajectory}, {assertions}
- Trace-level custom evaluators: {context}, {assistant_turn}, {expected_response}
For example, a custom trace-level evaluator that checks response similarity might use:
```
Compare the agent's response with the expected response.
Agent response: {assistant_turn}
Expected response: {expected_response}
Rate how closely the agent's response matches the expected response on a scale of 0 to 1.
```
When this evaluator is called with expectedResponse in the reference inputs, the
service substitutes the placeholder with the actual ground truth value before scoring.
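Conceptually, the substitution behaves like ordinary template formatting. The following sketch is hypothetical (the service performs this step internally and its mechanism is not exposed), but it illustrates the idea:

```python
# Hypothetical sketch: the service fills the evaluator's instruction
# template with the ground truth values before invoking the LLM judge.
instructions = (
    "Compare the agent's response with the expected response.\n"
    "Agent response: {assistant_turn}\n"
    "Expected response: {expected_response}\n"
    "Rate how closely the agent's response matches the expected response "
    "on a scale of 0 to 1."
)

prompt = instructions.format(
    assistant_turn="The weather is sunny",     # taken from the trace
    expected_response="The weather is sunny",  # taken from expectedResponse
)
print(prompt)
```

After substitution, the judge sees only concrete text; no placeholder syntax remains in the prompt.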
For details on creating custom evaluators, see
Custom evaluators.
Custom evaluators that use ground truth placeholders ({assertions},
{expected_response}, {expected_tool_trajectory}) cannot be used in online evaluation
configurations, because online evaluations monitor live production traffic
where ground truth values are not available.