

# Dataset evaluations
<a name="dataset-evaluations"></a>

Dataset evaluations let you run your agent against a set of scenarios and automatically evaluate the results. Instead of manually invoking your agent, collecting spans, and calling the Evaluate API, a dataset runner orchestrates the entire lifecycle in a single call: invoke the agent, wait for telemetry ingestion, and evaluate.

Dataset evaluations are useful for regression testing, benchmarking, CI/CD pipelines, baseline measurement, and before/after comparisons following configuration changes.
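
The invoke, wait, evaluate lifecycle described above can be sketched in plain Python. This is a minimal illustration of the orchestration pattern, not the SDK's implementation: `run_dataset`, `ScenarioResult`, `invoke_agent`, and `evaluate_session` are hypothetical names introduced here, and the scenario dictionaries do not follow the real dataset schema.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ScenarioResult:
    """Scores for one scenario, keyed by evaluator name (illustrative type)."""
    scenario_id: str
    scores: dict[str, float] = field(default_factory=dict)


def run_dataset(
    scenarios: list[dict],
    evaluator_names: list[str],
    invoke_agent: Callable[[dict], str],            # hypothetical: returns a session ID
    evaluate_session: Callable[[str, str], float],  # hypothetical: (evaluator, session) -> score
    ingestion_delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[ScenarioResult]:
    # Phase 1 (invoke): run every scenario and remember the session it produced.
    sessions = {s["id"]: invoke_agent(s) for s in scenarios}

    # Phase 2 (wait): give telemetry time to be ingested before reading spans.
    sleep(ingestion_delay_seconds)

    # Phase 3 (evaluate): one evaluation per evaluator per scenario.
    results = []
    for scenario_id, session_id in sessions.items():
        scores = {name: evaluate_session(name, session_id) for name in evaluator_names}
        results.append(ScenarioResult(scenario_id, scores))
    return results


if __name__ == "__main__":
    # Stub callables stand in for a real agent and a real evaluator.
    demo = run_dataset(
        scenarios=[{"id": "s1", "input": "hello"}, {"id": "s2", "input": "bye"}],
        evaluator_names=["helpfulness"],
        invoke_agent=lambda s: f"session-{s['id']}",
        evaluate_session=lambda name, session_id: 1.0,
    )
    print(demo)
```

Passing the agent, evaluator, and `sleep` as callables keeps the sketch testable without network access; a real runner would replace them with calls to the agent and the Evaluate API.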

The AgentCore SDK provides two dataset runners that share the same [dataset schema](dataset-evaluations-schema.md) and ground truth format but differ in where evaluation happens:
+  **On-demand dataset runner** (`OnDemandEvaluationDatasetRunner`) — Collects spans and calls the Evaluate API client-side. Best for dev-time iteration and small datasets.
+  **Batch dataset runner** (`BatchEvaluationRunner`) — Delegates span collection and evaluation to the service via the batch evaluation API. Best for large datasets and production baselines.

**Choosing a runner**

| Aspect | On-demand runner | Batch runner | 
| --- | --- | --- | 
| Span collection | SDK-side via `AgentSpanCollector` | Server-side; service reads from CloudWatch directly | 
| Evaluate API calls | SDK calls `evaluate()` once per evaluator per scenario | SDK calls `startBatchEvaluation()` once for the whole dataset | 
| Execution model | Synchronous three-phase pipeline (invoke, wait, evaluate) | Asynchronous four-phase pipeline (invoke, wait, submit, poll) | 
| Results | Structured `EvaluationResult` with per-scenario, per-evaluator detail | Aggregate `BatchEvaluationSummary` with per-evaluator averages, plus per-session detail in CloudWatch | 
| Best for | Dev-time iteration, CI/CD, small datasets, when you need per-scenario detail immediately | Baseline measurement, large datasets, pre/post comparison, when aggregate scores are sufficient | 
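
The batch runner's last two phases (submit, poll) follow a common asynchronous job pattern, sketched below. Everything here is an assumption made for illustration: `submit_and_poll`, `submit_job`, `get_status`, and the status strings `COMPLETED`, `FAILED`, and `IN_PROGRESS` are hypothetical and are not taken from the batch evaluation API.

```python
import time
from typing import Callable

# Hypothetical status strings; the real service may use different values.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def submit_and_poll(
    submit_job: Callable[[], str],       # hypothetical: starts the job, returns its ID
    get_status: Callable[[str], str],    # hypothetical: returns the job's current status
    poll_interval_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    # Submit phase: a single call hands the whole dataset to the service.
    job_id = submit_job()

    # Poll phase: check status until the job finishes or the deadline passes.
    deadline = clock() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_seconds}s")
        sleep(poll_interval_seconds)
```

Because evaluation runs server-side, the client makes one submission and then only issues cheap status checks, which is why this model suits large datasets better than one `evaluate()` call per evaluator per scenario.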

**Prerequisites**

Both runners require:
+ Python 3.10 or later
+ An agent deployed on AgentCore Runtime with observability enabled, or an agent built with a supported framework configured with [AgentCore Observability](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html#observability-configure-3p). Supported frameworks:
  + Strands Agents
  + LangGraph with `opentelemetry-instrumentation-langchain` or `openinference-instrumentation-langchain` 
+ Transaction Search enabled in CloudWatch; see [Enable Transaction Search](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Transaction-Search-getting-started.html) 
+ The AgentCore SDK installed: `pip install bedrock-agentcore` 
+ AWS credentials configured with permissions for `bedrock-agentcore`, `bedrock-agentcore-control`, and `logs` (CloudWatch Logs)
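
Two of the local prerequisites (Python version and SDK installation) can be checked before running either runner. The sketch below uses only the standard library; `check_prerequisites` is a hypothetical helper, and the import name `bedrock_agentcore` is assumed from the `bedrock-agentcore` package name. It does not verify credentials, observability, or Transaction Search.

```python
import importlib.util
import sys


def check_prerequisites(
    min_python: tuple[int, int] = (3, 10),
    module_name: str = "bedrock_agentcore",  # assumed import name for the SDK
) -> list[str]:
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []

    # Compare the running interpreter against the minimum supported version.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required} or later is required, found {found}")

    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"Module '{module_name}' is not installed")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Returning a list instead of raising on the first failure lets a CI job report every missing prerequisite in one pass.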

**Topics**
+ [Dataset schema](dataset-evaluations-schema.md)
+ [On-demand dataset runner](dataset-evaluations-on-demand.md)
+ [Batch dataset runner](dataset-evaluations-batch.md)