A/B testing

A/B testing splits live production traffic between two variants and continuously evaluates each variant's performance, reporting results with statistical significance. The AgentCore Gateway handles traffic routing; your agent code does not change.

A/B testing is the validation step in the AgentCore optimization improvement loop. After generating a recommendation and validating it with offline batch evaluations, you run an A/B test to confirm the change improves performance on live traffic before committing to a full rollout. You can route traffic to separate AgentCore Runtimes (target-based) or deliver different configurations to the same AgentCore Runtime (configuration bundles).

When to use A/B testing

Use A/B testing when you need to:

  • Validate a recommendation before routing all production traffic to the optimized configuration.

  • Compare two model versions (for example, moving from one foundation model to another) on live traffic with statistical rigor.

  • Measure the impact of a prompt change across real user sessions rather than a curated test set.

  • Gradually roll out a new capability (new tools, updated system prompt) by validating it on a subset of live traffic before full deployment.

How it works

An A/B test follows this flow:

  1. You create an A/B test by specifying an AgentCore Gateway, two variants (control and treatment), traffic weights, an online evaluation configuration for scoring, and an IAM execution role. Each variant references either an AgentCore Gateway target or a configuration bundle version.

  2. You start the test. The AgentCore Gateway begins splitting incoming traffic between the two variants based on the runtime session ID. Assignment is sticky; a given session ID always routes to the same variant.

  3. Online evaluation scores each session. The online evaluation configuration you specified runs evaluators against each session as it completes. The A/B test aggregation pipeline maps scores to variants.

  4. The service computes statistical significance. As sample sizes grow, the service calculates per-evaluator metrics for each variant: mean score, absolute and percent change, p-value, confidence interval, and a significance flag. A p-value below 0.05 indicates the difference is statistically significant. You can poll results at any time without affecting statistical validity.

  5. You stop the test and deploy the winner. When results are significant, stop the A/B test and route 100% of traffic to the winning variant. The losing variant stops receiving traffic. The sketch after this procedure shows one way to turn the per-evaluator metrics into that decision.
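
Steps 3 through 5 come down to reading the per-evaluator metrics and deciding when to stop. The sketch below mocks that decision in plain Python; the field names in the results structure (evaluators, pValue, significant, and so on) are illustrative assumptions about the response shape rather than the published API, so adapt them to the results your A/B test actually returns.

    # Hypothetical per-evaluator results, shaped like the metrics described in
    # step 4: mean score per variant, absolute and percent change, p-value,
    # confidence interval, and a significance flag. Field names are assumptions.
    results = {
        "evaluators": [
            {
                "name": "task_completion",
                "control": {"meanScore": 0.78, "sampleSize": 1200},
                "treatment": {"meanScore": 0.84, "sampleSize": 1185},
                "absoluteChange": 0.06,
                "percentChange": 7.7,
                "pValue": 0.012,
                "confidenceInterval": [0.02, 0.10],
                "significant": True,   # p-value below 0.05
            },
        ]
    }

    # Stop the test and deploy the treatment only when every evaluator you care
    # about shows a significant change in the right direction.
    def treatment_wins(results, evaluator_names):
        checked = [e for e in results["evaluators"] if e["name"] in evaluator_names]
        return bool(checked) and all(
            e["significant"] and e["absoluteChange"] > 0 for e in checked
        )

    if treatment_wins(results, {"task_completion"}):
        print("Route 100% of traffic to the treatment variant and stop the test.")
    else:
        print("Keep the test running; the difference is not yet conclusive.")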

A/B test patterns

A/B tests support two variant configuration patterns:

Target-based variants

Use this pattern when the change involves new agent code, a framework upgrade, or a comparison of entirely different agent implementations. Each variant routes to a different AgentCore Gateway target, which points to a different runtime endpoint. A sketch of the variant definitions follows.
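
The sketch below shows a minimal shape for target-based variant definitions; the field names (gatewayTargetId, weight, and so on) are assumptions for illustration, not the published API.

    # Hypothetical variant definitions for a target-based A/B test. Each variant
    # points at a different Gateway target, and each target points at a
    # different AgentCore Runtime endpoint. Field names are assumptions.
    variants = [
        {
            "name": "control",
            "weight": 80,                 # 80% of sessions stay on the current agent
            "gatewayTargetId": "target-agent-current",
        },
        {
            "name": "treatment",
            "weight": 20,                 # 20% of sessions try the new implementation
            "gatewayTargetId": "target-agent-candidate",
        },
    ]

Because assignment is sticky on the runtime session ID, a session that lands on the treatment target keeps routing there for its entire lifetime.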

Configuration bundle variants

Use this pattern when the change is purely configuration (system prompt, model ID, or tool descriptions). Both variants run on the same AgentCore Runtime with different configuration bundle versions. The AgentCore Gateway injects the bundle reference into each request through W3C baggage headers, and the runtime uses that reference to pull the corresponding configuration with the AgentCore SDK, as sketched below.
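
On the runtime side, a minimal sketch of reading that bundle reference looks like the following. The baggage key name used here is an assumption for illustration, and the exact AgentCore SDK call that fetches the bundle is omitted.

    # Parse a W3C `baggage` header (comma-separated key=value pairs, optional
    # properties after ';') and pull the config bundle reference injected by
    # the Gateway. The baggage key name below is an assumption.
    BUNDLE_KEY = "agentcore.config_bundle_version"

    def parse_baggage(header_value: str) -> dict:
        entries = {}
        for member in header_value.split(","):
            member = member.split(";", 1)[0].strip()   # drop optional properties
            if "=" in member:
                key, value = member.split("=", 1)
                entries[key.strip()] = value.strip()
        return entries

    # Example request headers as seen by the agent runtime.
    headers = {"baggage": f"{BUNDLE_KEY}=3, session.id=abc123"}

    bundle_version = parse_baggage(headers["baggage"]).get(BUNDLE_KEY)
    if bundle_version is not None:
        # Fetch the matching configuration (system prompt, model ID, tool
        # descriptions) with the AgentCore SDK before handling the request.
        print(f"Serving this session with configuration bundle version {bundle_version}")

Because baggage is a standard propagation header, an OpenTelemetry-instrumented runtime can also read the same value through its baggage API instead of parsing the header by hand.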

Choosing a pattern

  • What varies: target-based variants swap the entire runtime endpoint (code, framework, model); configuration bundle variants change the system prompt, tool descriptions, or model parameters.

  • Routing: target-based variants use a different target per variant; configuration bundle variants use the same target with different config bundles.

  • Evaluation config: target-based variants use one online eval config per AgentCore Runtime; configuration bundle variants share a single online eval config.

  • When to use: target-based variants for code changes, framework upgrades, or comparing different agents; configuration bundle variants for configuration-only changes on a single AgentCore Runtime.
