
Getting started with on-demand evaluation

Follow these steps to set up and run your first on-demand evaluation.

Prerequisites

To use the AgentCore Evaluations on-demand evaluation features, you need:

  • AWS Account with appropriate IAM permissions

  • Amazon Bedrock access with model invocation permissions

  • Transaction Search enabled in CloudWatch - see Enable Transaction Search

  • Python 3.10 or later installed

  • The OpenTelemetry library – Include aws-opentelemetry-distro (ADOT) in your requirements.txt file (see the example below)
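
For reference, a minimal requirements.txt could look like the following sketch. Only the aws-opentelemetry-distro entry comes from the prerequisite above; boto3 (used by the scripts later in this guide) and the framework line are illustrative and depend on your agent.

# requirements.txt (illustrative sketch)
aws-opentelemetry-distro   # ADOT, required so your agent emits spans
boto3                      # used by the invocation and evaluation scripts in this guide
# ...plus the package for your agent framework (for example, Strands Agents or LangGraph)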

Supported frameworks

AgentCore Evaluations currently supports the following agentic frameworks and instrumentation libraries (an instrumentation sketch follows this list):

  • Strands Agents

  • LangGraph configured with one of the following instrumentation libraries:

    • opentelemetry-instrumentation-langchain

    • openinference-instrumentation-langchain
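
If you use LangGraph with one of these libraries and are not relying on ADOT auto-instrumentation, enabling the instrumentor in code typically looks like the following sketch (shown here for openinference-instrumentation-langchain; this is an illustration only, so check your instrumentation library's documentation for the exact setup):

# Sketch: enable the OpenInference LangChain instrumentor so LangGraph calls emit spans
from openinference.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument()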

Step 1: Create and deploy your agent

Note

If you already have an agent up and running in AgentCore Runtime, you can skip ahead to step 2.

Create and deploy your agent by following the Get Started guide for AgentCore Runtime. You can find additional examples in the AgentCore Evaluations Samples.

Step 2: Invoke your agent

Invoke your agent using the following script, and then view the traces, sessions, and metrics on the GenAI Observability dashboard in CloudWatch.

Example invoke_agent.py

import boto3
import json
import uuid

region = "region-code"
ace_demo_agent_arn = "agent-arn from step 1"

agent_core_client = boto3.client('bedrock-agentcore', region_name=region)

text_to_analyze = "Sample text to test agent for agentcore evaluations demo"
payload = json.dumps({
    "prompt": f"Can you analyze this text and tell me about its statistics: {text_to_analyze}"
})

# random session-id, you can set your own here
session_id = "test-ace-demo-session-18a1dba0-62a0-462g"

response = agent_core_client.invoke_agent_runtime(
    agentRuntimeArn=ace_demo_agent_arn,
    runtimeSessionId=session_id,
    payload=payload,
    qualifier="DEFAULT"
)

response_body = response['response'].read()
response_data = json.loads(response_body)
print("Agent Response:", response_data)
print("SessionId:", session_id)

Step 3: Evaluate agent

After you have invoked your agent a few times, you are ready to evaluate it. An evaluation requires the following:

  • EvaluatorId: the ID of either a built-in evaluator or a custom evaluator that you created

  • SessionSpans: spans are the telemetry blocks emitted when you interact with an application. The application in our example is an agent hosted on AgentCore Runtime.

    • For on-demand evaluation, you download the spans from CloudWatch log groups and use them as the evaluation input.

    • The AgentCore starter toolkit does this for you automatically and is the easiest way to get started.

    • If you are not using the starter toolkit, the AWS SDK example later in this section shows how to download the logs for a session ID and use them for evaluation.

Code samples for the starter toolkit CLI, starter toolkit SDK, and AWS SDK

The following code samples demonstrate how to run on-demand evaluations using different development approaches. Choose the method that best fits your development environment and preferences.

AgentCore starter toolkit CLI
# Runs evaluation for the specified agent and session.
# It auto-queries CloudWatch logs and orchestrates evaluation over multiple evaluators.
AGENT_ID="YOUR_AGENT_ID"
SESSION_ID="YOUR_SESSION_ID"

agentcore eval run \
  --agent-id $AGENT_ID \
  --session-id $SESSION_ID \
  --evaluator "Builtin.Helpfulness" \
  --evaluator "Builtin.GoalSuccessRate"

# Auto-reads the default agentId and sessionId from the current agent config (.bedrock_agentcore.yaml) if available.
# Verify using `agentcore status` or look for ".bedrock_agentcore.yaml".
agentcore eval run \
  --evaluator "Builtin.Helpfulness" \
  --evaluator "Builtin.GoalSuccessRate"

AgentCore starter toolkit SDK
from bedrock_agentcore_starter_toolkit import Evaluation

# Initialize the evaluation client
eval_client = Evaluation()

# Run evaluation on a specific session
results = eval_client.run(
    agent_id="YOUR_AGENT_ID",      # Replace with your agent ID
    session_id="YOUR_SESSION_ID",  # Replace with your session ID
    evaluators=["Builtin.Helpfulness", "Builtin.GoalSuccessRate"]
)

# Display results
successful = results.get_successful_results()
failed = results.get_failed_results()
print(f"Successful: {len(successful)}")
print(f"Failed: {len(failed)}")

if successful:
    result = successful[0]
    print("\n📊 Result:")
    print(f"  Evaluator: {result.evaluator_name}")
    print(f"  Score: {result.value:.2f}")
    print(f"  Label: {result.label}")
    if result.explanation:
        print(f"  Explanation: {result.explanation[:150]}...")

AWS SDK

Download span-logs from CloudWatch

Before calling the Evaluate API, you need to download the span logs from CloudWatch. You can use the following Python code to do so and, optionally, save the logs to a JSON file. Saving the file makes it easier to rerun the request for the same session with different evaluators.

Note

It can take a couple of minutes for logs to populate in CloudWatch, so if you run the following script immediately after invoking your agent, the logs might be empty or incomplete.

import boto3
import time
import json
from datetime import datetime, timedelta

region = "region-code"
agent_id = "agent-id from step 1"      # add the agent-id from step 1 here
session_id = "session-id from step 2"  # use the session-id from step 2 here

def query_logs(log_group_name, query_string):
    client = boto3.client('logs', region_name=region)
    start_time = datetime.now() - timedelta(minutes=60)  # past 1 hour
    end_time = datetime.now()
    query_id = client.start_query(
        logGroupName=log_group_name,
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string
    )['queryId']
    while (result := client.get_query_results(queryId=query_id))['status'] not in ['Complete', 'Failed']:
        time.sleep(1)
    if result['status'] == 'Failed':
        raise Exception("Query failed")
    return result['results']

def query_session_logs(log_group_name, session_id, **kwargs):
    query = f"""fields @timestamp, @message
    | filter ispresent(scope.name) and ispresent(attributes.session.id)
    | filter attributes.session.id = "{session_id}"
    | sort @timestamp asc"""
    return query_logs(log_group_name, query, **kwargs)

def query_agent_runtime_logs(agent_id, endpoint, session_id, **kwargs):
    return query_session_logs(
        f"/aws/bedrock-agentcore/runtimes/{agent_id}-{endpoint}",
        session_id, **kwargs)

def query_aws_spans_logs(session_id, **kwargs):
    return query_session_logs("aws/spans", session_id, **kwargs)

def extract_messages_as_json(query_results):
    return [json.loads(f['value'])
            for row in query_results
            for f in row
            if f['field'] == '@message' and f['value'].strip().startswith('{')]

def get_session_span_logs():
    agent_runtime_logs = query_agent_runtime_logs(
        agent_id=agent_id,
        endpoint="DEFAULT",
        session_id=session_id
    )
    print(f"Downloaded {len(agent_runtime_logs)} runtime-log entries")

    aws_span_logs = query_aws_spans_logs(session_id=session_id)
    print(f"Downloaded {len(aws_span_logs)} aws/span entries")

    session_span_logs = extract_messages_as_json(aws_span_logs) + extract_messages_as_json(agent_runtime_logs)
    print(f"Returning {len(session_span_logs)} total records")
    return session_span_logs

# get the spans from CloudWatch
session_span_logs = get_session_span_logs()

# optional (dump in a json file for reuse)
session_span_logs_file_name = "ace-demo-session.json"
with open(session_span_logs_file_name, "w") as f:
    json.dump(session_span_logs, f, indent=2)

Call Evaluate

Once you have the input spans, you can invoke the Evaluate API. Note that responses may take a few moments because a large language model scores your traces.

# initialise client
ace_dp_client = boto3.client('agentcore-evaluation-dataplane', region_name=region)

# call evaluate
response = ace_dp_client.evaluate(
    evaluatorId = "Builtin.Helpfulness",  # can be a custom evaluator id as well
    evaluationInput = {"sessionSpans": session_span_logs}
)
print(response["evaluationResults"])

If you used the code above to dump the session spans to a JSON file, you can subsequently run evaluations from that file, as shown below.

with open(session_span_logs_file_name, "r") as f:
    session_span_logs = json.load(f)

# initialise client
ace_dp_client = boto3.client('agentcore-evaluation-dataplane', region_name=region)

# call evaluate
response = ace_dp_client.evaluate(
    evaluatorId = "Builtin.ToolSelectionAccuracy",  # can be a custom evaluator id as well
    evaluationInput = {"sessionSpans": session_span_logs}
)
print(response["evaluationResults"])

Using evaluation targets

To evaluate a specific trace or tool within a session, you can specify the target using the evaluationTarget parameter in your request.

The evaluationTarget parameter you specify depends on the evaluator level:

Session-level evaluator

Since the service supports only one session per evaluation, you do not need to explicitly set the evaluation target.
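
For example, a session-level call can look like the following sketch, which reuses the ace_dp_client and session_span_logs variables from the earlier AWS SDK examples. Builtin.GoalSuccessRate is used here only for illustration, on the assumption that it is a session-level evaluator; substitute the session-level evaluator ID you want to run.

# Session-level evaluation: the whole session is the target, so no evaluationTarget is needed
response = ace_dp_client.evaluate(
    evaluatorId = "Builtin.GoalSuccessRate",  # illustrative; use your session-level evaluator id
    evaluationInput = {"sessionSpans": session_span_logs}
)
print(response["evaluationResults"])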

Trace-level evaluator

For trace-level evaluators (such as Builtin.Helpfulness or Builtin.Correctness), set the trace IDs in the evaluationTarget parameter:

response = ace_dp_client.evaluate(
    evaluatorId = "Builtin.Helpfulness",
    evaluationInput = {"sessionSpans": session_span_logs},
    evaluationTarget = {"traceIds": ["trace-id-1", "trace-id-2"]}
)
Tool call level evaluator

For span-level evaluators (such as Builtin.ToolSelectionAccuracy), set the span IDs in the evaluationTarget parameter:

response = ace_dp_client.evaluate(
    evaluatorId = "Builtin.ToolSelectionAccuracy",
    evaluationInput = {"sessionSpans": session_span_logs},
    evaluationTarget = {"spanIds": ["span-id-1", "span-id-2"]}
)

Step 4: Evaluation results

Each Evaluate API call returns a response containing a list of evaluator results. Because a single session can include multiple traces and tool calls, these elements are evaluated as separate entities. Consequently, a single API call may return multiple evaluation results.

{ "evaluationResults": [ {evaluation-result-1}, {evaluation-result_2},.... ] }

Result limit

The number of evaluations returned per API call is limited to 10 results. For example, if you evaluate a session containing 15 traces using a trace-level evaluator, the response includes a maximum of 10 results. By default, the API returns the last 10 evaluations, as these typically contain the most context relevant to evaluation quality.
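
If a session contains more traces than the limit and you need a result for every trace, one approach is to evaluate the traces in batches using the evaluationTarget parameter described earlier. The following is a minimal sketch; trace_ids is a hypothetical list that you would populate with the trace IDs from your session (for example, by reading them from the downloaded span logs).

# Hypothetical list of trace IDs from the session you want to evaluate
trace_ids = ["trace-id-1", "trace-id-2", "trace-id-3"]

all_results = []
# Evaluate the traces in batches of 10 so each call stays within the result limit
for i in range(0, len(trace_ids), 10):
    batch = trace_ids[i:i + 10]
    response = ace_dp_client.evaluate(
        evaluatorId = "Builtin.Helpfulness",
        evaluationInput = {"sessionSpans": session_span_logs},
        evaluationTarget = {"traceIds": batch}
    )
    all_results.extend(response["evaluationResults"])

print(f"Collected {len(all_results)} evaluation results")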

Partial failures

A single API call may successfully process some evaluations while others fail. Failures can occur for various reasons, including:

  • Throttling from model providers

  • Parsing errors

  • Model timeouts

  • Other processing issues

In cases of partial failure, the response includes both successful and failed evaluations. Failed results include an error code and error message to help you diagnose the issue.
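
The following sketch separates successful and failed entries, assuming response is the return value of an evaluate call (as in the earlier examples) and that failed entries are identified by the presence of errorCode, as shown in the example result entries later in this section.

# Split results into successful and failed entries
results = response["evaluationResults"]
successful = [r for r in results if "errorCode" not in r]
failed = [r for r in results if "errorCode" in r]

print(f"Successful: {len(successful)}, Failed: {len(failed)}")
for r in failed:
    # Each failed entry includes an error code and message for diagnosis
    print(f"{r['evaluatorId']}: {r['errorCode']} - {r['errorMessage']}")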

Span context

Each evaluator result has a spanContext field that identifies the entity that was evaluated (a short parsing sketch follows this list):

  • For session-level evaluators, only sessionId is present.

  • For trace-level evaluators, sessionId and traceId are present.

  • For tool-level evaluators, sessionId, traceId, and spanId are present.
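
The following sketch reads the spanContext of each result to report which entity a score applies to. It assumes response is the return value of an evaluate call and relies on the nested context.spanContext structure shown in the example result entries below.

# Map each result to the entity it evaluated using its spanContext
for result in response["evaluationResults"]:
    span_context = result["context"]["spanContext"]
    session_id = span_context.get("sessionId")
    trace_id = span_context.get("traceId")   # present for trace- and tool-level evaluators
    span_id = span_context.get("spanId")     # present only for tool-level evaluators
    print(f"{result['evaluatorId']}: session={session_id}, trace={trace_id}, span={span_id}")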

Example successful result entry

This is a single entry. If a session has multiple traces, you will see one such entry for each trace. Similarly, if there are multiple tool calls and a tool-level evaluator (such as Builtin.ToolSelectionAccuracy) is used, there will be one result per tool span.

{ "evaluatorArn": "arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness", "evaluatorId": "Builtin.Helpfulness", "evaluatorName": "Builtin.Helpfulness", "explanation": ".... evaluation explanation will be added here ...", "context": { "spanContext": { "sessionId": "test-ace-demo-session-18a1dba0-62a0-462e", "traceId": "....trace_id......." } }, "value": 0.83, "label": "Very Helpful", "tokenUsage": { "inputTokens": 958, "outputTokens": 211, "totalTokens": 1169 } }

Example failed result entry

{ "evaluatorArn": "arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness", "evaluatorId": "Builtin.Helpfulness", "evaluatorName": "Builtin.Helpfulness", "context": { "spanContext": { "sessionId": "test-ace-demo-session-18a1dba0-62a0-462e", "traceId": "....trace_id......." } }, "errorMessage": ".... details of the error....", "errorCode": ".... name/code of the error...." }