Evaluation terminology
Understanding key concepts and terminology is essential for effectively using AgentCore Evaluations. The following terms define the core components and processes involved in agent evaluation.
Agent Framework
An agent framework provides the foundational components for building, orchestrating, and running agent-based applications. Frameworks define structures such as steps, tools, control flow, and memory management. Common industry frameworks include Strands Agents and LangGraph.
AgentCore Evaluations currently supports Strands Agents and LangGraph agent frameworks.
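For illustration, the following sketch shows a small agent built with the Strands Agents SDK. The import names, the Agent and tool usage, and the get_order_status tool are assumptions shown for orientation; consult the framework's documentation for the exact API.

```python
# A minimal sketch of an agent built with the Strands Agents SDK.
# Import names and call signatures are illustrative; check the
# Strands Agents documentation for the exact API.
from strands import Agent, tool

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order (hypothetical tool)."""
    return f"Order {order_id} has shipped."

# The framework handles orchestration: model calls, tool selection,
# control flow, and conversation state.
agent = Agent(tools=[get_order_status])
response = agent("What is the status of order 12345?")
print(response)
```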
Instrumentation Library
An instrumentation library records telemetry generated by your agent during execution. This telemetry can include traces, spans, tool calls, model invocations, and intermediate steps. Libraries such as OpenTelemetry and OpenInference offer standardized APIs and semantic conventions that allow you to capture agent behavior with minimal code changes. Instrumentation is required for trace collection and evaluation.
AgentCore Evaluations currently supports OpenTelemetry and OpenInference as instrumentation libraries.
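As a sketch of manual instrumentation, the following example wraps an agent step in a span using the OpenTelemetry Python API. The span name and attribute keys are placeholders, not required conventions.

```python
# A minimal sketch of manual instrumentation with the OpenTelemetry Python API.
# Span names and attribute keys here are placeholders, not required conventions.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")  # hypothetical instrumentation scope name

def answer_question(question: str) -> str:
    # Record the operation as a span so it appears in the collected trace.
    with tracer.start_as_current_span("agent.answer_question") as span:
        span.set_attribute("input.value", question)
        answer = "..."  # the model invocation would happen here
        span.set_attribute("output.value", answer)
        return answer
```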
Instrumentation Agent
An instrumentation agent automatically captures telemetry from application code, processes it, and exports it to a backend service for storage or evaluation. Tools such as ADOT (AWS Distro for OpenTelemetry) provide a vendor-neutral, production-ready auto-instrumentation agent that dynamically injects bytecode to capture traces without code changes. The agent is a key component in enabling automated evaluation.
AgentCore Evaluations currently supports ADOT (AWS Distro for OpenTelemetry) as the instrumentation agent.
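With an auto-instrumentation agent such as ADOT you normally do not write this setup yourself. The sketch below shows, under an assumed endpoint, the kind of exporter wiring the agent configures on your behalf using the OpenTelemetry SDK.

```python
# A sketch of what an auto-instrumentation agent configures for you:
# a tracer provider that batches spans and exports them over OTLP.
# The endpoint is an assumption; in practice it is usually supplied through
# environment variables such as OTEL_EXPORTER_OTLP_ENDPOINT.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```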
Session
A session represents a logical grouping of related interactions from a single user or workflow. A session may contain one or more traces. Sessions help you view and evaluate agent behavior across multi-step interactions, rather than focusing on individual requests.
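For example, tagging the root span of each request with a shared identifier lets spans from several requests be grouped into one session. The "session.id" attribute key below follows a common semantic convention but is shown here as an assumption.

```python
# A sketch of grouping related traces into a session by tagging the
# root span of each request with a shared identifier. The "session.id"
# attribute key is an assumed semantic convention.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

def handle_request(session_id: str, user_input: str) -> None:
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("session.id", session_id)
        # ... agent work for this turn ...

# Two requests that share a session id belong to the same session.
handle_request("session-123", "Book a flight to Seattle")
handle_request("session-123", "Make it a window seat")
```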
Trace
A trace is a complete record of a single agent execution or request. A trace contains one or more spans, which represent the individual operations performed during that execution. Traces provide end-to-end visibility into agent decisions and tool usage.
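In OpenTelemetry terms, a trace is formed by a root span and the child spans nested under it, as in the sketch below. The span names are placeholders chosen for illustration.

```python
# A sketch of one trace: a root span for the request with child spans
# for the individual operations performed during that execution.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

with tracer.start_as_current_span("agent.invocation"):         # root span
    with tracer.start_as_current_span("llm.plan"):             # child span
        pass  # decide which tool to call
    with tracer.start_as_current_span("tool.search_flights"):  # child span
        pass  # execute the tool
    with tracer.start_as_current_span("llm.respond"):          # child span
        pass  # generate the final answer
```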
Tool Call
A tool call is a span that represents an agent's invocation of an external function, API, or capability. Tool call spans typically capture information such as the tool name, input parameters, execution time, and output. Tool call details are used to evaluate whether the agent selected and used tools correctly and efficiently.
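A tool call span might carry attributes like the ones in this sketch. The attribute keys are illustrative rather than required conventions, and the tool result is hard-coded for brevity.

```python
# A sketch of recording a tool call as a span with its name, inputs,
# and output. Attribute keys are illustrative, not required conventions.
import json
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

def call_tool(name: str, params: dict) -> str:
    with tracer.start_as_current_span(f"tool.{name}") as span:
        span.set_attribute("tool.name", name)
        span.set_attribute("tool.parameters", json.dumps(params))
        result = "2 results found"  # the real tool would run here
        span.set_attribute("tool.output", result)
        # Execution time is captured by the span's start and end timestamps.
        return result
```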
Reference-Free Large Language Models (LLMs) as judges
Large language model (LLM) as a judge refers to an evaluation method that uses an LLM to automatically assess the quality, correctness, or effectiveness of an agent's or another model's output. Instead of relying on manual review or rule-based checks, the judge LLM is prompted with evaluation criteria and produces a score, label, or explanation based on the input and output being evaluated. Unlike traditional evaluations that compare against ground-truth data, reference-free LLM-as-a-judge methods rely on the judge model's internal knowledge to make judgments. This approach enables scalable, consistent, and customizable qualitative assessments, such as correctness, reasoning quality, or instruction adherence, across large numbers of agent interactions or model responses.
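Conceptually, a reference-free judge works like the following sketch, which prompts a model through the Amazon Bedrock Converse API with evaluation criteria and the interaction to score. The model ID, prompt wording, and output handling are illustrative assumptions, not the AgentCore Evaluations implementation.

```python
# A conceptual sketch of reference-free LLM-as-a-judge scoring using the
# Amazon Bedrock Converse API. Model ID, prompt wording, and score handling
# are illustrative assumptions, not the AgentCore Evaluations implementation.
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge(question: str, agent_answer: str) -> str:
    prompt = (
        "You are an evaluator. Rate the answer's correctness from 1 to 5 "
        "and explain your rating briefly.\n"
        f"Question: {question}\nAnswer: {agent_answer}"
    )
    response = bedrock.converse(
        modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Return the judge's score and explanation as produced by the model.
    return response["output"]["message"]["content"][0]["text"]
```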