Evaluation terminology
Understanding key concepts and terminology is essential for effectively using AgentCore Evaluations. The following terms define the core components and processes involved in agent evaluation.
Agent Framework
An agent framework provides the foundational components for building, orchestrating, and running agent-based applications. Frameworks define structures such as steps, tools, control flow, and memory management. Common industry frameworks include Strands Agents and LangGraph.
AgentCore Evaluations currently supports Strands Agents and LangGraph agent frameworks.
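For illustration, the following sketch shows a small agent built with the Strands Agents SDK. The import names, the Agent and tool usage, and the get_order_status tool are assumptions shown for orientation; consult the framework's documentation for the exact API.

```python
# A minimal sketch of an agent built with the Strands Agents SDK.
# Import names and call signatures are illustrative; check the
# Strands Agents documentation for the exact API.
from strands import Agent, tool

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order (hypothetical tool)."""
    return f"Order {order_id} has shipped."

# The framework handles orchestration: model calls, tool selection,
# control flow, and conversation state.
agent = Agent(tools=[get_order_status])
response = agent("What is the status of order 12345?")
print(response)
```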
Instrumentation Library
An instrumentation library records telemetry generated by your agent during execution. This telemetry can include traces, spans, tool calls, model invocations, and intermediate steps. Libraries such as OpenTelemetry and OpenInference offer standardized APIs and semantic conventions that allow you to capture agent behavior with minimal code changes. Instrumentation is required for trace collection and evaluation.
AgentCore Evaluations currently supports OpenTelemetry and OpenInference as instrumentation libraries.
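As a sketch of manual instrumentation, the following example wraps an agent step in a span using the OpenTelemetry Python API. The span name and attribute keys are placeholders, not required conventions.

```python
# A minimal sketch of manual instrumentation with the OpenTelemetry Python API.
# Span names and attribute keys here are placeholders, not required conventions.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")  # hypothetical instrumentation scope name

def answer_question(question: str) -> str:
    # Record the operation as a span so it appears in the collected trace.
    with tracer.start_as_current_span("agent.answer_question") as span:
        span.set_attribute("input.value", question)
        answer = "..."  # the model invocation would happen here
        span.set_attribute("output.value", answer)
        return answer
```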
Instrumentation Agent
An instrumentation agent automatically captures telemetry from application code, processes it, and exports it to a backend service for storage or evaluation. Tools such as ADOT (AWS Distro for OpenTelemetry) provide a vendor-neutral, production-ready auto-instrumentation agent that dynamically injects bytecode to capture traces without code changes. The agent is a key component in enabling automated evaluation.
AgentCore Evaluations currently supports ADOT (AWS Distro for OpenTelemetry) as the instrumentation agent.
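With an auto-instrumentation agent such as ADOT you normally do not write this setup yourself. The sketch below shows, under an assumed endpoint, the kind of exporter wiring the agent configures on your behalf using the OpenTelemetry SDK.

```python
# A sketch of what an auto-instrumentation agent configures for you:
# a tracer provider that batches spans and exports them over OTLP.
# The endpoint is an assumption; in practice it is usually supplied through
# environment variables such as OTEL_EXPORTER_OTLP_ENDPOINT.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```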
Session
A session represents a logical grouping of related interactions from a single user or workflow. A session may contain one or more traces. Sessions help you view and evaluate agent behavior across multi-step interactions, rather than focusing on individual requests.
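For example, tagging the root span of each request with a shared identifier lets spans from several requests be grouped into one session. The "session.id" attribute key below follows a common semantic convention but is shown here as an assumption.

```python
# A sketch of grouping related traces into a session by tagging the
# root span of each request with a shared identifier. The "session.id"
# attribute key is an assumed semantic convention.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

def handle_request(session_id: str, user_input: str) -> None:
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("session.id", session_id)
        # ... agent work for this turn ...

# Two requests that share a session id belong to the same session.
handle_request("session-123", "Book a flight to Seattle")
handle_request("session-123", "Make it a window seat")
```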
Trace
A trace is a complete record of a single agent execution or request. A trace contains one or more spans, which represent the individual operations performed during that execution. Traces provide end-to-end visibility into agent decisions and tool usage.
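In OpenTelemetry terms, a trace is formed by a root span and the child spans nested under it, as in the sketch below. The span names are placeholders chosen for illustration.

```python
# A sketch of one trace: a root span for the request with child spans
# for the individual operations performed during that execution.
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

with tracer.start_as_current_span("agent.invocation"):         # root span
    with tracer.start_as_current_span("llm.plan"):             # child span
        pass  # decide which tool to call
    with tracer.start_as_current_span("tool.search_flights"):  # child span
        pass  # execute the tool
    with tracer.start_as_current_span("llm.respond"):          # child span
        pass  # generate the final answer
```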
Tool Call
A tool call is a span that represents an agent's invocation of an external function, API, or capability. Tool call spans typically capture information such as the tool name, input parameters, execution time, and output. Tool call details are used to evaluate whether the agent selected and used tools correctly and efficiently.
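A tool call span might carry attributes like the ones in this sketch. The attribute keys are illustrative rather than required conventions, and the tool result is hard-coded for brevity.

```python
# A sketch of recording a tool call as a span with its name, inputs,
# and output. Attribute keys are illustrative, not required conventions.
import json
from opentelemetry import trace

tracer = trace.get_tracer("my.agent.app")

def call_tool(name: str, params: dict) -> str:
    with tracer.start_as_current_span(f"tool.{name}") as span:
        span.set_attribute("tool.name", name)
        span.set_attribute("tool.parameters", json.dumps(params))
        result = "2 results found"  # the real tool would run here
        span.set_attribute("tool.output", result)
        # Execution time is captured by the span's start and end timestamps.
        return result
```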
Reference-Free Large Language Models (LLMs) as judges
Large language model (LLM) as a judge refers to an evaluation method that uses an LLM to automatically assess the quality, correctness, or effectiveness of an agent's or another model's output. Instead of relying on manual review or rule-based checks, the judge LLM is prompted with evaluation criteria and produces a score, label, or explanation based on the input and output being evaluated. Unlike traditional evaluations that compare against ground-truth data, reference-free LLM-as-a-judge methods rely on the judge model's internal knowledge to make judgments. This approach enables scalable, consistent, and customizable qualitative assessments, such as correctness, reasoning quality, or instruction adherence, across large numbers of agent interactions or model responses.
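Conceptually, a reference-free judge works like the following sketch, which prompts a model through the Amazon Bedrock Converse API with evaluation criteria and the interaction to score. The model ID, prompt wording, and output handling are illustrative assumptions, not the AgentCore Evaluations implementation.

```python
# A conceptual sketch of reference-free LLM-as-a-judge scoring using the
# Amazon Bedrock Converse API. Model ID, prompt wording, and score handling
# are illustrative assumptions, not the AgentCore Evaluations implementation.
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge(question: str, agent_answer: str) -> str:
    prompt = (
        "You are an evaluator. Rate the answer's correctness from 1 to 5 "
        "and explain your rating briefly.\n"
        f"Question: {question}\nAnswer: {agent_answer}"
    )
    response = bedrock.converse(
        modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Return the judge's score and explanation as produced by the model.
    return response["output"]["message"]["content"][0]["text"]
```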