Generative AI observability - Amazon CloudWatch

Generative AI observability

With Amazon CloudWatch, you can observe generative AI workloads, including Amazon Bedrock AgentCore agents, and gain insights into AI performance, health, and accuracy. CloudWatch provides pre-configured views into the latency, usage, and errors of your AI workloads, helping you detect issues faster in components such as models and agents. End-to-end prompt tracing helps you quickly identify issues in components such as knowledge bases, tools, and models. CloudWatch's AI monitoring capabilities are compatible with popular generative AI orchestration frameworks such as AWS Strands, LangChain, and LangGraph, offering flexibility with your choice of framework.

CloudWatch generative AI observability enables you to:

  • Gain insights into end-user outcomes, AI performance, health, and accuracy while reducing human-in-the-loop (HITL) assessment burden

  • Monitor model invocations, agents (managed, self-hosted, and third-party), knowledge bases, guardrails, and tools

  • Progress from agent experimentation to production GenAI applications while maintaining quality, performance, and reliability. For more information, see What is Amazon Bedrock AgentCore?

  • Identify the source of errors quickly using end-to-end prompt tracing, curated metrics, and logs

  • Troubleshoot issues across your entire GenAI application and its underlying infrastructure, using existing CloudWatch observability tools such as Application Signals, Alarms, Dashboards, Sensitive data protection, and Logs Insights

  • Access prompt traces while using Amazon Bedrock, and send structured traces of third-party models to CloudWatch using the AWS Distro for OpenTelemetry (ADOT) SDK. For information about adding observability to your Amazon Bedrock AgentCore agent or tool, see Amazon Bedrock AgentCore
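To make the structured-trace idea concrete, the following is a minimal, standard-library-only sketch of the kind of span-shaped record that a tracing SDK such as ADOT assembles for a third-party model invocation. The field and attribute names (`gen_ai.request.model`, `gen_ai.usage.input_tokens`, and so on) loosely follow the OpenTelemetry GenAI semantic conventions and are assumptions for illustration; in practice the ADOT SDK constructs and exports these spans for you rather than you building them by hand.

```python
# Illustrative sketch only: a span-like record for one model invocation.
# Attribute names are assumed from the OTel GenAI semantic conventions;
# the real ADOT SDK builds and exports spans for you.
import json
import time
import uuid


def make_model_span(model_id, input_tokens, output_tokens, latency_ms):
    """Build a span-shaped dict describing a single model call."""
    return {
        "traceId": uuid.uuid4().hex,        # unique id tying spans of one request together
        "name": "model.invoke",             # operation this span represents
        "startTimeUnixNano": time.time_ns(),
        "attributes": {
            "gen_ai.request.model": model_id,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "latency_ms": latency_ms,
        },
    }


span = make_model_span("example-model", 42, 128, 350)
print(json.dumps(span["attributes"], indent=2))
```

Once exported to CloudWatch, records like this are what end-to-end prompt tracing stitches together across models, tools, and knowledge bases.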

CloudWatch generative AI observability provides two pre-built dashboards:

Note

You must enable Amazon Bedrock to view the Model Invocations dashboard.

  • Model Invocations – Detailed metrics on model usage, token consumption, and costs

  • Amazon Bedrock AgentCore agents – Performance and decision metrics for Amazon Bedrock AgentCore agents

Key metrics available in these dashboards include:

  • Total and average invocations

  • Token usage (total, average per query, input, output)

  • Latency (average, P90, P99)

  • Error rates and throttling events

  • Cost attribution by application, user role, or specific user
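To clarify what the latency metrics above mean, here is a small self-contained sketch that computes the average, P90, and P99 from raw per-invocation latencies, the way a dashboard aggregates them. The sample values are made up for demonstration; P90 and P99 use the nearest-rank percentile method, which is one common convention (implementations differ in how they interpolate).

```python
# Illustrative sketch: average, P90, and P99 latency computed from raw
# per-invocation samples, mirroring the dashboard's latency metrics.
def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest-rank method
    return ordered[rank - 1]


# Made-up latencies (ms) for ten model invocations; note the two outliers.
latencies_ms = [120, 95, 130, 400, 110, 105, 980, 115, 125, 100]

avg = sum(latencies_ms) / len(latencies_ms)  # 228.0
p90 = percentile(latencies_ms, 90)           # 400
p99 = percentile(latencies_ms, 99)           # 980
```

The spread between the average and the tail percentiles is exactly why the dashboards surface P90 and P99 alongside the mean: a few slow invocations can dominate user-perceived latency while barely moving the average.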