# OPS 4. How do you design your workload so that you can understand its state?
<a name="ops-04"></a>

 Design your workload so that it provides the information necessary across all components (for example, metrics, logs, and traces) for you to understand its internal state. This allows you to provide effective responses when appropriate. 

**Topics**
+ [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md)
+ [OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md)
+ [OPS04-BP03 Implement user activity telemetry](ops_telemetry_customer_telemetry.md)
+ [OPS04-BP04 Implement dependency telemetry](ops_telemetry_dependency_telemetry.md)
+ [OPS04-BP05 Implement transaction traceability](ops_telemetry_dist_trace.md)

# OPS04-BP01 Implement application telemetry
<a name="ops_telemetry_application_telemetry"></a>

 Application telemetry is the foundation for observability of your workload. Your application should emit telemetry that provides insight into the state of the application and the achievement of business outcomes. From troubleshooting to measuring the impact of a new feature, application telemetry informs the way you build, operate, and evolve your workload. 

 Application telemetry consists of metrics and logs. Metrics are diagnostic measurements, much like your pulse or temperature, and are used collectively to describe the state of your application. Collecting metrics over time lets you develop baselines and detect anomalies. Logs are messages that the application emits about its internal state or about events that occur. Error codes, transaction identifiers, and user actions are examples of events that are logged. 
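 To make the idea of baselines and anomalies concrete, the following is a minimal sketch that assumes a simple mean and standard deviation baseline. The function names, the sample values, and the threshold of three standard deviations are illustrative assumptions, not a description of how any AWS service detects anomalies.

```python
import statistics


def build_baseline(samples):
    """Summarize historical metric samples as (mean, standard deviation)."""
    return statistics.fmean(samples), statistics.stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that is more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical request latency samples (milliseconds) collected over time.
history = [102, 98, 101, 99, 100, 103, 97, 100]
baseline = build_baseline(history)

print(is_anomalous(101, baseline))  # close to the baseline
print(is_anomalous(250, baseline))  # far outside the baseline
```

 A fixed threshold like this is only a starting point. Metrics with daily or weekly cycles usually need a baseline that accounts for the time of day.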

 **Desired outcome:** 
+  Your application emits metrics and logs that provide insight into its health and the achievement of business outcomes. 
+  Metrics and logs are stored centrally for all applications in the workload. 

 **Common anti-patterns:** 
+  Your application doesn't emit telemetry. You are forced to rely upon your customers to tell you when something is wrong. 
+  A customer has reported that your application is unresponsive. You have no telemetry and are unable to confirm that the issue exists or characterize the issue without using the application yourself to understand the current user experience. 

 **Benefits of establishing this best practice:** 
+  You can understand the health of your application, the user experience, and the achievement of business outcomes. 
+  You can react quickly to changes in your application health. 
+  You can develop application health trends. 
+  You can make informed decisions about improving your application. 
+  You can detect and resolve application issues faster. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing application telemetry consists of three steps: identifying a location to store telemetry, identifying telemetry that describes the state of the application, and instrumenting the application to emit telemetry. 

 **Customer example** 

AnyCompany Retail has a microservices-based architecture. As part of their architectural design process, they identified application telemetry that would help them understand the state of each microservice. For example, the user cart service emits telemetry about events like add to cart and abandon cart, and about the length of time it took to add an item to the cart. All microservices log errors, warnings, and transaction information. Telemetry is sent to Amazon CloudWatch for storage and analysis. 

 **Implementation steps** 

1.  Identify a central location for telemetry storage for the applications in your workload. The location should support both collection of telemetry and analysis capabilities. Anomaly detection and automated insights are recommended features. 

   1.  [Amazon CloudWatch](https://aws.amazon.com/cloudwatch) provides telemetry collection, dashboards, analysis, and event generation capabilities. 

1.  To identify what telemetry you need, start by answering this question: what is the state of my application? Your application should emit logs and metrics that collectively answer this question. If you can’t answer the question with the existing application telemetry, work with business and engineering stakeholders to create a list of telemetry requirements. 

   1.  You can request expert technical advice from your AWS account team as you identify and develop new application telemetry. 

1.  Once the additional application telemetry has been identified, work with your engineering stakeholders to instrument your application. 

   1.  The [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) provides APIs, libraries, and agents that collect application telemetry. [This example demonstrates how to instrument a JavaScript application with custom metrics](https://aws-otel.github.io/docs/getting-started/js-sdk/metric-manual-instr). 

   1.  If you want to understand the observability services that AWS offers, work through the [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) or request support from your AWS account team. 

   1.  For a deeper dive into application telemetry, read the [Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) article in the Amazon Builders’ Library. It explains how Amazon instruments applications and can serve as a guide for developing your own instrumentation guidelines. 

 **Level of effort for the implementation plan:** High. Instrumenting your application and centralizing telemetry storage can take significant investment. 
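 As an illustration of the instrumentation step, the following sketch writes one structured log line in the CloudWatch embedded metric format, which carries a metric value and its log context together. It assumes an environment where such log lines are delivered to CloudWatch Logs (for example, through the CloudWatch agent or an AWS Lambda function). The `emit_metric` helper, the namespace, the `Service` dimension, and the field names passed as context are all hypothetical.

```python
import json
import sys
import time


def emit_metric(namespace, service, name, value, unit="Count", **context):
    """Write one log line in CloudWatch embedded metric format and return it."""
    record = dict(context)  # free-form log context, for example a transaction ID
    record.update({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since the epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": name, "Unit": unit}],
                }
            ],
        },
        "Service": service,  # value for the dimension declared above
        name: value,         # the metric value itself
    })
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line


# Hypothetical cart service event: how long it took to add an item to the cart.
emit_metric(
    "AnyCompanyRetail/Cart",
    "cart-service",
    "AddToCartLatency",
    42.5,
    unit="Milliseconds",
    event="add_to_cart",
    transaction_id="example-transaction-id",
)
```

 Keeping the metric and its context in one record means that a spike in a metric can be traced back to the individual log events that produced it.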

## Resources
<a name="resources"></a>

 **Related best practices:** 

[OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md) – Application telemetry is a component of workload telemetry. To understand the health of the overall workload, you need to understand the health of the individual applications that make up the workload. 

[OPS04-BP03 Implement user activity telemetry](ops_telemetry_customer_telemetry.md) – User activity telemetry is often a subset of application telemetry. User activity such as add to cart events, click streams, or completed transactions provides insight into the user experience. 

[OPS04-BP04 Implement dependency telemetry](ops_telemetry_dependency_telemetry.md) – Dependency checks are related to application telemetry and may be instrumented into your application. If your application relies on external dependencies like DNS or a database, it can emit metrics and logs on reachability, timeouts, and other events. 

[OPS04-BP05 Implement transaction traceability](ops_telemetry_dist_trace.md) – Tracing transactions across a workload requires each application to emit information about how it processes shared events. The way individual applications handle these events is emitted through their application telemetry. 

[OPS08-BP02 Define workload metrics](ops_workload_health_design_workload_metrics.md) – Workload metrics are the key health indicators for your workload. Key application metrics are a part of workload metrics. 

 **Related documents:** 
+  [Amazon Builders' Library – Instrumenting Distributed Systems for Operational Visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) 
+  [AWS Well-Architected Operational Excellence Whitepaper – Design Telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/design-telemetry.html) 
+  [Creating metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
+  [Implementing Logging and Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/welcome.html) 
+  [Monitoring application health and performance with AWS Distro for OpenTelemetry](https://aws.amazon.com/blogs/opensource/monitoring-application-health-and-performance-with-aws-distro-for-opentelemetry/) 
+  [New – How to better monitor your custom application metrics using Amazon CloudWatch Agent](https://aws.amazon.com/blogs/devops/new-how-to-better-monitor-your-custom-application-metrics-using-amazon-cloudwatch-agent/) 
+  [Observability at AWS](https://aws.amazon.com/products/management-and-governance/use-cases/monitoring-and-observability/) 
+  [Scenario – Publish metrics to CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/PublishMetrics.html) 
+  [Start Building – How to Monitor your Applications Effectively](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/) 
+  [Using CloudWatch with an AWS SDK](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/sdk-general-information-section.html) 

 **Related videos:** 
+  [AWS re:Invent 2021 - Observability the open-source way](https://www.youtube.com/watch?v=vAnIhIwE5hY) 
+  [Collect Metrics and Logs from Amazon EC2 instances with the CloudWatch Agent](https://www.youtube.com/watch?v=vAnIhIwE5hY) 
+  [How to Easily Setup Application Monitoring for Your AWS Workloads - AWS Online Tech Talks](https://www.youtube.com/watch?v=LKCth30RqnA) 
+  [Mastering Observability of Your Serverless Applications - AWS Online Tech Talks](https://www.youtube.com/watch?v=CtsiXhiAUq8) 
+  [Open Source Observability with AWS - AWS Virtual Workshop](https://www.youtube.com/watch?v=vAnIhIwE5hY) 

 **Related examples:** 
+  [AWS Logging & Monitoring Example Resources](https://github.com/aws-samples/logging-monitoring-apg-guide-examples) 
+  [AWS Solution: Amazon CloudWatch Monitoring Framework](https://aws.amazon.com/solutions/implementations/amazon-cloudwatch-monitoring-framework/?did=sl_card&trk=sl_card) 
+  [AWS Solution: Centralized Logging](https://aws.amazon.com/solutions/implementations/centralized-logging/) 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) 

 **Related services:** 
+ [ Amazon CloudWatch ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)

# OPS04-BP02 Implement and configure workload telemetry
<a name="ops_telemetry_workload_telemetry"></a>

 Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events. Use this information to help determine when a response is required. 

 Use a service such as [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to aggregate logs and metrics from workload components (for example, API logs from [AWS CloudTrail](https://aws.amazon.com/cloudtrail/), [AWS Lambda metrics](https://docs.aws.amazon.com/lambda/latest/dg/lambda-monitoring.html), [Amazon VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html), and [other services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/aws-services-sending-logs.html)). 

 **Common anti-patterns:** 
+  Your customers are complaining about poor performance. There are no recent changes to your application and so you suspect an issue with a workload component. You have no telemetry to analyze to determine what component or components are contributing to the poor performance. 
+  Your application is unreachable. You lack the telemetry to determine if it's a networking issue. 

 **Benefits of establishing this best practice:** Understanding what is going on inside your workload helps you to respond if necessary. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Implement log and metric telemetry: Instrument your workload to emit information about its internal state, status, and the achievement of business outcomes. Use this information to determine when a response is required. 
   +  [Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks](https://youtu.be/1Ck_me4azMw) 
   +  [How Amazon CloudWatch works](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html) 
   +  [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 
   +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
   +  [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 
+  Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events). 
   +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
   +  [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) 
   +  [What Is AWS CloudTrail?](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) 
   +  [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) 
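
 As a rough sketch of turning workload status information into metrics, the following example counts HTTP responses by status class and shapes the counts as metric datums. The helper names, the `Api` dimension, and the sample status codes are assumptions for illustration. The dictionary keys follow the metric datum fields accepted by the CloudWatch `PutMetricData` API, but sending the data requires an AWS SDK client and is not shown.

```python
from collections import Counter


def summarize_status_codes(status_codes):
    """Count responses per HTTP status class (2xx, 3xx, 4xx, 5xx)."""
    return Counter(f"{code // 100}xx" for code in status_codes)


def to_metric_data(counts, api_name):
    """Shape the counts as a list of metric datums, one per status class."""
    return [
        {
            "MetricName": f"Http{status_class}",
            "Dimensions": [{"Name": "Api", "Value": api_name}],
            "Value": count,
            "Unit": "Count",
        }
        for status_class, count in sorted(counts.items())
    ]


# Hypothetical status codes observed for one API during a reporting interval.
observed = [200, 200, 201, 404, 500, 503, 200]
metric_data = to_metric_data(summarize_status_codes(observed), api_name="orders")

for datum in metric_data:
    print(datum["MetricName"], datum["Value"])
```

 The total number of observed codes doubles as an API call volume metric, and a rising share of 5xx responses is a common signal that a response is required.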

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) 
+  [Amazon CloudWatch Documentation](https://docs.aws.amazon.com/cloudwatch/index.html) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [How Amazon CloudWatch works](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
+  [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) 
+  [What Is AWS CloudTrail?](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) 
+  [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 
+  [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 

 **Related videos:** 
+  [Application Performance Management on AWS](https://www.youtube.com/watch?v=5T4stR-HFas) 
+  [Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks](https://youtu.be/1Ck_me4azMw) 

# OPS04-BP03 Implement user activity telemetry
<a name="ops_telemetry_customer_telemetry"></a>

Instrument your application code to emit information about user activity. Examples of user activity include click streams and started, abandoned, and completed transactions. Use this information to understand how the application is used, to identify patterns of usage, and to determine when a response is required. Capturing real user activity allows you to build synthetic activity that can be used to monitor and test your workload in production.

 **Desired outcome:** 
+  Your workload emits telemetry about user activity across all applications. 
+  You leverage synthetic user activity to monitor your application during off-peak hours. 

 **Common anti-patterns:** 
+ Your developers have deployed a new feature without user telemetry. You cannot tell if your customers are using the feature without asking them. 
+ After a deployment to your front-end application, you see increased utilization. Because you lack user activity telemetry, it is difficult to identify the exact issue.
+  An issue occurs in your application during off-peak hours. You do not notice the issue until the morning when your users come online because you have not configured synthetic user activity. 

 **Benefits of establishing this best practice:** 
+  Understand common user patterns or unexpected behaviors to optimize functionality of the application to fit your business goals. 
+  Monitor the application from the perspective of your users to detect problems with user experience, such as broken links or slow click responses. 
+  Identify the root cause of issues by tracing the steps your impacted user has taken. 
+  Synthetic user activity can provide early warning signs of performance degradation during off-peak hours, allowing you to take corrective action before actual users are affected. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Design your application code to emit information about user activity. Use this information to understand how the application is used, to identify patterns of usage, and to determine when a response is required. Use synthetic user activity to provide insight into application performance during off-peak hours. 

 **Customer example** 

 AnyCompany Retail implements user activity telemetry at several layers in their application. The front-end telemetry tracks pointer and movement events while the backend microservices emit telemetry tracking events like adding an item to the user's cart and checking out. Together they provide observability into the user experience. AnyCompany Retail also uses synthetic user telemetry to catch problems when there are fewer users on the workload. 
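
 The backend side of an approach like this can be sketched as structured user activity events keyed by session. The event fields, action names, and helper functions below are hypothetical, and the events are printed rather than sent to a telemetry store.

```python
import json
import time


def user_event(session_id, action, **attributes):
    """Build one structured user activity event."""
    return {"timestamp": time.time(), "session_id": session_id, "action": action, **attributes}


def abandoned_sessions(events):
    """Return the sessions that added an item to the cart but never checked out."""
    added = {e["session_id"] for e in events if e["action"] == "add_to_cart"}
    checked_out = {e["session_id"] for e in events if e["action"] == "checkout"}
    return added - checked_out


# Hypothetical events from two user sessions.
events = [
    user_event("session-1", "add_to_cart", item="sku-100"),
    user_event("session-2", "add_to_cart", item="sku-200"),
    user_event("session-1", "checkout", total=19.99),
]

for event in events:
    print(json.dumps(event))  # one JSON log line per user action

print(sorted(abandoned_sessions(events)))
```

 Emitting each action as its own event keeps the instrumentation simple, and questions such as cart abandonment can be answered later from the stored events.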

 **Implementation steps** 

1.  Instrument your application to emit telemetry (metrics, events, logs, and traces) about user activity. Once instrumented, front-end components emit telemetry automatically as the user interacts with the user interface. Backend applications emit telemetry on user events and transactions. 

   1.  [Amazon CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) can provide insight into end user experience for front-end applications. 

   1.  You can use the [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) to instrument and capture telemetry from your applications. 

   1.  [Amazon Pinpoint](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html) can analyze user behavior through campaigns, providing insight on user engagement. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop helps you build an observability strategy for your workload. 

1.  Establish synthetic user activity to monitor your application. Synthetic user activity simulates user actions to validate that your application is working properly. 

   1.  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) can simulate user activity using canaries. 

 **Level of effort for the implementation plan:** High. It may take significant development effort to fully instrument your application to collect user activity telemetry. 
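
 The synthetic activity step can be pictured as a scripted user journey that runs on a schedule and records the outcome of each step. The following is a minimal, framework-independent sketch and not the CloudWatch Synthetics canary API. The step functions are hypothetical stand-ins for real requests against your application.

```python
import time


def run_synthetic_check(name, steps):
    """Run scripted user steps in order, stopping at the first failure."""
    results = []
    for step_name, action in steps:
        started = time.perf_counter()
        try:
            action()
            ok, error = True, None
        except Exception as exc:  # any failure marks the step as unhealthy
            ok, error = False, str(exc)
        results.append({
            "check": name,
            "step": step_name,
            "ok": ok,
            "error": error,
            "duration_ms": (time.perf_counter() - started) * 1000,
        })
        if not ok:
            break  # later steps depend on earlier ones, so stop here
    return results


def load_home_page():
    """Stand-in for requesting the home page; succeeds in this sketch."""


def checkout():
    """Stand-in for a checkout request that fails in this sketch."""
    raise RuntimeError("checkout returned HTTP 500")


def view_receipt():
    """Stand-in for viewing the receipt; not reached in this sketch."""


journey = [("load_home_page", load_home_page), ("checkout", checkout), ("view_receipt", view_receipt)]
for result in run_synthetic_check("purchase-journey", journey):
    print(result["step"], "ok" if result["ok"] else f"failed: {result['error']}")
```

 Running a check like this on a schedule produces a steady signal even when no real users are online, which is what makes off-peak issues visible.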

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - Application telemetry is required in order to build in user activity telemetry. 
+  [OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md) - Some user activity telemetry may also be considered workload telemetry. 

 **Related documents:** 
+ [ How to Monitor your Applications Effectively ](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/)

 **Related videos:** 
+ [AWS re:Invent 2020: Monitoring production services at Amazon ](https://www.youtube.com/watch?v=hnPcf_Czbvw)
+ [AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch RUM ](https://www.youtube.com/watch?v=NMaeujY9A9Y)
+ [ Testing and Monitoring APIs on AWS - AWS Online Tech Talks ](https://www.youtube.com/watch?v=VQM38CZyjFY)

 **Related examples:** 
+ [ Amazon CloudWatch RUM Web Client ](https://github.com/aws-observability/aws-rum-web)
+ [AWS Distro for OpenTelemetry ](https://aws-otel.github.io/)
+ [ Implementing Real User Monitoring of Amplify Application using Amazon CloudWatch RUM ](https://aws.amazon.com/blogs/mobile/implementing-real-user-monitoring-of-amplify-application-using-amazon-cloudwatch-rum/)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

 **Related services:** 
+ [ Amazon CloudWatch RUM ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html)
+ [ Amazon CloudWatch Synthetics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)
+ [ Amazon Pinpoint ](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html)

# OPS04-BP04 Implement dependency telemetry
<a name="ops_telemetry_dependency_telemetry"></a>

Design and configure your workload to emit information about the status of resources it depends on. These are resources that are external to your workload. Examples of external dependencies can include external databases, DNS, and network connectivity. Use this information to determine when a response is required and provide additional context on workload state.

 **Desired outcome:** 
+  Your workload emits telemetry about the status of external dependencies. 
+  You are notified when dependencies are unhealthy. 

 **Common anti-patterns:** 
+ Your users cannot reach your site. You are unable to determine if the reason is a DNS issue without manually performing a check to see if your DNS provider is working. 
+ Your shopping cart application is unable to complete transactions. You are unable to determine if it's a problem with your credit card processing provider without contacting them to verify. 

 **Benefits of establishing this best practice:** 
+  Monitoring external dependencies provides advance notice of issues. 
+  Awareness of the health of your dependencies assists in troubleshooting. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Work with stakeholders to identify the external dependencies that your workload relies on. External dependencies can include external databases, APIs, or network connectivity between your workload and resources in other environments. Develop a monitoring strategy to provide awareness of the health of dependencies and proactively alarm if the status changes. 

 **Customer example** 

 AnyCompany Retail’s ecommerce workload relies on a database located in another environment. Every night, data is populated in the database for use in the ecommerce platform. The network connectivity and database support are owned by other teams. The ecommerce team configured several canary alarms to alert them when the network connectivity drops, when the database is unreachable, and when the nightly job fails to complete. 

 **Implementation steps** 

1.  Identify external dependencies that your workload relies on. Implement telemetry to track the health or reachability of dependencies. 

   1.  AWS customers can use the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) to monitor the health of AWS services and receive notifications of health events. 

   1.  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) can be used to monitor APIs, URLs, and website contents. 

1.  Set up alerts to notify your organization when a dependency is unhealthy or unreachable. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop will help you build an observability strategy for your workload. 

1.  Identify contacts for dependencies in cases where the dependency is unhealthy. Document how to contact the dependency owner, service agreements, and escalation process. 

 **Level of effort for the implementation plan:** Medium. Implementing dependency telemetry may require building custom monitoring solutions. 
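
 Where a custom check is needed, a reachability probe is often the simplest starting point. The following sketch assumes the dependency is reachable over TCP and reports whether a connection succeeds and how long it takes. The function name and the result fields are illustrative assumptions.

```python
import socket
import time


def check_tcp_dependency(name, host, port, timeout=2.0):
    """Attempt a TCP connection and report reachability and latency."""
    started = time.perf_counter()
    try:
        # create_connection resolves the host name and connects in one call.
        with socket.create_connection((host, port), timeout=timeout):
            healthy, error = True, None
    except OSError as exc:  # covers refused connections, timeouts, and DNS failures
        healthy, error = False, str(exc)
    return {
        "dependency": name,
        "healthy": healthy,
        "latency_ms": round((time.perf_counter() - started) * 1000, 2),
        "error": error,
    }
```

 For example, `check_tcp_dependency("orders-db", "db.example.internal", 5432)` returns a dictionary that can be logged or published as a metric, where the host name and port are placeholders for your own dependency. A successful TCP connection only shows that the endpoint is reachable. A deeper check would also issue a representative request, such as a lightweight query.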

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - You may build dependency monitoring into your application telemetry. 

 **Related documents:** 
+ [ Monitor your private internal endpoints 24x7 using CloudWatch Synthetics ](https://aws.amazon.com/blogs/mt/monitor-your-private-endpoints-using-cloudwatch-synthetics/)

 **Related videos:** 
+ [AWS re:Invent 2018: Monitor All Your Things: Amazon CloudWatch in Action with BBC ](https://www.youtube.com/watch?v=uuBuc6OAcVY)
+ [AWS re:Invent 2022 - Developing an observability strategy ](https://www.youtube.com/watch?v=Ub3ATriFapQ)
+ [AWS re:Invent 2022 - Observability best practices at Amazon ](https://www.youtube.com/watch?v=zZPzXEBW4P8)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Well-Architected Labs - Dependency Monitoring ](https://www.wellarchitectedlabs.com/operational-excellence/100_labs/100_dependency_monitoring/)

 **Related services:** 
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+ [AWS Health](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html)

# OPS04-BP05 Implement transaction traceability
<a name="ops_telemetry_dist_trace"></a>

Implement your application code and configure your workload components to emit events that are triggered by single logical operations and that can be consolidated across the various boundaries of your workload. Generate maps to see how traces flow across your workload and services. Gain insight into the relationships between components, and identify and analyze issues. Use the collected information to determine when a response is required and to assist you in identifying the factors contributing to an issue. 

 **Desired outcome:** 
+  Collect transaction traces across your workload to gain insight into the relationship between components. 
+  Generate maps to gain a better understanding of how transactions and events flow across your workload. 

 **Common anti-patterns:** 
+  You have implemented a serverless microservices architecture spanning multiple accounts. Your customers are experiencing intermittent performance issues. You are unable to discover which function or component is responsible because you lack transaction traceability. 
+ There is a performance bottleneck in your workload. Because you lack transaction traceability, you are unable to see the relationship between your application components and identify the bottleneck.
+  The identifier used for traces is not globally unique, resulting in a tracing collision when analyzing workload behavior. 

 **Benefits of establishing this best practice:** 
+  Understanding the flow of transactions across your workload provides insight into the expected behavior of your workload transactions. 
+  You can see variations from expected behavior across your workload and you can respond if necessary. 
+  You can pinpoint transactions by their unique generated identifier, regardless of where they were generated. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Design your application and workload to emit information about the flow of transactions across system components. Each transaction record should include a globally unique transaction identifier, the transaction stage, the active component, and the time taken to complete the activity. Use this information to determine what is in progress, what is complete, and what the results of completed activities are. 

 **Customer example** 

 At AnyCompany Retail, all transactions have a globally unique UUID generated. This UUID is passed between microservices during transactions. The UUID is used to create transaction traces as users interact with the workload. A map of the workload topology is generated with the traces and is used to troubleshoot workload issues and improve performance. 
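
 The pattern of passing one identifier between microservices can be sketched as follows. The header name `X-Transaction-Id` and the two service functions are hypothetical, and the services are plain function calls rather than network requests.

```python
import uuid

TRACE_HEADER = "X-Transaction-Id"  # hypothetical header name


def ensure_transaction_id(headers):
    """Reuse the caller's transaction ID, or start a new transaction."""
    outgoing = dict(headers)
    outgoing.setdefault(TRACE_HEADER, str(uuid.uuid4()))
    return outgoing


def cart_service(headers, log):
    """First service in the chain: records its work, then calls the next service."""
    headers = ensure_transaction_id(headers)
    log.append(("cart", headers[TRACE_HEADER]))
    return payment_service(headers, log)  # the same headers travel downstream


def payment_service(headers, log):
    """Downstream service: reuses the ID it received instead of creating one."""
    headers = ensure_transaction_id(headers)
    log.append(("payment", headers[TRACE_HEADER]))
    return headers[TRACE_HEADER]


log = []
transaction_id = cart_service({}, log)
for service, recorded_id in log:
    print(service, recorded_id == transaction_id)
```

 Because every service reuses the identifier it receives and only generates one when none is present, all records for a transaction share a single ID no matter which service saw the request first.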

 **Implementation steps** 

1.  Instrument the applications in your workload to emit transaction traces. This can be done by generating a unique identifier for each transaction and passing the identifier between applications. 

   1.  You can use auto-instrumentation in the [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) to implement traces in your existing applications without modifying your application code. 

1.  Generate maps of your application topology. Use these maps to improve performance, gain insights, and aid in troubleshooting. 

   1.  [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) can generate maps of the applications in your workload. 

 **Level of effort for the implementation plan:** Medium. Implementing transaction traces may require moderate development effort. 
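
 To show how a topology map follows from trace data, the sketch below derives caller-to-callee edges from a list of spans. The span fields (`span_id`, `parent_id`, `service`) and the sample trace are assumptions for illustration and do not reflect the AWS X-Ray data model.

```python
from collections import defaultdict


def build_service_map(spans):
    """Derive (caller, callee) edges with call counts from trace spans."""
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(int)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only count calls that cross a service boundary.
        if parent and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return dict(edges)


# Hypothetical spans from a single transaction.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "b", "service": "inventory"},
    {"span_id": "d", "parent_id": "b", "service": "payment"},
    {"span_id": "e", "parent_id": "b", "service": "payment"},
]

for (caller, callee), calls in sorted(build_service_map(spans).items()):
    print(f"{caller} -> {callee}: {calls}")
```

 A tracing service builds this kind of map for you. The sketch only shows why each span needs a reference to its parent in addition to the shared transaction identifier.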

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - Application telemetry covers transaction traceability and handling, and needs to be implemented first. 

 **Related documents:** 
+ [ Discover application issues and get notifications with AWS X-Ray Insights ](https://aws.amazon.com/blogs/mt/discover-application-issues-get-notifications-aws-x-ray-insights/)
+ [ How Wealthfront utilizes AWS X-Ray to analyze and debug distributed applications ](https://aws.amazon.com/blogs/mt/wealthfront-utilizes-aws-x-ray-analyze-debug-distributed-applications/)
+ [ New for AWS Distro for OpenTelemetry – Tracing Support is Now Generally Available ](https://aws.amazon.com/blogs/aws/new-for-aws-distro-for-opentelemetry-tracing-support-is-now-generally-available/)

 **Related videos:** 
+ [AWS re:Invent 2018: Deep Dive into AWS X-Ray: Monitor Modern Applications (DEV324) ](https://www.youtube.com/watch?v=5MQkX57eTh8)
+ [AWS re:Invent 2022 - Building observable applications with OpenTelemetry (BOA310) ](https://www.youtube.com/watch?v=efk8XFJrW2c)
+ [AWS re:Invent 2022 - Observability the open-source way (COP301-R) ](https://www.youtube.com/watch?v=2IJPpdp9xU0)
+ [ Capturing Trace Data with the AWS Distro for OpenTelemetry ](https://www.youtube.com/watch?v=837NtV0McOA)
+ [ Optimize Application Performance with AWS X-Ray](https://www.youtube.com/watch?v=5lIdNrrO_o8)

 **Related examples:** 
+ [AWS X-Ray Multi API Gateway Tracing Example ](https://github.com/aws-samples/aws-xray-multi-api-gateway-tracing-example)

 **Related services:** 
+  [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) 
+  [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)