# Operate
<a name="a-operate"></a>

**Topics**
+ [OPS 8. How do you utilize workload observability in your organization?](ops-08.md)
+ [OPS 9. How do you understand the health of your operations?](ops-09.md)
+ [OPS 10. How do you manage workload and operations events?](ops-10.md)

# OPS 8. How do you utilize workload observability in your organization?
<a name="ops-08"></a>

Ensure optimal workload health by leveraging observability. Utilize relevant metrics, logs, and traces to gain a comprehensive view of your workload's performance and address issues efficiently.

**Topics**
+ [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md)
+ [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md)
+ [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md)
+ [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md)
+ [OPS08-BP05 Create dashboards](ops_workload_observability_create_dashboards.md)

# OPS08-BP01 Analyze workload metrics
<a name="ops_workload_observability_analyze_workload_metrics"></a>

 After implementing application telemetry, regularly analyze the collected metrics. While latency, requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to prioritize the review of business outcome metrics. This ensures you're making data-driven decisions aligned with your business objectives. 

 **Desired outcome:** Accurate insights into workload performance that drive data-informed decisions, ensuring alignment with business objectives. 

 **Common anti-patterns:** 
+  Analyzing metrics in isolation without considering their impact on business outcomes. 
+  Over-reliance on technical metrics while sidelining business metrics. 
+  Infrequent review of metrics, missing out on real-time decision-making opportunities. 

 **Benefits of establishing this best practice:** 
+  Enhanced understanding of the correlation between technical performance and business outcomes. 
+  Improved decision-making process informed by real-time data. 
+  Proactive identification and mitigation of issues before they affect business outcomes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Leverage tools like Amazon CloudWatch to perform metric analysis. AWS services such as CloudWatch anomaly detection and Amazon DevOps Guru can be used to detect anomalies, especially when static thresholds are unknown or when patterns of behavior are more suited for anomaly detection. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Analyze and review:** Regularly review and interpret your workload metrics. 

   1.  Prioritize business outcome metrics over purely technical metrics. 

   1.  Understand the significance of spikes, drops, or patterns in your data. 

1.  **Utilize Amazon CloudWatch:** Use Amazon CloudWatch for a centralized view and deep-dive analysis. 

   1.  Configure CloudWatch dashboards to visualize your metrics and compare them over time. 

   1.  Use [percentiles in CloudWatch](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/) to get a clear view of metric distribution, which can help in defining SLAs and understanding outliers. 

   1.  Set up [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to identify unusual patterns without relying on static thresholds. 

   1.  Implement [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html) to monitor and troubleshoot applications that span multiple accounts within a Region. 

   1.  Use [CloudWatch Metric Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html) to query and analyze metric data across accounts and Regions, identifying trends and anomalies. 

   1.  Apply [CloudWatch Metric Math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) to transform, aggregate, or perform calculations on your metrics for deeper insights. 

1.  **Employ Amazon DevOps Guru:** Incorporate [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning-enhanced anomaly detection to identify early signs of operational issues for your serverless applications and remediate them before they impact your customers. 

1.  **Optimize based on insights:** Make informed decisions based on your metric analysis to adjust and improve your workloads. 

 **Level of effort for the Implementation Plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 

 **Related documents:** 
+ [ The Wheel Blog - Emphasizing the importance of continually reviewing metrics ](https://aws.amazon.com/blogs/opensource/the-wheel/)
+ [ Percentile are important ](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/)
+ [ Using AWS Cost Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html)
+ [ CloudWatch cross-account observability ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html)
+ [ Query your metrics with CloudWatch Metrics Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html)

 **Related videos:** 
+ [ Enable Cross-Account Observability in Amazon CloudWatch ](https://www.youtube.com/watch?v=lUaDO9dqISc)
+ [ Introduction to Amazon DevOps Guru ](https://www.youtube.com/watch?v=2uA8q-8mTZY)
+ [ Continuously Analyze Metrics using AWS Cost Anomaly Detection](https://www.youtube.com/watch?v=IpQYBuay5OE)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Gaining operation insights with AIOps using Amazon DevOps Guru ](https://catalog.us-east-1.prod.workshops.aws/workshops/f92df379-6add-4101-8b4b-38b788e1222b/en-US)

# OPS08-BP02 Analyze workload logs
<a name="ops_workload_observability_analyze_workload_logs"></a>

 Regularly analyzing workload logs is essential for gaining a deeper understanding of the operational aspects of your application. By efficiently sifting through, visualizing, and interpreting log data, you can continually optimize application performance and security. 

 **Desired outcome:** Rich insights into application behavior and operations derived from thorough log analysis, ensuring proactive issue detection and mitigation. 

 **Common anti-patterns:** 
+ Neglecting the analysis of logs until a critical issue arises.
+ Not using the full suite of tools available for log analysis, missing out on critical insights.
+  Solely relying on manual review of logs without leveraging automation and querying capabilities. 

 **Benefits of establishing this best practice:** 
+ Proactive identification of operational bottlenecks, security threats, and other potential issues.
+ Efficient utilization of log data for continuous application optimization.
+  Enhanced understanding of application behavior, aiding in debugging and troubleshooting. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) is a powerful tool for log analysis. Integrated features like CloudWatch Logs Insights and Contributor Insights make the process of deriving meaningful information from logs intuitive and efficient. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Set up CloudWatch Logs:** Configure applications and services to send logs to CloudWatch Logs. 

1.  **Set up CloudWatch Logs Insights:** Use [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to interactively search and analyze your log data. 

   1.  Craft queries to extract patterns, visualize log data, and derive actionable insights. 

1.  **Leverage Contributor Insights:** Use [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to identify top talkers in high cardinality dimensions like IP addresses or user-agents. 

1.  **Implement CloudWatch Logs metric filters:** Configure [CloudWatch log metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) to convert log data into actionable metrics. This allows you to set alarms or further analyze patterns. 

1.  **Regular review and refinement:** Periodically review your log analysis strategies to capture all relevant information and continually optimize application performance. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 

 **Related documents:** 
+ [ Analyzing Log Data with CloudWatch Logs Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html)
+ [ Using CloudWatch Contributor Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html)
+ [ Creating and Managing CloudWatch Logs Log Metric Filters ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html)

 **Related videos:** 
+ [ Analyze Log Data with CloudWatch Logs Insights ](https://www.youtube.com/watch?v=2s2xcwm8QrM)
+ [ Use CloudWatch Contributor Insights to Analyze High-Cardinality Data ](https://www.youtube.com/watch?v=ErWRBLFkjGI)

 **Related examples:** 
+ [ CloudWatch Logs Sample Queries ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

# OPS08-BP03 Analyze workload traces
<a name="ops_workload_observability_analyze_workload_traces"></a>

 Analyzing trace data is crucial for achieving a comprehensive view of an application's operational journey. By visualizing and understanding the interactions between various components, performance can be fine-tuned, bottlenecks identified, and user experiences enhanced. 

 **Desired outcome:** Achieve clear visibility into your application's distributed operations, enabling quicker issue resolution and an enhanced user experience. 

 **Common anti-patterns:** 
+  Overlooking trace data, relying solely on logs and metrics. 
+  Not correlating trace data with associated logs. 
+  Ignoring the metrics derived from traces, such as latency and fault rates. 

 **Benefits of establishing this best practice:** 
+  Improve troubleshooting and reduce mean time to resolution (MTTR). 
+  Gain insights into dependencies and their impact. 
+  Swift identification and rectification of performance issues. 
+  Leveraging trace-derived metrics for informed decision-making. 
+  Improved user experiences through optimized component interactions. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) offers a comprehensive suite for trace data analysis, providing a holistic view of service interactions, monitoring user activities, and detecting performance issues. Features like ServiceLens, X-Ray Insights, X-Ray Analytics, and Amazon DevOps Guru enhance the depth of actionable insights derived from trace data. 

### Implementation steps
<a name="implementation-steps"></a>

 The following steps offer a structured approach to effectively implementing trace data analysis using AWS services: 

1.  **Integrate AWS X-Ray:** Ensure X-Ray is integrated with your applications to capture trace data. 

1.  **Analyze X-Ray metrics:** Delve into metrics derived from X-Ray traces such as latency, request rates, fault rates, and response time distributions using the [service map](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-servicemap.html#xray-console-servicemap-view) to monitor application health. 

1.  **Use ServiceLens:** Leverage the [ServiceLens map](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_service_map.html) for enhanced observability of your services and applications. This allows for integrated viewing of traces, metrics, logs, alarms, and other health information. 

1.  **Enable X-Ray Insights:** 

   1.  Turn on [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) for automated anomaly detection in traces. 

   1.  Examine insights to pinpoint patterns and ascertain root causes, such as increased fault rates or latencies. 

   1.  Consult the insights timeline for a chronological analysis of detected issues. 

1.  **Use X-Ray Analytics:** [X-Ray Analytics](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html) allows you to thoroughly explore trace data, pinpoint patterns, and extract insights. 

1.  **Use groups in X-Ray:** Create groups in X-Ray to filter traces based on criteria such as high latency, allowing for more targeted analysis. 

1.  **Incorporate Amazon DevOps Guru:** Engage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) to benefit from machine learning models pinpointing operational anomalies in traces. 

1.  **Use CloudWatch Synthetics:** Use [CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_tracing.html) to create canaries for continually monitoring your endpoints and workflows. These canaries can integrate with X-Ray to provide trace data for in-depth analysis of the applications being tested. 

1.  **Use Real User Monitoring (RUM):** With [AWS X-Ray and CloudWatch RUM](https://docs.aws.amazon.com/xray/latest/devguide/xray-services-RUM.html), you can analyze and debug the request path starting from end users of your application through downstream AWS managed services. This helps you identify latency trends and errors that impact your users. 

1.  **Correlate with logs:** Correlate [trace data with related logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_troubleshooting.html#servicelens_troubleshooting_Nologs) within the X-Ray trace view for a granular perspective on application behavior. This allows you to view log events directly associated with traced transactions. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 

 **Related documents:** 
+ [ Using ServiceLens to Monitor Application Health ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html)
+ [ Exploring Trace Data with X-Ray Analytics ](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html)
+ [ Detecting Anomalies in Traces with X-Ray Insights ](https://docs.aws.amazon.com/xray/latest/devguide/xray-insights.html)
+ [ Continuous Monitoring with CloudWatch Synthetics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)

 **Related videos:** 
+ [ Analyze and Debug Applications Using Amazon CloudWatch Synthetics and AWS X-Ray](https://www.youtube.com/watch?v=s2WvaV2eDO4)
+ [ Use AWS X-Ray Insights ](https://www.youtube.com/watch?v=tl8OWHl6jxw)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Implementing X-Ray with AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html)
+ [ CloudWatch Synthetics Canary Templates ](https://github.com/aws-samples/cloudwatch-synthetics-canary-terraform)

# OPS08-BP04 Create actionable alerts
<a name="ops_workload_observability_create_alerts"></a>

 Promptly detecting and responding to deviations in your application's behavior is crucial. Especially vital is recognizing when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ensures that the signals you receive are directly tied to business or operational impact. This approach to actionable alerts promotes proactive responses and helps maintain system performance and reliability. 

 **Desired outcome:** Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk. 

 **Common anti-patterns:** 
+  Setting up too many non-critical alerts, leading to alert fatigue. 
+  Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues. 
+  Neglecting to address root causes, leading to repetitive alerts for the same issue. 

 **Benefits of establishing this best practice:** 
+  Reduced alert fatigue by focusing on actionable and relevant alerts. 
+  Improved system uptime and reliability through proactive issue detection and mitigation. 
+  Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Determine key performance indicators (KPIs):** Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately. 

1.  **Implement anomaly detection:** 
   +  **Use CloudWatch anomaly detection:** Set up [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to automatically detect unusual patterns, ensuring alerts are only generated for genuine anomalies. 
   +  **Use X-Ray Insights:** 

     1.  Set up [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) to detect anomalies in trace data. 

     1.  Configure [notifications for X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications) to be alerted on detected issues. 
   +  **Integrate with DevOps Guru:** 

     1.  Leverage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning capabilities in detecting operational anomalies with existing data. 

     1.  Navigate to the [notification settings](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html#navigate-to-notification-settings) in DevOps Guru to set up anomaly alerts. 

1.  **Implement actionable alerts:** Design alerts that provide adequate information for immediate action. 

1.  **Reduce alarm fatigue:** Minimize non-critical alerts. Overwhelming teams with numerous insignificant alerts can lead to oversight of critical issues and diminish the overall effectiveness of the alerting mechanism. 

1.  **Set up composite alarms:** Use [Amazon CloudWatch composite alarms](https://aws.amazon.com/blogs/mt/improve-monitoring-efficiency-using-amazon-cloudwatch-composite-alarms-2/) to consolidate multiple alarms. 

1.  **Integrate with alerting tools:** Incorporate tools like [Ops Genie](https://www.atlassian.com/software/opsgenie) and [PagerDuty](https://www.pagerduty.com/). 

1.  **Engage Amazon Q Developer in chat applications** Integrate [Amazon Q Developer in chat applications](https://aws.amazon.com/chatbot/)to relay alerts to Chime, Microsoft Teams, and Slack. 

1.  **Alert based on logs:** Use [log metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) in CloudWatch to create alarms based on specific log events. 

1.  **Review and iterate:** Regularly revisit and refine alert configurations. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 

 **Related documents:** 
+ [ Using Amazon CloudWatch Alarms ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
+ [ Create a composite alarm ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html)
+ [ Create a CloudWatch alarm based on anomaly detection ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html)
+ [ DevOps Guru Notifications ](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html)
+ [ X-Ray Insights notifications ](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications)
+ [ OMonitor, operate, and troubleshoot your AWS resources with interactive ChatOps ](https://aws.amazon.com/chatbot/)
+ [ Amazon CloudWatch Integration Guide \$1 PagerDuty ](https://support.pagerduty.com/docs/amazon-cloudwatch-integration-guide)
+ [ Integrate OpsGenie with Amazon CloudWatch ](https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-amazon-cloudwatch/)

 **Related videos:** 
+ [ Create Composite Alarms in Amazon CloudWatch ](https://www.youtube.com/watch?v=0LMQ-Mu-ZCY)
+ [ Amazon Q Developer in chat applications Overview ](https://www.youtube.com/watch?v=0jUSEfHbTYk)
+ [AWS on Air ft. Mutative Commands in Amazon Q Developer in chat applications ](https://www.youtube.com/watch?v=u2pkw2vxrtk)

 **Related examples:** 
+ [ Alarms, incident management, and remediation in the cloud with Amazon CloudWatch ](https://aws.amazon.com/blogs/mt/alarms-incident-management-and-remediation-in-the-cloud-with-amazon-cloudwatch/)
+ [ Tutorial: Creating an Amazon EventBridge rule that sends notifications to Amazon Q Developer in chat applications ](https://docs.aws.amazon.com/chatbot/latest/adminguide/create-eventbridge-rule.html)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

# OPS08-BP05 Create dashboards
<a name="ops_workload_observability_create_dashboards"></a>

 Dashboards are the human-centric view into the telemetry data of your workloads. While they provide a vital visual interface, they should not replace alerting mechanisms, but complement them. When crafted with care, not only can they offer rapid insights into system health and performance, but they can also present stakeholders with real-time information on business outcomes and the impact of issues. 

 **Desired outcome:** Clear, actionable insights into system and business health using visual representations. 

 **Common anti-patterns:** 
+  Overcomplicating dashboards with too many metrics. 
+  Relying on dashboards without alerts for anomaly detection. 
+  Not updating dashboards as workloads evolve. 

 **Benefits of establishing this best practice:** 
+  Immediate visibility into critical system metrics and KPIs. 
+  Enhanced stakeholder communication and understanding. 
+  Rapid insight into the impact of operational issues. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 **Business-centric dashboards** 

 Dashboards tailored to business KPIs engage a wider array of stakeholders. While these individuals might not be interested in system metrics, they are keen on understanding the business implications of these numbers. A business-centric dashboard ensures that all technical and operational metrics being monitored and analyzed are in sync with overarching business goals. This alignment provides clarity, ensuring everyone is on the same page regarding what's essential and what's not. Additionally, dashboards that highlight business KPIs tend to be more actionable. Stakeholders can quickly understand the health of operations, areas that need attention, and the potential impact on business outcomes. 

 With this in mind, when creating your dashboards, ensure that there's a balance between technical metrics and business KPIs. Both are vital, but they cater to different audiences. Ideally, you should have dashboards that provide a holistic view of the system's health and performance while also emphasizing key business outcomes and their implications. 

 Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different AWS Regions and accounts. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Create a basic dashboard:** [Create a new dashboard in CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html), giving it a descriptive name. 

1.  **Use Markdown widgets:** Before diving into metrics, use [Markdown widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_text_dashboard.html) to add textual context at the top of your dashboard. This should explain what the dashboard covers, the significance of the represented metrics, and can also contain links to other dashboards and troubleshooting tools. 

1.  **Create dashboard variables:** [Incorporate dashboard variables](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_dashboard_variables.html) where appropriate to allow for dynamic and flexible dashboard views. 

1.  **Create metrics widgets:** [Add metric widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-and-work-with-widgets.html) to visualize various metrics your application emits, tailoring these widgets to effectively represent system health and business outcomes. 

1.  **Log Insights queries:** Utilize [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_ExportQueryResults.html) to derive actionable metrics from your logs and display these insights on your dashboard. 

1.  **Set up alarms:** Integrate [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_alarm_dashboard.html) into your dashboard for a quick view of any metrics breaching their thresholds. 

1.  **Use Contributor Insights:** Incorporate [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights-ViewReports.html) to analyze high-cardinality fields and get a clearer understanding of your resource's top contributors. 

1.  **Design custom widgets:** For specific needs not met by standard widgets, consider creating [custom widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_custom_widget_dashboard.html). These can pull from various data sources or represent data in unique ways. 

1.  **Iterate and refine:** As your application evolves, regularly revisit your dashboard to ensure its relevance. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 

 **Related documents:** 
+ [ Building Dashboards for Operational Visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ Using Amazon CloudWatch Dashboards ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html)

 **Related videos:** 
+ [ Create Cross Account & Cross Region CloudWatch Dashboards ](https://www.youtube.com/watch?v=eIUZdaqColg)
+ [AWS re:Invent 2021 - Gain enterprise visibility with AWS Cloud operation dashboards ](https://www.youtube.com/watch?v=NfMpYiGwPGo)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Application Monitoring with Amazon CloudWatch ](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch/)

# OPS 9. How do you understand the health of your operations?
<a name="ops-09"></a>

 Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action. 

**Topics**
+ [OPS09-BP01 Measure operations goals and KPIs with metrics](ops_operations_health_measure_ops_goals_kpis.md)
+ [OPS09-BP02 Communicate status and trends to ensure visibility into operation](ops_operations_health_communicate_status_trends.md)
+ [OPS09-BP03 Review operations metrics and prioritize improvement](ops_operations_health_review_ops_metrics_prioritize_improvement.md)

# OPS09-BP01 Measure operations goals and KPIs with metrics
<a name="ops_operations_health_measure_ops_goals_kpis"></a>

 Obtain goals and KPIs that define operations success from your organization and determine that metrics reflect these. Set baselines as a point of reference and reevaluate regularly. Develop mechanisms to collect these metrics from teams for evaluation. 

 **Desired outcome:** 
+  The goals and KPIs for the organization's operations teams have been published and shared. 
+  Metrics that reflect these KPIs are established. Examples may include: 
  +  Ticket queue depth or average age of ticket 
  +  Ticket count grouped by type of issue 
  +  Time spent working issues with or without a standardized operating procedure (SOP) 
  +  Amount of time spent recovering from a failed code push 
  +  Call volume 

 **Common anti-patterns:** 
+  Deployment deadlines are missed because developers are pulled away to perform troubleshooting tasks. Development teams argue for more personnel, but cannot quantify how many they need because the time taken away cannot be measured. 
+  A Tier 1 desk was set up to handle user calls. Over time, more workloads were added, but no headcount was allocated to the Tier 1 desk. Customer satisfaction suffers as call times increase and issues go longer without resolution, but management sees no indicators of such, preventing any action. 
+  A problematic workload has been handed off to a separate operations team for upkeep. Unlike other workloads, this new one was not supplied with proper documentation and runbooks. As such, teams spend longer troubleshooting and addressing failures. However, there are no metrics documenting this, which makes accountability difficult. 

 **Benefits of establishing this best practice:** Where workload monitoring shows the state of our applications and services, monitoring operations teams provide owners gain insight into changes among the consumers of those workloads, such as shifting business needs. Measure the effectiveness of these teams and evaluate them against business goals by creating metrics that can reflect the state of operations. Metrics can highlight support issues or identify when drifts occur away from a service level target. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Schedule time with business leaders and stakeholders to determine the overall goals of the service. Determine what the tasks of various operations teams should be and what challenges they could be approached with. Using these, brainstorm key performance indicators (KPIs) that might reflect these operations goals. These might be customer satisfaction, time from feature conception to deployment, average issue resolution time, and others. 

 Working from KPIs, identify the metrics and sources of data that might reflect these goals best. Customer satisfaction may be an combination of various metrics such as call wait or response times, satisfaction scores, and types of issues raised. Deployment times may be the sum of time needed for testing and deployment, plus any post-deployment fixes that needed to be added. Statistics showing the time spent on different types of issues (or the counts of those issues) can provide a window into where targeted effort is needed. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Quick - Using KPIs ](https://docs.aws.amazon.com/quicksight/latest/user/kpi.html)
+ [ Amazon CloudWatch - Using Metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)
+ [ Building Dashboards ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ How to track your cost optimization KPIs with KPI Dashboard ](https://aws.amazon.com/blogs/aws-cloud-financial-management/how-to-track-your-cost-optimization-kpis-with-the-kpi-dashboard/)

# OPS09-BP02 Communicate status and trends to ensure visibility into operation
<a name="ops_operations_health_communicate_status_trends"></a>

 Knowing the state of your operations and its trending direction is necessary to identify when outcomes may be at risk, whether or not added work can be supported, or the effects that changes have had to your teams. During operations events, having status pages that users and operations teams can refer to for information can reduce pressure on communication channels and disseminate information proactively. 

 **Desired outcome:** 
+  Operations leaders have insight at a glance to see what sort of call volumes their teams are operating under and what efforts may be under way, such as deployments. 
+  Alerts are disseminated to stakeholders and user communities when impacts to normal operations occur. 
+  Organization leadership and stakeholders can check a status page in response to an alert or impact, and obtain information surrounding an operational event, such as points of contact, ticket information, and estimated recovery times. 
+  Reports are made available to leadership and other stakeholders to show operations statistics such as call volumes over a period of time, user satisfaction scores, numbers of outstanding tickets and their ages. 

 **Common anti-patterns:** 
+  A workload goes down, leaving a service unavailable. Call volumes spike as users request to know what's going on. Managers add to the volume requesting to know who's working an issue. Various operations teams duplicate efforts in trying to investigate. 
+  A desire for a new capability leads to several personnel being reassigned to an engineering effort. No backfill is provided, and issue resolution times spike. This information is not captured, and only after several weeks and dissatisfied user feedback does leadership become aware of the issue. 

 **Benefits of establishing this best practice:** During operational events where the business is impacted, much time and energy can be wasted querying information from various teams attempting to understand the situation. By establishing widely-disseminated status pages and dashboards, stakeholders can quickly obtain information such as whether or not an issue was detected, who has lead on the issue, or when a return to normal operations may be expected. This frees team members from spending too much time communicating status to others and more time addressing issues. 

 In addition, dashboards and reports can provide insights to decision-makers and stakeholders to see how operations teams are able to respond to business needs and how their resources are being allocated. This is crucial for determining if adequate resources are in place to support the business. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Build dashboards that show the current key metrics for your ops teams, and make them readily accessible both to operations leaders and management. 

 Build status pages that can be updated quickly to show when an incident or event is unfolding, who has ownership and who is coordinating the response. Share any steps or workarounds that users should consider on this page, and disseminate the location widely. Encourage users to check this location first when confronted with an unknown issue. 

 Collect and provide reports that show the health of operations over time, and distribute this to leaders and decision makers to illustrate the work of operations along with challenges and needs. 

 Share between teams these metrics and reports that best reflect goals and KPIs and where they have been influential in driving change. Dedicate time to these activities to elevate the importance of operations inside of and between teams. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Measure Progress ](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cloud-operating-model/measure-progress.html)
+ [ Building dashboards for operation visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)

 **Related solutions:** 
+ [ Data Operations ](https://aws.amazon.com/solutions/app-development/data-operations)

# OPS09-BP03 Review operations metrics and prioritize improvement
<a name="ops_operations_health_review_ops_metrics_prioritize_improvement"></a>

 Setting aside dedicated time and resources for reviewing the state of operations ensures that serving the day-to-day line of business remains a priority. Pull together operations leaders and stakeholders to regularly review metrics, reaffirm or modify goals and objectives, and prioritize improvements. 

 **Desired outcome:** 
+  Operations leaders and staff regularly meet to review metrics over a given reporting period. Challenges are communicated, wins are celebrated, and lessons learned are shared. 
+  Stakeholders and business leaders are regularly briefed on the state of operations and solicited for input regarding goals, KPIs, and future initiatives. Tradeoffs between service delivery, operations, and maintenance are discussed and placed into context. 

 **Common anti-patterns:** 
+  A new product is launched, but the Tier 1 and Tier 2 operations teams are not adequately trained to support or given additional staff. Metrics that show the decrease in ticket resolution times and increase in incident volumes are not seen by leaders. Action is taken weeks later when subscription numbers start to fall as discontent users move off the platform. 
+  A manual process for performing maintenance on a workload has been in place for a long time. While a desire to automate has been present, this was a low priority given the low importance of the system. Over time however, the system has grown in importance and now these manual processes consume a majority of operations' time. No resources are scheduled for providing increased tooling to operations, leading to staff burnout as workloads increase. Leadership becomes aware once it's reported that staff are leaving for other competitors. 

 **Benefits of establishing this best practice:** In some organizations, it can become a challenge to allocate the same time and attention that is afforded to service delivery and new products or offerings. When this occurs, the line of business can suffer as the level of service expected slowly deteriorates. This is because operations does not change and evolve with the growing business, and can soon be left behind. Without regular review into the insights operations collects, the risk to the business may become visible only when it's too late. By allocating time to review metrics and procedures both among the operations staff and with leadership, the crucial role operations plays remains visible, and risks can be identified long before they reach critical levels. Operations teams get better insight into impending business changes and initiatives, allowing for proactive efforts to be undertaken. Leadership visibility into operations metrics showcases the role that these teams play in customer satisfaction, both internal and external, and let them better weigh choices for priorities, or ensure that operations has the time and resources to change and evolve with new business and workload initiatives. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Dedicate time to review operations metrics between stakeholders and operations teams and review report data. Place these reports in the contexts of the organizations goals and objectives to determine if they're being met. Identify sources of ambiguity where goals are not clear, or where there may be conflicts between what is asked for and what is given. 

 Identify where time, people, and tools can aid in operations outcomes. Determine which KPIs this would impact and what targets for success should be. Revisit regularly to ensure operations is resourced sufficiently to support the line of business. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Amazon Athena ](https://aws.amazon.com/athena/)
+ [ Amazon CloudWatch metrics and dimensions reference ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html)
+ [ Amazon Quick ](https://aws.amazon.com/quicksight/)
+ [AWS Glue](https://aws.amazon.com/glue/)
+ [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)
+ [ Collect metrics and logs from Amazon EC2 instances and on-premises servers with the Amazon CloudWatch Agent ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html)
+ [ Using Amazon CloudWatch metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)

# OPS 10. How do you manage workload and operations events?
<a name="ops-10"></a>

 Prepare and validate procedures for responding to events to minimize their disruption to your workload. 

**Topics**
+ [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md)
+ [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md)
+ [OPS10-BP03 Prioritize operational events based on business impact](ops_event_response_prioritize_events.md)
+ [OPS10-BP04 Define escalation paths](ops_event_response_define_escalation_paths.md)
+ [OPS10-BP05 Define a customer communication plan for outages](ops_event_response_push_notify.md)
+ [OPS10-BP06 Communicate status through dashboards](ops_event_response_dashboards.md)
+ [OPS10-BP07 Automate responses to events](ops_event_response_auto_event_response.md)

# OPS10-BP01 Use a process for event, incident, and problem management
<a name="ops_event_response_event_incident_problem_process"></a>

Your organization has processes to handle events, incidents, and problems. *Events* are things that occur in your workload but may not need intervention. *Incidents* are events that require intervention. *Problems* are recurring events that require intervention or cannot be resolved. You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately.

When incidents and problems happen to your workload, you need processes to handle them. How will you communicate the status of the event with stakeholders? Who oversees leading the response? What are the tools that you use to mitigate the event? These are examples of some of the questions you need answer to have a solid response process. 

Processes must be documented in a central location and available to anyone involved in your workload. If you don’t have a central wiki or document store, a version control repository can be used. You’ll keep these plans up to date as your processes evolve. 

Problems are candidates for automation. These events take time away from your ability to innovate. Start with building a repeatable process to mitigate the problem. Over time, focus on automating the mitigation or fixing the underlying issue. This frees up time to devote to making improvements in your workload. 

**Desired outcome:** Your organization has a process to handle events, incidents, and problems. These processes are documented and stored in a central location. They are updated as processes change. 

**Common anti-patterns:** 
+  An incident happens on the weekend and the on-call engineer doesn’t know what to do. 
+  A customer sends you an email that the application is down. You reboot the server to fix it. This happens frequently. 
+  There is an incident with multiple teams working independently to try to solve it. 
+  Deployments happen in your workload without being recorded. 

 **Benefits of establishing this best practice:** 
+  You have an audit trail of events in your workload. 
+  Your time to recover from an incident is decreased. 
+  Team members can resolve incidents and problems in a consistent manner. 
+  There is a more consolidated effort when investigating an incident. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed. 

 **Customer example** 

AnyCompany Retail has a portion of their internal wiki devoted to processes for event, incident, and problem management. All events are sent to [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html). Problems are identified as OpsItems in [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) and prioritized to fix, reducing undifferentiated labor. As processes change, they’re updated in their internal wiki. They use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to manage incidents and coordinate mitigation efforts. 

## Implementation steps
<a name="implementation-steps"></a>

1.  Events 
   +  Track events that happen in your workload, even if no human intervention is required. 
   +  Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching. 
   +  You can use services like [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) or [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) to generate custom events for tracking. 

1.  Incidents 
   +  Start by defining the communication plan for incidents. What stakeholders must be informed? How will you keep them in the loop? Who oversees coordinating efforts? We recommend standing up an internal chat channel for communication and coordination. 
   +  Define escalation paths for the teams that support your workload, especially if the team doesn’t have an on-call rotation. Based on your support level, you can also file a case with Support. 
   +  Create a playbook to investigate the incident. This should include the communication plan and detailed investigation steps. Include checking the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) in your investigation. 
   +  Document your incident response plan. Communicate the incident management plan so internal and external customers understand the rules of engagement and what is expected of them. Train your team members on how to use it. 
   +  Customers can use [Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to set up and manage their incident response plan. 
   +  Enterprise Support customers can request the [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement. 

1.  Problems 
   +  Problems must be identified and tracked in your ITSM system. 
   +  Identify all known problems and prioritize them by effort to fix and impact to workload.   
![\[Action priority matrix for prioritizing problems.\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/impact-effort-chart.png)
   +  Solve problems that are high impact and low effort first. Once those are solved, move on to problems to that fall into the low impact low effort quadrant. 
   +  You can use [Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) to identify these problems, attach runbooks to them, and track them. 

**Level of effort for the implementation plan:** Medium. You need both a process and tools to implement this best practice. Document your processes and make them available to anyone associated with the workload. Update them frequently. You have a process for managing problems and mitigating them or fixing them. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Known problems need an associated runbook so that mitigation efforts are consistent.
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Incidents must be investigated using playbooks. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Always conduct a postmortem after you recover from an incident. 

 **Related documents:** 
+  [Atlassian - Incident management in the age of DevOps](https://www.atlassian.com/incident-management/devops) 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+  [Incident Management in the Age of DevOps and SRE](https://www.infoq.com/presentations/incident-management-devops-sre/) 
+  [PagerDuty - What is Incident Management?](https://www.pagerduty.com/resources/learn/what-is-incident-management/) 

 **Related videos:** 
+  [AWS re:Invent 2020: Incident management in a distributed organization](https://www.youtube.com/watch?v=tyS1YDhMVos) 
+  [AWS re:Invent 2021 - Building next-gen applications with event-driven architectures](https://www.youtube.com/watch?v=U5GZNt0iMZY) 
+  [AWS Supports You \$1 Exploring the Incident Management Tabletop Exercise](https://www.youtube.com/watch?v=0m8sGDx-pRM) 
+  [AWS Systems Manager Incident Manager - AWS Virtual Workshops](https://www.youtube.com/watch?v=KNOc0DxuBSY) 
+  [AWS What's Next ft. Incident Manager \$1 AWS Events](https://www.youtube.com/watch?v=uZL-z7cII3k) 

 **Related examples:** 
+  [AWS Management and Governance Tools Workshop - OpsCenter](https://mng.workshop.aws/ssm/capability_hands-on_labs/opscenter.html) 
+  [AWS Proactive Services – Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [Building an event-driven application with Amazon EventBridge](https://aws.amazon.com/blogs/compute/building-an-event-driven-application-with-amazon-eventbridge/) 
+  [Building event-driven architectures on AWS](https://catalog.us-east-1.prod.workshops.aws/workshops/63320e83-6abc-493d-83d8-f822584fb3cb/en-US/) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 
+  [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 

# OPS10-BP02 Have a process per alert
<a name="ops_event_response_process_per_alert"></a>

 Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert. This ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications. 

 **Common anti-patterns:** 
+  Your monitoring system presents you a stream of approved connections along with other messages. The volume of messages is so large that you miss periodic error messages that require your intervention. 
+  You receive an alert that the website is down. There is no defined process for when this happens. You are forced to take an ad hoc approach to diagnose and resolve the issue. Developing this process as you go extends the time to recovery. 

 **Benefits of establishing this best practice:** By alerting only when action is required, you prevent low value alerts from concealing high value alerts. By having a process for every actionable alert, you create a consistent and prompt response to events in your environment. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion. Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes. By having these processes, you ensure effective and prompt responses to operations events and you can prevent actionable events from being obscured by less valuable notifications. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable to ensure that the automatic scaling rules and limits are appropriate for workload needs. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

# OPS10-BP03 Prioritize operational events based on business impact
<a name="ops_event_response_prioritize_events"></a>

 Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, or damage to reputation or trust. 

 **Common anti-patterns:** 
+  You receive a support request to add a printer configuration for a user. While working on the issue, you receive a support request stating that your retail site is down. After completing the printer configuration for your user, you start work on the website issue. 
+  You get notified that both your retail website and your payroll system are down. You don't know which one should get priority. 

 **Benefits of establishing this best practice:** Prioritizing responses to the incidents with the greatest impact on the business notifies your management of that impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust. 

# OPS10-BP04 Define escalation paths
<a name="ops_event_response_define_escalation_paths"></a>

 Define escalation paths in your runbooks and playbooks, including what initiates escalation, and procedures for escalation. Specifically identify owners for each action to ensure effective and prompt responses to operations events. 

 Identify when a human decision is required before an action is taken. Work with decision makers to have that decision made in advance, and the action preapproved, so that MTTR is not extended waiting for a response. 

 **Common anti-patterns:** 
+  Your retail site is down. You don't understand the runbook for recovering the site. You start calling colleagues hoping that someone will be able to help you. 
+  You receive a support case for an unreachable application. You don't have permissions to administer the system. You don't know who does. You attempt to contact the system owner that opened the case and there is no response. You have no contacts for the system and your colleagues are not familiar with it. 

 **Benefits of establishing this best practice:** By defining escalations, what initiates the escalation, and procedures for escalation you provide the systematic addition of resources to an incident at an appropriate rate for the impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define escalation paths: Define escalation paths in your runbooks and playbooks, including what starts escalation, and procedures for escalation. For example, escalation of an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed. Another example of an appropriate escalation path is from senior support engineers to the development team for a workload when the playbooks are unable to identify a path to remediation, or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties. For example, a network connectivity provider or a software vendor. Escalations can include identified authorized decision makers for impacted systems. 

# OPS10-BP05 Define a customer communication plan for outages
<a name="ops_event_response_push_notify"></a>

 Define and test a communication plan for system outages that you can rely on to keep your customers and stakeholders informed during outages. Communicate directly with your users both when the services they use are impacted and when services return to normal. 

 **Desired outcome:** 
+  You have a communication plan for situations ranging from scheduled maintenance to large unexpected failures, including invocation of disaster recovery plans. 
+  In your communications, you provide clear and transparent information about systems issues to help customers avoid second guessing the performance of their systems. 
+  You use custom error messages and status pages to reduce the spike in help desk requests and keep users informed. 
+  The communication plan is regularly tested to verify that it will perform as intended when a real outage occurs. 

 **Common anti-patterns:** 
+ A workload outage occurs but you have no communication plan. Users overwhelm your trouble ticket system with requests because they have no information on the outage.
+ You send an email notification to your users during an outage. It doesn’t contain a timeline for restoration of service so users cannot plan around the outage.
+ There is a communication plan for outages but it has never been tested. An outage occurs and the communication plan fails because a critical step was missed that could have been caught in testing.
+  During an outage, you send a notification to users with too many technical details and information under your AWS NDA. 

 **Benefits of establishing this best practice:** 
+  Maintaining communication during outages ensures that customers are provided with visibility of progress on issues and estimated time to resolution. 
+  Developing a well-defined communications plan verifies that your customers and end users are well informed so they can take required additional steps to mitigate the impact of outages. 
+  With proper communications and increased awareness of planned and unplanned outages, you can improve customer satisfaction, limit unintended reactions, and drive customer retention. 
+  Timely and transparent system outage communication builds confidence and establishes trust needed to maintain relationships between you and your customers. 
+  A proven communication strategy during an outage or crisis reduces speculation and gossip that could hinder your ability to recover. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Communication plans that keep your customers informed during outages are holistic and cover multiple interfaces including customer facing error pages, custom API error messages, system status banners, and health status pages. If your system includes registered users, you can communicate over messaging channels such as email, SMS or push notifications to send personalized message content to your customers. 

 **Customer communication tools** 

 As a first line of defense, web and mobile applications should provide friendly and informative error messages during an outage as well as have the ability to redirect traffic to a status page. [Amazon CloudFront](https://aws.amazon.com/cloudfront/) is a fully managed content delivery network (CDN) that includes capabilities to define and serve custom error content. Custom error pages in CloudFront are a good first layer of customer messaging for component level outages. CloudFront can also simplify managing and activating a status page to intercept all requests during planned or unplanned outages. 

 Custom API error messages can help detect and reduce impact when outages are isolated to discrete services. [Amazon API Gateway](https://aws.amazon.com/api-gateway/) allows you to configure custom responses for your REST APIs. This allows you to provide clear and meaningful messaging to API consumers when API Gateway is not able to reach backend services. Custom messages can also be used to support outage banner content and notifications when a particular system feature is degraded due to service tier outages. 

 Direct messaging is the most personalized type of customer messaging. [Amazon Pinpoint](https://aws.amazon.com/pinpoint/) is a managed service for scalable multichannel communications. Amazon Pinpoint allows you to build campaigns that can broadcast messages widely across your impacted customer base over SMS, email, voice, push notifications, or custom channels you define. When you manage messaging with Amazon Pinpoint, message campaigns are well defined, testable, and can be intelligently applied to targeted customer segments. Once established, campaigns can be scheduled or started by events and they can easily be tested. 

 **Customer example** 

 When the workload is impaired, AnyCompany Retail sends out an email notification to their users. The email describes what business functionality is impaired and provides a realistic estimate of when service will be restored. In addition, they have a status page that shows real-time information about the health of their workload. The communication plan is tested in a development environment twice per year to validate that it is effective. 

 **Implementation steps** 

1.  Determine the communication channels for your messaging strategy. Consider the architectural aspects of your application and determine the best strategy for delivering feedback to your customers. This could include one or more of the guidance strategies outlined including error and status pages, custom API error responses, or direct messaging. 

1.  Design status pages for your application. If you’ve determined that status or custom error pages are suitable for your customers, you’ll need to design your content and messaging for those pages. Error pages explain to users why an application is not available, when it may become available again, and what they can do in the meantime. If your application uses Amazon CloudFront you can serve [custom error responses](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/GeneratingCustomErrorResponses.html) or use Lambda at Edge to [translate errors](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-examples.html#lambda-examples-update-error-status-examples) and rewrite page content. CloudFront also makes it possible to swap destinations from your application content to a static [Amazon S3](https://aws.amazon.com/s3/) content origin containing your maintenance or outage status page . 

1.  Design the correct set of API error statuses for your service. Error messages produced by API Gateway when it can’t reach backend services, as well as service tier exceptions, may not contain friendly messages suitable for display to end users. Without having to make code changes to your backend services, you can configure API Gateway [custom error responses](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html) to map HTTP response codes to curated API error messages. 

1.  Design messaging from a business perspective so that it is relevant to end users for your system and does not contain technical details. Consider your audience and align your messaging. For example, you may steer internal users towards a workaround or manual process that leverages alternate systems. External users may be asked to wait until the system is restored, or subscribe to updates to receive a notification once the system is restored. Define approved messaging for multiple scenarios including unexpected outages, planned maintenance, and partial system failures where a particular feature may be degraded or unavailable. 

1.  Templatize and automate your customer messaging. Once you have established your message content, you can use [Amazon Pinpoint](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html) or other tools to automate your messaging campaign. With Amazon Pinpoint you can create customer target segments for specific affected users and transform messages into templates. Review the [Amazon Pinpoint tutorial](https://docs.aws.amazon.com/pinpoint/latest/developerguide/tutorials.html) to get an understanding of how-to setup a messaging campaign. 

1.  Avoiding tightly coupling messaging capabilities to your customer facing system. Your messaging strategy should not have hard dependencies on system data stores or services to verify that you can successfully send messages when you experience outages. Consider building the ability to send messages from more than [one Availability Zone or Region](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_fault_isolation_multiaz_region_system.html) for messaging availability. If you are using AWS services to send messages, leverage data plane operations over [control plane operation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_avoid_control_plane.html) to invoke your messaging. 

 **Level of effort for the implementation plan:** High. Developing a communication plan, and the mechanisms to send it, can require a significant effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md) - Your communication plan should have a runbook associated with it so that your personnel know how to respond. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) - After an outage, conduct post-incident analysis to identify mechanisms to prevent another outage. 

 **Related documents:** 
+ [ Error Handling Patterns in Amazon API Gateway and AWS Lambda](https://aws.amazon.com/blogs/compute/error-handling-patterns-in-amazon-api-gateway-and-aws-lambda/)
+ [ Amazon API Gateway responses ](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html#supported-gateway-response-types)

 **Related examples:** 
+ [AWS Health Dashboard ](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/)
+ [ Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region ](https://aws.amazon.com/message/12721/)

 **Related services:** 
+ [AWS Support](https://aws.amazon.com/premiumsupport/)
+ [AWS Customer Agreement ](https://aws.amazon.com/agreement/)
+ [ Amazon CloudFront ](https://aws.amazon.com/cloudfront/)
+ [ Amazon API Gateway ](https://aws.amazon.com/api-gateway/)
+ [ Amazon Pinpoint ](https://aws.amazon.com/pinpoint/)
+ [ Amazon S3 ](https://aws.amazon.com/s3/)

# OPS10-BP06 Communicate status through dashboards
<a name="ops_event_response_dashboards"></a>

 Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. 

 You can create dashboards using [Amazon CloudWatch Dashboards](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) on customizable home pages in the CloudWatch console. Using business intelligence services such as [Quick](https://aws.amazon.com/quicksight/) you can create and publish interactive dashboards of your workload and operational health (for example, order rates, connected users, and transaction times). Create Dashboards that present system and business-level views of your metrics. 

 **Common anti-patterns:** 
+  Upon request, you run a report on the current utilization of your application for management. 
+  During an incident, you are contacted every twenty minutes by a concerned system owner wanting to know if it is fixed yet. 

 **Benefits of establishing this best practice:** By creating dashboards, you create self-service access to information helping your customers to informed themselves and determine if they need to take action. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team. Examples include Amazon CloudWatch dashboards, and AWS Health Dashboard. 
  +  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

# OPS10-BP07 Automate responses to events
<a name="ops_event_response_auto_event_response"></a>

 Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 

 There are multiple ways to automate runbook and playbook actions on AWS. To respond to an event from a state change in your AWS resources, or from your own custom events, you should create [CloudWatch Events rules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) to initiate responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation). 

 To respond to a metric that crosses a threshold for a resource (for example, wait time), you should create [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) to perform one or more actions using Amazon EC2 actions, Auto Scaling actions, or to send a notification to an Amazon SNS topic. If you need to perform custom actions in response to an alarm, invoke Lambda through an Amazon SNS notification. Use Amazon SNS to publish event notifications and escalation messages to keep people informed. 

 AWS also supports third-party systems through the AWS service APIs and SDKs. There are a number of monitoring tools provided by AWS Partners and third parties that allow for monitoring, notifications, and responses. Some of these tools include New Relic, Splunk, Loggly, SumoLogic, and Datadog. 

 You should keep critical manual procedures available for use when automated procedures fail 

 **Common anti-patterns:** 
+  A developer checks in their code. This event could have been used to start a build and then perform testing but instead nothing happens. 
+  Your application logs a specific error before it stops working. The procedure to restart the application is well understood and could be scripted. You could use the log event to invoke a script and restart the application. Instead, when the error happens at 3am Sunday morning, you are woken up as the on-call resource responsible to fix the system. 

 **Benefits of establishing this best practice:** By using automated responses to events, you reduce the time to respond and limit the introduction of errors from manual activities. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
  +  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
  +  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 
+  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
+  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

 **Related examples:**