

# Operate
Operate

**Topics**
+ [

# OPS 8. How do you utilize workload observability in your organization?
](ops-08.md)
+ [

# OPS 9. How do you understand the health of your operations?
](ops-09.md)
+ [

# OPS 10. How do you manage workload and operations events?
](ops-10.md)

# OPS 8. How do you utilize workload observability in your organization?


Ensure optimal workload health by leveraging observability. Utilize relevant metrics, logs, and traces to gain a comprehensive view of your workload's performance and address issues efficiently.

**Topics**
+ [

# OPS08-BP01 Analyze workload metrics
](ops_workload_observability_analyze_workload_metrics.md)
+ [

# OPS08-BP02 Analyze workload logs
](ops_workload_observability_analyze_workload_logs.md)
+ [

# OPS08-BP03 Analyze workload traces
](ops_workload_observability_analyze_workload_traces.md)
+ [

# OPS08-BP04 Create actionable alerts
](ops_workload_observability_create_alerts.md)
+ [

# OPS08-BP05 Create dashboards
](ops_workload_observability_create_dashboards.md)

# OPS08-BP01 Analyze workload metrics
OPS08-BP01 Analyze workload metrics

 After implementing application telemetry, regularly analyze the collected metrics. While latency, requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to prioritize the review of business outcome metrics. This ensures you're making data-driven decisions aligned with your business objectives. 

 **Desired outcome:** Accurate insights into workload performance that drive data-informed decisions, ensuring alignment with business objectives. 

 **Common anti-patterns:** 
+  Analyzing metrics in isolation without considering their impact on business outcomes. 
+  Over-reliance on technical metrics while sidelining business metrics. 
+  Infrequent review of metrics, missing out on real-time decision-making opportunities. 

 **Benefits of establishing this best practice:** 
+  Enhanced understanding of the correlation between technical performance and business outcomes. 
+  Improved decision-making process informed by real-time data. 
+  Proactive identification and mitigation of issues before they affect business outcomes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Leverage tools like Amazon CloudWatch to perform metric analysis. AWS services such as CloudWatch anomaly detection and Amazon DevOps Guru can be used to detect anomalies, especially when static thresholds are unknown or when patterns of behavior are more suited for anomaly detection. 

### Implementation steps
Implementation steps

1.  **Analyze and review:** Regularly review and interpret your workload metrics. 

   1.  Prioritize business outcome metrics over purely technical metrics. 

   1.  Understand the significance of spikes, drops, or patterns in your data. 

1.  **Utilize Amazon CloudWatch:** Use Amazon CloudWatch for a centralized view and deep-dive analysis. 

   1.  Configure CloudWatch dashboards to visualize your metrics and compare them over time. 

   1.  Use [percentiles in CloudWatch](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/) to get a clear view of metric distribution, which can help in defining SLAs and understanding outliers. 

   1.  Set up [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to identify unusual patterns without relying on static thresholds. 

   1.  Implement [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html) to monitor and troubleshoot applications that span multiple accounts within a Region. 

   1.  Use [CloudWatch Metric Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html) to query and analyze metric data across accounts and Regions, identifying trends and anomalies. 

   1.  Apply [CloudWatch Metric Math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) to transform, aggregate, or perform calculations on your metrics for deeper insights. 

1.  **Employ Amazon DevOps Guru:** Incorporate [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning-enhanced anomaly detection to identify early signs of operational issues for your serverless applications and remediate them before they impact your customers. 

1.  **Optimize based on insights:** Make informed decisions based on your metric analysis to adjust and improve your workloads. 

 **Level of effort for the Implementation Plan:** Medium 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 

 **Related documents:** 
+ [ The Wheel Blog - Emphasizing the importance of continually reviewing metrics ](https://aws.amazon.com/blogs/opensource/the-wheel/)
+ [ Percentile are important ](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/)
+ [ Using AWS Cost Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html)
+ [ CloudWatch cross-account observability ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html)
+ [ Query your metrics with CloudWatch Metrics Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html)

 **Related videos:** 
+ [ Enable Cross-Account Observability in Amazon CloudWatch ](https://www.youtube.com/watch?v=lUaDO9dqISc)
+ [ Introduction to Amazon DevOps Guru ](https://www.youtube.com/watch?v=2uA8q-8mTZY)
+ [ Continuously Analyze Metrics using AWS Cost Anomaly Detection](https://www.youtube.com/watch?v=IpQYBuay5OE)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Gaining operation insights with AIOps using Amazon DevOps Guru ](https://catalog.us-east-1.prod.workshops.aws/workshops/f92df379-6add-4101-8b4b-38b788e1222b/en-US)

# OPS08-BP02 Analyze workload logs
OPS08-BP02 Analyze workload logs

 Regularly analyzing workload logs is essential for gaining a deeper understanding of the operational aspects of your application. By efficiently sifting through, visualizing, and interpreting log data, you can continually optimize application performance and security. 

 **Desired outcome:** Rich insights into application behavior and operations derived from thorough log analysis, ensuring proactive issue detection and mitigation. 

 **Common anti-patterns:** 
+  Neglecting the analysis of logs until a critical issue arises. 
+  Not using the full suite of tools available for log analysis, missing out on critical insights. 
+  Solely relying on manual review of logs without leveraging automation and querying capabilities. 

 **Benefits of establishing this best practice:** 
+  Proactive identification of operational bottlenecks, security threats, and other potential issues. 
+  Efficient utilization of log data for continuous application optimization. 
+  Enhanced understanding of application behavior, aiding in debugging and troubleshooting. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) is a powerful tool for log analysis. Integrated features like CloudWatch Logs Insights and Contributor Insights make the process of deriving meaningful information from logs intuitive and efficient. 

### Implementation steps
Implementation steps

1.  **Set up CloudWatch Logs**: Configure applications and services to send logs to CloudWatch Logs. 

1.  **Use log anomaly detection:** Utilize [Amazon CloudWatch Logs anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/LogsAnomalyDetection.html) to automatically identify and alert on unusual log patterns. This tool helps you proactively manage anomalies in your logs and detect potential issues early. 

1.  **Set up CloudWatch Logs Insights**: Use [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to interactively search and analyze your log data. 

   1.  Craft queries to extract patterns, visualize log data, and derive actionable insights. 

   1.  Use [CloudWatch Logs Insights pattern analysis](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData_Patterns.html) to analyze and visualize frequent log patterns. This feature helps you understand common operational trends and potential outliers in your log data. 

   1.  Use [CloudWatch Logs compare (diff)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData_Compare.html) to perform differential analysis between different time periods or across different log groups. Use this capability to pinpoint changes and assess their impacts on your system's performance or behavior. 

1.  **Monitor logs in real-time with Live Tail:** Use [Amazon CloudWatch Logs Live Tail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogs_LiveTail.html) to view log data in real-time. You can actively monitor your application's operational activities as they occur, which provides immediate visibility into system performance and potential issues. 

1.  **Leverage Contributor Insights**: Use [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to identify top talkers in high cardinality dimensions like IP addresses or user-agents. 

1.  **Implement CloudWatch Logs metric filters**: Configure [CloudWatch Logs metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) to convert log data into actionable metrics. This allows you to set alarms or further analyze patterns. 

1.  **Implement [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html):** Monitor and troubleshoot applications that span multiple accounts within a Region. 

1.  **Regular review and refinement**: Periodically review your log analysis strategies to capture all relevant information and continually optimize application performance. 

 **Level of effort for the implementation plan:** Medium 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 

 **Related documents:** 
+  [Analyzing Log Data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) 
+  [Using CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) 
+  [Creating and Managing CloudWatch Log Metric Filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 

 **Related videos:** 
+  [Analyze Log Data with CloudWatch Logs Insights](https://www.youtube.com/watch?v=2s2xcwm8QrM) 
+  [Use CloudWatch Contributor Insights to Analyze High-Cardinality Data](https://www.youtube.com/watch?v=ErWRBLFkjGI) 

 **Related examples:** 
+  [CloudWatch Logs Sample Queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html) 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US/intro) 

# OPS08-BP03 Analyze workload traces
OPS08-BP03 Analyze workload traces

 Analyzing trace data is crucial for achieving a comprehensive view of an application's operational journey. By visualizing and understanding the interactions between various components, performance can be fine-tuned, bottlenecks identified, and user experiences enhanced. 

 **Desired outcome:** Achieve clear visibility into your application's distributed operations, enabling quicker issue resolution and an enhanced user experience. 

 **Common anti-patterns:** 
+  Overlooking trace data, relying solely on logs and metrics. 
+  Not correlating trace data with associated logs. 
+  Ignoring the metrics derived from traces, such as latency and fault rates. 

 **Benefits of establishing this best practice:** 
+  Improve troubleshooting and reduce mean time to resolution (MTTR). 
+  Gain insights into dependencies and their impact. 
+  Swift identification and rectification of performance issues. 
+  Leveraging trace-derived metrics for informed decision-making. 
+  Improved user experiences through optimized component interactions. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 [AWS X-Ray](https://www.docs.aws.com/xray/latest/devguide/aws-xray.html) offers a comprehensive suite for trace data analysis, providing a holistic view of service interactions, monitoring user activities, and detecting performance issues. Features like ServiceLens, X-Ray Insights, X-Ray Analytics, and Amazon DevOps Guru enhance the depth of actionable insights derived from trace data. 

### Implementation steps
Implementation steps

 The following steps offer a structured approach to effectively implementing trace data analysis using AWS services: 

1.  **Integrate AWS X-Ray**: Ensure X-Ray is integrated with your applications to capture trace data. 

1.  **Analyze X-Ray metrics**: Delve into metrics derived from X-Ray traces, such as latency, request rates, fault rates, and response time distributions, using the [service map](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-servicemap.html#xray-console-servicemap-view) to monitor application health. 

1.  **Use ServiceLens**: Leverage the [ServiceLens map](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_service_map.html) for enhanced observability of your services and applications. This allows for integrated viewing of traces, metrics, logs, alarms, and other health information. 

1.  **Enable X-Ray Insights**: 

   1.  Turn on [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) for automated anomaly detection in traces. 

   1.  Examine insights to pinpoint patterns and ascertain root causes, such as increased fault rates or latencies. 

   1.  Consult the insights timeline for a chronological analysis of detected issues. 

1.  **Use X-Ray Analytics**: [X-Ray Analytics](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html) allows you to thoroughly explore trace data, pinpoint patterns, and extract insights. 

1.  **Use groups in X-Ray**: Create groups in X-Ray to filter traces based on criteria such as high latency, allowing for more targeted analysis. 

1.  **Incorporate Amazon DevOps Guru**: Engage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) to benefit from machine learning models pinpointing operational anomalies in traces. 

1.  **Use CloudWatch Synthetics**: Use [CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_tracing.html) to create canaries for continually monitoring your endpoints and workflows. These canaries can integrate with X-Ray to provide trace data for in-depth analysis of the applications being tested. 

1.  **Use Real User Monitoring (RUM)**: With [AWS X-Ray and CloudWatch RUM](https://docs.aws.amazon.com/xray/latest/devguide/xray-services-RUM.html), you can analyze and debug the request path starting from end users of your application through downstream AWS managed services. This helps you identify latency trends and errors that impact your end users. 

1.  **Correlate with logs**: Correlate [trace data with related logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_troubleshooting.html#servicelens_troubleshooting_Nologs) within the X-Ray trace view for a granular perspective on application behavior. This allows you to view log events directly associated with traced transactions. 

1.  **Implement [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html):** Monitor and troubleshoot applications that span multiple accounts within a Region. 

 **Level of effort for the implementation plan:** Medium 

## Resources
Resources

 **Related best practices:** 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 

 **Related documents:** 
+  [Using ServiceLens to Monitor Application Health](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html) 
+  [Exploring Trace Data with X-Ray Analytics](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html) 
+  [Detecting Anomalies in Traces with X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-insights.html) 
+  [Continuous Monitoring with CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 

 **Related videos:** 
+  [Analyze and Debug Applications Using Amazon CloudWatch Synthetics & AWS X-Ray](https://www.youtube.com/watch?v=s2WvaV2eDO4) 
+  [Use AWS X-Ray Insights](https://www.youtube.com/watch?v=tl8OWHl6jxw) 

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US/intro) 
+  [Implementing X-Ray with AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html) 
+  [CloudWatch Synthetics Canary Templates](https://github.com/aws-samples/cloudwatch-synthetics-canary-terraform) 

# OPS08-BP04 Create actionable alerts
OPS08-BP04 Create actionable alerts

 Promptly detecting and responding to deviations in your application's behavior is crucial. Especially vital is recognizing when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ensures that the signals you receive are directly tied to business or operational impact. This approach to actionable alerts promotes proactive responses and helps maintain system performance and reliability. 

 **Desired outcome:** Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk. 

 **Common anti-patterns:** 
+  Setting up too many non-critical alerts, leading to alert fatigue. 
+  Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues. 
+  Neglecting to address root causes, leading to repetitive alerts for the same issue. 

 **Benefits of establishing this best practice:** 
+  Reduced alert fatigue by focusing on actionable and relevant alerts. 
+  Improved system uptime and reliability through proactive issue detection and mitigation. 
+  Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
Implementation guidance

 To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected. 

### Implementation steps
Implementation steps

1.  **Determine key performance indicators (KPIs)**: Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately. 

1.  **Implement anomaly detection**: 
   +  **Use Amazon CloudWatch anomaly detection**: Set up [Amazon CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to automatically detect unusual patterns, which helps you only generate alerts for genuine anomalies. 
   +  **Use AWS X-Ray Insights**: 

     1.  Set up [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) to detect anomalies in trace data. 

     1.  Configure [notifications for X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications) to be alerted on detected issues. 
   +  **Integrate with Amazon DevOps Guru**: 

     1.  Leverage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning capabilities in detecting operational anomalies with existing data. 

     1.  Navigate to the [notification settings](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html#navigate-to-notification-settings) in DevOps Guru to set up anomaly alerts. 

1.  **Implement actionable alerts**: Design alerts that provide adequate information for immediate action. 

   1.  Monitor [AWS Health events with Amazon EventBridge rules](https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html), or integrate programatically with the AWS Health API to automate actions when you receive AWS Health events. These can be general actions, such as sending all planned lifecycle event messages to a chat interface, or specific actions, such as the initiation of a workflow in an IT service management tool. 

1.  **Reduce alert fatigue**: Minimize non-critical alerts. When teams are overwhelmed with numerous insignificant alerts, they can lose oversight of critical issues, which diminishes the overall effectiveness of the alert mechanism. 

1.  **Set up composite alarms**: Use [Amazon CloudWatch composite alarms](https://aws.amazon.com/bloprove-monitoring-efficiency-using-amazon-cloudwatch-composite-alarms-2/) to consolidate multiple alarms. 

1.  **Integrate with alert tools**: Incorporate tools like [Ops Genie](https://www.atlassian.com/software/opsgenie) and [PagerDuty](https://www.pagerduty.com/). 

1.  **Engage Amazon Q Developer in chat applications**: Integrate [Amazon Q Developer in chat applications](https://aws.amazon.com/chatbot/) to relay alerts to Amazon Chime, Microsoft Teams, and Slack. 

1.  **Alert based on logs**: Use [log metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) in CloudWatch to create alarms based on specific log events. 

1.  **Review and iterate**: Regularly revisit and refine alert configurations. 

 **Level of effort for the implementation plan:** Medium 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 

 **Related documents:** 
+  [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Create a composite alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html) 
+  [Create a CloudWatch alarm based on anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) 
+  [DevOps Guru Notifications](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html) 
+  [X-ray insights notifications](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications) 
+  [Monitor, operate, and troubleshoot your AWS resources with interactive ChatOps](https://aws.amazon.com/chatbot/) 
+  [Amazon CloudWatch Integration Guide \$1 PagerDuty](https://support.pagerduty.com/docs/amazon-cloudwatch-integration-guide) 
+  [Integrate Opsgenie with Amazon CloudWatch](https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-amazon-cloudwatch/) 

 **Related videos:** 
+  [Create Composite Alarms in Amazon CloudWatch](https://www.youtube.com/watch?v=0LMQ-Mu-ZCY) 
+  [Amazon Q Developer in chat applications Overview](https://www.youtube.com/watch?v=0jUSEfHbTYk) 
+  [AWS On Air ft. Mutative Commands in Amazon Q Developer in chat applications](https://www.youtube.com/watch?v=u2pkw2vxrtk) 

 **Related examples:** 
+  [Alarms, incident management, and remediation in the cloud with Amazon CloudWatch](https://aws.amazon.com/bloarms-incident-management-and-remediation-in-the-cloud-with-amazon-cloudwatch/) 
+  [Tutorial: Creating an Amazon EventBridge rule that sends notifications to Amazon Q Developer in chat applications](https://docs.aws.amazon.com/chatbot/latest/adminguide/create-eventbridge-rule.html) 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US/intro) 

# OPS08-BP05 Create dashboards
OPS08-BP05 Create dashboards

 Dashboards are the human-centric view into the telemetry data of your workloads. While they provide a vital visual interface, they should not replace alerting mechanisms, but complement them. When crafted with care, not only can they offer rapid insights into system health and performance, but they can also present stakeholders with real-time information on business outcomes and the impact of issues. 

 **Desired outcome:** 

 Clear, actionable insights into system and business health using visual representations. 

 **Common anti-patterns:** 
+  Overcomplicating dashboards with too many metrics. 
+  Relying on dashboards without alerts for anomaly detection. 
+  Not updating dashboards as workloads evolve. 

 **Benefits of this best practice:** 
+  Immediate visibility into critical system metrics and KPIs. 
+  Enhanced stakeholder communication and understanding. 
+  Rapid insight into the impact of operational issues. 

 **Level of risk if this best practice isn't established:** Medium 

## Implementation guidance
Implementation guidance

 **Business-centric dashboards** 

 Dashboards tailored to business KPIs engage a wider array of stakeholders. While these individuals might not be interested in system metrics, they are keen on understanding the business implications of these numbers. A business-centric dashboard ensures that all technical and operational metrics being monitored and analyzed are in sync with overarching business goals. This alignment provides clarity, ensuring everyone is on the same page regarding what's essential and what's not. Additionally, dashboards that highlight business KPIs tend to be more actionable. Stakeholders can quickly understand the health of operations, areas that need attention, and the potential impact on business outcomes. 

 With this in mind, when creating your dashboards, ensure that there's a balance between technical metrics and business KPIs. Both are vital, but they cater to different audiences. Ideally, you should have dashboards that provide a holistic view of the system's health and performance while also emphasizing key business outcomes and their implications. 

 Amazon CloudWatch Dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different AWS Regions and accounts. 

### Implementation steps
Implementation steps

1.  **Create a basic dashboard:** [Create a new dashboard in CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html), giving it a descriptive name. 

1.  **Use Markdown widgets:** Before diving into the metrics, [use Markdown widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_text_dashboard.html) to add textual context at the top of your dashboard. This should explain what the dashboard covers, the significance of the represented metrics, and can also contain links to other dashboards and troubleshooting tools. 

1.  **Create dashboard variables:** [Incorporate dashboard variables](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_dashboard_variables.html) where appropriate to allow for dynamic and flexible dashboard views. 

1.  **Create metrics widgets:** [Add metric widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-and-work-with-widgets.html) to visualize various metrics your application emits, tailoring these widgets to effectively represent system health and business outcomes. 

1.  **Log Insights queries:** Utilize [CloudWatch Log Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_ExportQueryResults.html) to derive actionable metrics from your logs and display these insights on your dashboard. 

1.  **Set up alarms:** Integrate [CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_alarm_dashboard.html) into your dashboard for a quick view of any metrics breaching their thresholds. 

1.  **Use Contributor Insights:** Incorporate [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights-ViewReports.html) to analyze high-cardinality fields and get a clearer understanding of your resource's top contributors. 

1.  **Design custom widgets:** For specific needs not met by standard widgets, consider creating [custom widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_custom_widget_dashboard.html). These can pull from various data sources or represent data in unique ways. 

1.  **Use AWS Health Dashboard:** Use [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html) to get deeper insights into your account health, events, and upcoming changes that might affect your services and resources. You can also get a centralized view for health events in your AWS Organizations or build your own custom dashboards (for more detail, see Related examples). 

1.  **Iterate and refine:** As your application evolves, regularly revisit your dashboard to ensure its relevance. 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 

 **Related documents:** 
+  [Building Dashboards for Operational Visibility](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/) 
+  [Using Amazon CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 

 **Related videos:** 
+  [Create Cross Account & Cross Region CloudWatch Dashboards](https://www.youtube.com/watch?v=eIUZdaqColg) 
+  [AWS re:Invent 2021 - Gain enterprise visibility with AWS Cloud operation dashboards)](https://www.youtube.com/watch?v=NfMpYiGwPGo) 

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US/intro) 
+  [Application Monitoring with Amazon CloudWatch](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch/) 
+  [AWS Health Events Intelligence Dashboards and Insights](https://aws.amazon.com/blogs/mt/aws-health-events-intelligence-dashboards-insights/) 
+  [Visualize AWS Health events using Amazon Managed Grafana](https://aws.amazon.com/blogs/mt/visualize-aws-health-events-using-amazon-managed-grafana/) 

# OPS 9. How do you understand the health of your operations?


 Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action. 

**Topics**
+ [

# OPS09-BP01 Measure operations goals and KPIs with metrics
](ops_operations_health_measure_ops_goals_kpis.md)
+ [

# OPS09-BP02 Communicate status and trends to ensure visibility into operation
](ops_operations_health_communicate_status_trends.md)
+ [

# OPS09-BP03 Review operations metrics and prioritize improvement
](ops_operations_health_review_ops_metrics_prioritize_improvement.md)

# OPS09-BP01 Measure operations goals and KPIs with metrics
OPS09-BP01 Measure operations goals and KPIs with metrics

 Obtain goals and KPIs that define operations success from your organization and determine that metrics reflect these. Set baselines as a point of reference and reevaluate regularly. Develop mechanisms to collect these metrics from teams for evaluation. 

 **Desired outcome:** 
+  The goals and KPIs for the organization's operations teams have been published and shared. 
+  Metrics that reflect these KPIs are established. Examples may include: 
  +  Ticket queue depth or average age of ticket 
  +  Ticket count grouped by type of issue 
  +  Time spent working issues with or without a standardized operating procedure (SOP) 
  +  Amount of time spent recovering from a failed code push 
  +  Call volume 

 **Common anti-patterns:** 
+  Deployment deadlines are missed because developers are pulled away to perform troubleshooting tasks. Development teams argue for more personnel, but cannot quantify how many they need because the time taken away cannot be measured. 
+  A Tier 1 desk was set up to handle user calls. Over time, more workloads were added, but no headcount was allocated to the Tier 1 desk. Customer satisfaction suffers as call times increase and issues go longer without resolution, but management sees no indicators of such, preventing any action. 
+  A problematic workload has been handed off to a separate operations team for upkeep. Unlike other workloads, this new one was not supplied with proper documentation and runbooks. As such, teams spend longer troubleshooting and addressing failures. However, there are no metrics documenting this, which makes accountability difficult. 

 **Benefits of establishing this best practice:** Where workload monitoring shows the state of our applications and services, monitoring operations teams provide owners gain insight into changes among the consumers of those workloads, such as shifting business needs. Measure the effectiveness of these teams and evaluate them against business goals by creating metrics that can reflect the state of operations. Metrics can highlight support issues or identify when drifts occur away from a service level target. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Schedule time with business leaders and stakeholders to determine the overall goals of the service. Determine what the tasks of various operations teams should be and what challenges they could be approached with. Using these, brainstorm key performance indicators (KPIs) that might reflect these operations goals. These might be customer satisfaction, time from feature conception to deployment, average issue resolution time, and others. 

 Working from KPIs, identify the metrics and sources of data that might reflect these goals best. Customer satisfaction may be an combination of various metrics such as call wait or response times, satisfaction scores, and types of issues raised. Deployment times may be the sum of time needed for testing and deployment, plus any post-deployment fixes that needed to be added. Statistics showing the time spent on different types of issues (or the counts of those issues) can provide a window into where targeted effort is needed. 

## Resources
Resources

 **Related documents:** 
+ [ Quick - Using KPIs ](https://docs.aws.amazon.com/quicksight/latest/user/kpi.html)
+ [ Amazon CloudWatch - Using Metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)
+ [ Building Dashboards ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ How to track your cost optimization KPIs with KPI Dashboard ](https://aws.amazon.com/blogs/aws-cloud-financial-management/how-to-track-your-cost-optimization-kpis-with-the-kpi-dashboard/)

# OPS09-BP02 Communicate status and trends to ensure visibility into operation
OPS09-BP02 Communicate status and trends to ensure visibility into operation

 Knowing the state of your operations and its trending direction is necessary to identify when outcomes may be at risk, whether or not added work can be supported, or the effects that changes have had to your teams. During operations events, having status pages that users and operations teams can refer to for information can reduce pressure on communication channels and disseminate information proactively. 

 **Desired outcome:** 
+  Operations leaders have insight at a glance to see what sort of call volumes their teams are operating under and what efforts may be under way, such as deployments. 
+  Alerts are disseminated to stakeholders and user communities when impacts to normal operations occur. 
+  Organization leadership and stakeholders can check a status page in response to an alert or impact, and obtain information surrounding an operational event, such as points of contact, ticket information, and estimated recovery times. 
+  Reports are made available to leadership and other stakeholders to show operations statistics such as call volumes over a period of time, user satisfaction scores, numbers of outstanding tickets and their ages. 

 **Common anti-patterns:** 
+  A workload goes down, leaving a service unavailable. Call volumes spike as users request to know what's going on. Managers add to the volume requesting to know who's working an issue. Various operations teams duplicate efforts in trying to investigate. 
+  A desire for a new capability leads to several personnel being reassigned to an engineering effort. No backfill is provided, and issue resolution times spike. This information is not captured, and only after several weeks and dissatisfied user feedback does leadership become aware of the issue. 

 **Benefits of establishing this best practice:** During operational events where the business is impacted, much time and energy can be wasted querying information from various teams attempting to understand the situation. By establishing widely-disseminated status pages and dashboards, stakeholders can quickly obtain information such as whether or not an issue was detected, who has lead on the issue, or when a return to normal operations may be expected. This frees team members from spending too much time communicating status to others and more time addressing issues. 

 In addition, dashboards and reports can provide insights to decision-makers and stakeholders to see how operations teams are able to respond to business needs and how their resources are being allocated. This is crucial for determining if adequate resources are in place to support the business. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Build dashboards that show the current key metrics for your ops teams, and make them readily accessible both to operations leaders and management. 

 Build status pages that can be updated quickly to show when an incident or event is unfolding, who has ownership and who is coordinating the response. Share any steps or workarounds that users should consider on this page, and disseminate the location widely. Encourage users to check this location first when confronted with an unknown issue. 

 Collect and provide reports that show the health of operations over time, and distribute this to leaders and decision makers to illustrate the work of operations along with challenges and needs. 

 Share between teams these metrics and reports that best reflect goals and KPIs and where they have been influential in driving change. Dedicate time to these activities to elevate the importance of operations inside of and between teams. 

## Resources
Resources

 **Related documents:** 
+ [ Measure Progress ](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cloud-operating-model/measure-progress.html)
+ [ Building dashboards for operation visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)

 **Related solutions:** 
+ [ Data Operations ](https://aws.amazon.com/solutions/app-development/data-operations)

# OPS09-BP03 Review operations metrics and prioritize improvement
OPS09-BP03 Review operations metrics and prioritize improvement

 Setting aside dedicated time and resources for reviewing the state of operations ensures that serving the day-to-day line of business remains a priority. Pull together operations leaders and stakeholders to regularly review metrics, reaffirm or modify goals and objectives, and prioritize improvements. 

 **Desired outcome:** 
+  Operations leaders and staff regularly meet to review metrics over a given reporting period. Challenges are communicated, wins are celebrated, and lessons learned are shared. 
+  Stakeholders and business leaders are regularly briefed on the state of operations and solicited for input regarding goals, KPIs, and future initiatives. Tradeoffs between service delivery, operations, and maintenance are discussed and placed into context. 

 **Common anti-patterns:** 
+  A new product is launched, but the Tier 1 and Tier 2 operations teams are not adequately trained to support or given additional staff. Metrics that show the decrease in ticket resolution times and increase in incident volumes are not seen by leaders. Action is taken weeks later when subscription numbers start to fall as discontent users move off the platform. 
+  A manual process for performing maintenance on a workload has been in place for a long time. While a desire to automate has been present, this was a low priority given the low importance of the system. Over time however, the system has grown in importance and now these manual processes consume a majority of operations' time. No resources are scheduled for providing increased tooling to operations, leading to staff burnout as workloads increase. Leadership becomes aware once it's reported that staff are leaving for other competitors. 

 **Benefits of establishing this best practice:** In some organizations, it can become a challenge to allocate the same time and attention that is afforded to service delivery and new products or offerings. When this occurs, the line of business can suffer as the level of service expected slowly deteriorates. This is because operations does not change and evolve with the growing business, and can soon be left behind. Without regular review into the insights operations collects, the risk to the business may become visible only when it's too late. By allocating time to review metrics and procedures both among the operations staff and with leadership, the crucial role operations plays remains visible, and risks can be identified long before they reach critical levels. Operations teams get better insight into impending business changes and initiatives, allowing for proactive efforts to be undertaken. Leadership visibility into operations metrics showcases the role that these teams play in customer satisfaction, both internal and external, and let them better weigh choices for priorities, or ensure that operations has the time and resources to change and evolve with new business and workload initiatives. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Dedicate time to review operations metrics between stakeholders and operations teams and review report data. Place these reports in the contexts of the organizations goals and objectives to determine if they're being met. Identify sources of ambiguity where goals are not clear, or where there may be conflicts between what is asked for and what is given. 

 Identify where time, people, and tools can aid in operations outcomes. Determine which KPIs this would impact and what targets for success should be. Revisit regularly to ensure operations is resourced sufficiently to support the line of business. 

## Resources
Resources

 **Related documents:** 
+ [ Amazon Athena ](https://aws.amazon.com/athena/)
+ [ Amazon CloudWatch metrics and dimensions reference ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html)
+ [ Amazon Quick ](https://aws.amazon.com/quicksight/)
+ [AWS Glue](https://aws.amazon.com/glue/)
+ [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)
+ [ Collect metrics and logs from Amazon EC2 instances and on-premises servers with the Amazon CloudWatch Agent ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html)
+ [ Using Amazon CloudWatch metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)

# OPS 10. How do you manage workload and operations events?


 Prepare and validate procedures for responding to events to minimize their disruption to your workload. 

**Topics**
+ [

# OPS10-BP01 Use a process for event, incident, and problem management
](ops_event_response_event_incident_problem_process.md)
+ [

# OPS10-BP02 Have a process per alert
](ops_event_response_process_per_alert.md)
+ [

# OPS10-BP03 Prioritize operational events based on business impact
](ops_event_response_prioritize_events.md)
+ [

# OPS10-BP04 Define escalation paths
](ops_event_response_define_escalation_paths.md)
+ [

# OPS10-BP05 Define a customer communication plan for service-impacting events
](ops_event_response_push_notify.md)
+ [

# OPS10-BP06 Communicate status through dashboards
](ops_event_response_dashboards.md)
+ [

# OPS10-BP07 Automate responses to events
](ops_event_response_auto_event_response.md)

# OPS10-BP01 Use a process for event, incident, and problem management
OPS10-BP01 Use a process for event, incident, and problem management

The ability to efficiently manage events, incidents, and problems is key to maintaining workload health and performance. It's crucial to recognize and understand the differences between these elements to develop an effective response and resolution strategy. Establishing and following a well-defined process for each aspect helps your team swiftly and effectively handle any operational challenges that arise.

 **Desired outcome:** Your organization effectively manages operational events, incidents, and problems through well-documented and centrally stored processes. These processes are consistently updated to reflect changes, streamlining handling and maintaining high service reliability and workload performance. 

 **Common anti-patterns:** 
+  You reactively, rather than proactively, respond to events. 
+  Inconsistent approaches are taken to different types of events or incidents. 
+ Your organization does not analyze and learn from incidents to prevent future occurrences.

 **Benefits of establishing this best practice:** 
+  Streamlined and standardized response processes. 
+  Reduced impact of incidents on services and customers. 
+  Expedited issue resolution. 
+  Continuous improvement in operational processes. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
Implementation guidance

 Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed. 

 **Understanding events, incidents, and problems** 
+  **Events:** An *event* is an observation of an action, occurrence, or change of state. Events can be planned or unplanned and they can originate internally or externally to the workload. 
+  **Incidents:** *Incidents* are events that require a response, like unplanned interruptions or degradations of service quality. They represent disruptions that need immediate attention to restore normal workload operation. 
+  **Problems:** *Problems* are the underlying causes of one or more incidents. Identifying and resolving problems involves digging deeper into the incidents to prevent future occurrences. 

### Implementation steps
Implementation steps

 **Events** 

1.  **Monitor events:** 
   +  [Implement observability](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/implement-observability.html) and [utilize workload observability](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/utilizing-workload-observability.html). 
   +  Monitor actions taken by a user, role, or an AWS service are recorded as events in [AWS CloudTrail](https://aws.amazon.com/cloudtrail/). 
   +  Respond to operational changes in your applications in real time with [Amazon EventBridge](https://aws.amazon.com/eventbridge/). 
   +  Continually assess, monitor, and record resource configuration changes with [AWS Config](https://aws.amazon.com/config/). 

1.  **Create processes:** 
   +  Develop a process to assess which events are significant and require monitoring. This involves setting thresholds and parameters for normal and abnormal activities. 
   +  Determine criteria escalating an event to an incident. This could be based on the severity, impact on users, or deviation from expected behavior. 
   +  Regularly review the event monitoring and response processes. This includes analyzing past incidents, adjusting thresholds, and refining alerting mechanisms. 

 **Incidents** 

1.  **Respond to incidents:** 
   +  Use insights from observability tools to quickly identify and respond to incidents. 
   +  Implement [AWS Systems Manager Ops Center](https://aws.amazon.com/systems-manager/features/#OpsCenter) to aggregate, organize, and prioritize operational items and incidents. 
   +  Use services like [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) and [AWS X-Ray](https://aws.amazon.com/xray/) for deeper analysis and troubleshooting. 
   +  Consider [AWS Managed Services (AMS)](https://aws.amazon.com/managed-services/) for enhanced incident management, leveraging its proactive, preventative, and detective capabilities. AMS extends operational support with services like monitoring, incident detection and response, and security management. 
   +  Enterprise Support customers can use [AWS Incident Detection and Response](https://aws.amazon.com/premiumsupport/aws-incident-detection-response/), which provides continual proactive monitoring and incident management for production workloads. 

1.  **Create an incident management process:** 
   +  Establish a structured incident management process, including clear roles, communication protocols, and steps for resolution. 
   +  Integrate incident management with tools like [Amazon Q Developer in chat applications](https://aws.amazon.com/chatbot/) for efficient response and coordination. 
   +  Categorize incidents by severity, with predefined [incident response plans](https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html) for each category. 

1.  **Learn and improve:** 
   +  Conduct [post-incident analysis](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_evolve_ops_perform_rca_process.html) to understand root causes and resolution effectiveness. 
   +  Continually update and improve response plans based on reviews and evolving practices. 
   +  Document and share lessons learned across teams to enhance operational resilience. 
   +  Enterprise Support customers can request the [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement. 

 **Problems** 

1.  **Identify problems:** 
   +  Use data from previous incidents to identify recurring patterns that may indicate deeper systemic issues. 
   +  Leverage tools like [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) and [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to analyze trends and uncover underlying problems. 
   +  Engage cross-functional teams, including operations, development, and business units, to gain diverse perspectives on the root causes. 

1.  **Create a problem management process:** 
   +  Develop a structured process for problem management, focusing on long-term solutions rather than quick fixes. 
   +  Incorporate root cause analysis (RCA) techniques to investigate and understand the underlying causes of incidents. 
   +  Update operational policies, procedures, and infrastructure based on findings to prevent recurrence. 

1.  **Continue to improve:** 
   +  Foster a culture of constant learning and improvement, encouraging teams to proactively identify and address potential problems. 
   +  Regularly review and revise problem management processes and tools to align with evolving business and technology landscapes. 
   +  Share insights and best practices across the organization to build a more resilient and efficient operational environment. 

1.  **Engage AWS Support:** 
   +  Use AWS support resources, such as [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/), for proactive guidance and optimization recommendations. 
   +  Enterprise Support customers can access specialized programs like [AWS Countdown](https://aws.amazon.com/premiumsupport/aws-countdown/) for support during critical events. 

 **Level of effort for the implementation plan:** Medium 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md)
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) 

 **Related documents:** 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+ [AWS Incident Detection and Response ](https://aws.amazon.com/premiumsupport/aws-incident-detection-response/)
+ [AWS Cloud Adoption Framework: Operations Perspective - Incident and problem management ](https://docs.aws.amazon.com/whitepapers/latest/aws-caf-operations-perspective/incident-and-problem-management.html)
+  [Incident Management in the Age of DevOps and SRE](https://www.infoq.com/presentations/incident-management-devops-sre/) 
+  [PagerDuty - What is Incident Management?](https://www.pagerduty.com/resources/learn/what-is-incident-management/) 

 **Related videos:** 
+ [ Top incident response tips from AWS](https://www.youtube.com/watch?v=Cu20aOvnHwA)
+ [AWS re:Invent 2022 - The Amazon Builders' Library: 25 yrs of Amazon operational excellence ](https://www.youtube.com/watch?v=DSRhgBd_gtw)
+ [AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201) ](https://www.youtube.com/watch?v=IbSgM4IP9IE)
+ [ Introducing Incident Manager from AWS Systems Manager](https://www.youtube.com/watch?v=I6lScgh4qds)

 **Related examples:** 
+  [AWS Proactive Services – Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+ [ How to Automate Incident Response with PagerDuty and AWS Systems Manager Incident Manager](https://aws.amazon.com/blogs/mt/how-to-automate-incident-response-with-pagerduty-and-aws-systems-manager-incident-manager/)
+ [ Engage Incident Responders with the On-Call Schedules in AWS Systems Manager Incident Manager](https://aws.amazon.com/blogs/mt/engage-incident-responders-with-the-on-call-schedules-in-aws-systems-manager-incident-manager/)
+ [ Improve the Visibility and Collaboration during Incident Handling in AWS Systems Manager Incident Manager](https://aws.amazon.com/blogs/mt/improve-the-visibility-and-collaboration-during-incident-handling-in-aws-systems-manager-incident-manager/)
+ [ Incident reports and service requests in AMS ](https://docs.aws.amazon.com/managedservices/latest/userguide/support-experience.html)

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 

# OPS10-BP02 Have a process per alert
OPS10-BP02 Have a process per alert

 Establishing a clear and defined process for each alert in your system is essential for effective and efficient incident management. This practice ensures that every alert leads to a specific, actionable response, improving the reliability and responsiveness of your operations. 

 **Desired outcome:** Every alert initiates a specific, well-defined response plan. Where possible, responses are automated, with clear ownership and a defined escalation path. Alerts are linked to an up-to-date knowledge base so that any operator can respond consistently and effectively. Responses are quick and uniform across the board, enhancing operational efficiency and reliability. 

 **Common anti-patterns:** 
+  Alerts have no predefined response process, leading to makeshift and delayed resolutions. 
+  Alert overload causes important alerts to be overlooked. 
+  Alerts are inconsistently handled due to lack of clear ownership and responsibility. 

 **Benefits of establishing this best practice:** 
+  Reduced alert fatigue by only raising actionable alerts. 
+  Decreased mean time to resolution (MTTR) for operational issues. 
+  Decreased mean time to investigate (MTTI), which helps reduce MTTR. 
+  Enhanced ability to scale operational responses. 
+  Improved consistency and reliability in handling operational events. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
Implementation guidance

 Having a process per alert involves establishing a clear response plan for each alert, automating responses where possible, and continually refining these processes based on operational feedback and evolving requirements. 

### Implementation steps
Implementation steps

 The following diagram illustrates the incident management workflow within [AWS Systems Manager Incident Manager](https://aws.amazon.com/systems-manager/features/incident-manager/). It is designed to respond swiftly to operational issues by automatically creating incidents in response to specific events from [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) or [Amazon EventBridge](https://aws.amazon.com/eventbridge/). When an incident is created, either automatically or manually, Incident Manager centralizes the management of the incident, organizes relevant AWS resource information, and initiates predefined response plans. This includes running Systems Manager Automation runbooks for immediate action, as well as creating a parent operational work item in OpsCenter to track related tasks and analyses. This streamlined process speeds up and coordinates incident response across your AWS environment. 

![\[Flowchart depicting how Incident Manager works - Amazon Q Developer in chat applications, escalation plans and contacts, and runbooks flow into response plans, which flow into incident and analysis. Amazon CloudWatch also flows into response plans.\]](http://docs.aws.amazon.com/wellarchitected/2024-06-27/framework/images/incident-manager-how-it-works.png)


 

1.  **Use composite alarms:** Create [composite alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html) in CloudWatch to group related alarms, reducing noise and allowing for more meaningful responses. 

1.  **Integrate Amazon CloudWatch alarms with Incident Manager** Configure CloudWatch alarms to automatically create incidents in [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html). 

1.  **Integrate Amazon EventBridge with Incident Manager:** Create [EventBridge rules](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule.html) to react to events and create incidents using defined response plans. 

1.  **Prepare for incidents in Incident Manager:** 
   +  Establish detailed [response plans](https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html) in Incident Manager for each type of alert. 
   +  Establish chat channels through [Amazon Q Developer in chat applications](https://docs.aws.amazon.com/incident-manager/latest/userguide/chat.html) connected to response plans in Incident Manager, facilitating real-time communication during incidents across platforms like Slack, Microsoft Teams, and Amazon Chime. 
   +  Incorporate [Systems Manager Automation runbooks](https://docs.aws.amazon.com/incident-manager/latest/userguide/runbooks.html) within Incident Manager to drive automated responses to incidents. 

## Resources
Resources

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 

 **Related documents:** 
+ [AWS Cloud Adoption Framework: Operations Perspective - Incident and problem management ](https://docs.aws.amazon.com/whitepapers/latest/aws-caf-operations-perspective/incident-and-problem-management.html)
+ [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
+ [ Setting up AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/setting-up.html)
+ [ Preparing for incidents in Incident Manager ](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-response.html)

 **Related videos:** 
+ [ Top incident response tips from AWS](https://www.youtube.com/watch?v=Cu20aOvnHwA)

 **Related examples:** 
+ [AWS Workshops - AWS Systems Manager Incident Manager - Automate incident response to security events ](https://catalog.workshops.aws/automate-incident-response/en-US/settingupim/onboarding)

# OPS10-BP03 Prioritize operational events based on business impact
OPS10-BP03 Prioritize operational events based on business impact

 Responding promptly to operational events is critical, but not all events are equal. When you prioritize based on business impact, you also prioritize addressing events with the potential for significant consequences, such as safety, financial loss, regulatory violations, or damage to reputation. 

 **Desired outcome:** Responses to operational events are prioritized based on potential impact to business operations and objectives. This makes the responses efficient and effective. 

 **Common anti-patterns:** 
+  Every event is treated with the same level of urgency, leading to confusion and delays in addressing critical issues. 
+  You fail to distinguish between high and low impact events, leading to misallocation of resources. 
+  Your organization lacks a clear prioritization framework, resulting in inconsistent responses to operational events. 
+  Events are prioritized based on the order they are reported, rather than their impact on business outcomes. 

 **Benefits of establishing this best practice:** 
+  Ensures critical business functions receive attention first, minimizing potential damage. 
+  Improves resource allocation during multiple concurrent events. 
+  Enhances the organization's ability to maintain trust and meet regulatory requirements. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
Implementation guidance

 When faced with multiple operational events, a structured approach to prioritization based on impact and urgency is essential. This approach helps you make informed decisions, direct efforts where they're needed most, and mitigate the risk to business continuity. 

### Implementation steps
Implementation steps

1.  **Assess impact:** Develop a classification system to evaluate the severity of events in terms of their potential impact on business operations and objectives. The following example shows impact categories:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/2024-06-27/framework/ops_event_response_prioritize_events.html)

1.  **Assess urgency:** Define urgency levels for how quickly an event needs a response, considering factors such as safety, financial implications, and service-level agreements (SLAs). The following example demonstrates urgency categories:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/2024-06-27/framework/ops_event_response_prioritize_events.html)

1.  **Create a prioritization matrix:** 
   +  Use a matrix to cross-reference impact and urgency, assigning priority levels to different combinations. 
   +  Make the matrix accessible and understood by all team members responsible for operational event responses. 
   +  The following example matrix displays incident severity according to urgency and impact:     
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/2024-06-27/framework/ops_event_response_prioritize_events.html)

1.  **Train and communicate:** Train response teams on the prioritization matrix and the importance of following it during an event. Communicate the prioritization process to all stakeholders to set clear expectations. 

1.  **Integrate with incident response:** 
   +  Incorporate the prioritization matrix into your incident response plans and tools. 
   +  Automate the classification and prioritization of events where possible to speed up response times. 
   +  Enterprise Support customers can leverage [AWS Incident Detection and Response](https://aws.amazon.com/premiumsupport/aws-incident-detection-response/), which provides 24x7 proactive monitoring and incident management for production workloads. 

1.  **Review and adapt:** Regularly review the effectiveness of the prioritization process and make adjustments based on feedback and changes in the business environment. 

## Resources
Resources

 **Related best practices:** 
+  [OPS03-BP03 Escalation is encouraged](ops_org_culture_team_enc_escalation.md) 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 
+  [OPS09-BP01 Measure operations goals and KPIs with metrics](ops_operations_health_measure_ops_goals_kpis.md) 

 **Related documents:** 
+ [ Atlassian - Understanding incident severity levels ](https://www.atlassian.com/incident-management/kpis/severity-levels)
+ [ IT Process Map - Checklist Incident Priority ](https://wiki.en.it-processmaps.com/index.php/Checklist_Incident_Priority)

# OPS10-BP04 Define escalation paths
OPS10-BP04 Define escalation paths

Establish clear escalation paths within your incident response protocols to facilitate timely and effective action. This includes specifying prompts for escalation, detailing the escalation process, and pre-approving actions to expedite decision-making and reduce mean time to resolution (MTTR).

 **Desired outcome:** A structured and efficient process that escalates incidents to the appropriate personnel, minimizing response times and impact. 

 **Common anti-patterns:** 
+ Lack of clarity on recovery procedures leads to makeshift responses during critical incidents.
+ Absence of defined permissions and ownership results in delays when urgent action is needed.
+  Stakeholders and customers are not informed in line with expectations. 
+  Important decisions are delayed. 

 **Benefits of establishing this best practice:** 
+  Streamlined incident response through predefined escalation procedures. 
+  Reduced downtime with pre-approved actions and clear ownership. 
+  Improved resource allocation and support-level adjustments according to incident severity. 
+  Improved communication to stakeholders and customers. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Properly defined escalation paths are crucial for rapid incident response. AWS Systems Manager Incident Manager supports the setup of structured escalation plans and on-call schedules, which alert the right personnel so that they are ready to act when incidents occur. 

### Implementation steps
Implementation steps

1.  **Set up escalation prompts:** Set up [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-actions) to create an incident in [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com//incident-manager/latest/userguide/incident-creation.html). 

1.  ** Set up on-call schedules:** Create [on-call schedules](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-manager-on-call-schedule-create.html) in Incident Manager that align with your escalation paths. Equip on-call personnel with the necessary permissions and tools to act swiftly. 

1.  ** Detail escalation procedures: ** 
   +  Determine specific conditions under which an incident should be escalated. 
   +  Create [escalation plans](https://docs.aws.amazon.com/incident-manager/latest/userguide/escalation.html) in Incident Manager. 
   +  Escalation channels should consist of a contact or an on-call schedule. 
   +  Define the roles and responsibilities of the team at each escalation level. 

1.  **Pre-approve mitigation actions:** Collaborate with decision-makers to pre-approve actions for anticipated scenarios. Use [Systems Manager Automation runbooks](https://docs.aws.amazon.com//incident-manager/latest/userguide/tutorials-runbooks.html) integrated with Incident Manager to speed up incident resolution. 

1.  **Specify ownership:** Clearly identify internal owners for each step of the escalation path. 

1.  **Detail third-party escalations:** 
   +  Document third-party service-level agreements (SLAs), and align them with internal goals. 
   +  Set clear protocols for vendor communication during incidents. 
   +  Integrate vendor contacts into incident management tools for direct access. 
   +  Conduct regular drills that include third-party response scenarios. 
   +  Keep vendor escalation information well-documented and easily accessible. 

1.  **Train and rehearse escalation plans:** Train your team on the escalation process and conduct regular incident response drills or game days. Enterprise Support customers can request an [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/). 

1.  **Continue to improve:** Review the effectiveness of your escalation paths regularly. Update your processes based on lessons learned from incident post-mortems and continuous feedback. 

 **Level of effort for the implementation plan:** Moderate 

## Resources
Resources

 **Related best practices:** 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md) 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) 

 **Related documents:** 
+ [AWS Systems Manager Incident Manager Escalation Plans ](https://docs.aws.amazon.com/incident-manager/latest/userguide/escalation.html)
+ [ Working with on-call schedules in Incident Manager ](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-manager-on-call-schedule.html)
+ [ Creating and Managing Runbooks ](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html)
+ [ Temporary elevated access management with AWS IAM Identity Center](https://aws.amazon.com/blogs/security/temporary-elevated-access-management-with-iam-identity-center/)
+ [ Atlassian - Escalation policies for effective incident management ](https://www.atlassian.com/incident-management/on-call/escalation-policies)

# OPS10-BP05 Define a customer communication plan for service-impacting events
OPS10-BP05 Define a customer communication plan for service-impacting events

 Effective communication during service impacting events is critical to maintain trust and transparency with customers. A well-defined communication plan helps your organization quickly and clearly share information, both internally and externally, during incidents. 

 **Desired outcome:** 
+  A robust communication plan that effectively informs customers and stakeholders during service impacting events. 
+  Transparency in communication to build trust and reduce customer anxiety. 
+  Minimizing the impact of service impacting events on customer experience and business operations. 

 **Common anti-patterns:** 
+  Inadequate or delayed communication leads to customer confusion and dissatisfaction. 
+  Overly technical or vague messaging fails to convey the actual impact on users. 
+  There is no predefined communication strategy, resulting in inconsistent and reactive messaging. 

 **Benefits of establishing this best practice:** 
+  Enhanced customer trust and satisfaction through proactive and clear communication. 
+  Reduced burden on support teams by preemptively addressing customer concerns. 
+  Improved ability to manage and recover from incidents effectively. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Creating a comprehensive communication plan for service impacting events involves multiple facets, from choosing the right channels to crafting the message and tone. The plan should be adaptable, scalable, and cater to different outage scenarios. 

### Implementation steps
Implementation steps

1.  **Define roles and responsibilities:** 
   +  Assign a major incident manager to oversee incident response activities. 
   +  Designate a communications manager responsible for coordinating all external and internal communications. 
   +  Include the support manager to provide consistent communication through support tickets. 

1.  **Identify communication channels:** Select channels like workplace chat, email, SMS, social media, in-app notifications, and status pages. These channels should be resilient and able to operate independently during service impacting events. 

1.  ** Communicate quickly, clearly, and regularly to customers: ** 
   +  Develop templates for various service impairment scenarios, emphasizing simplicity and essential details. Include information about the service impairment, expected resolution time, and impact. 
   +  Use Amazon Pinpoint to alert customers using push notifications, in-app notifications, emails, text messages, voice messages, and messages over custom channels. 
   +  Use Amazon Simple Notification Service (Amazon SNS) to alert subscribers programatically or through email, mobile push notifications, and text messages. 
   +  Communicate status through dashboards by sharing an Amazon CloudWatch dashboard publicly. 
   +  Encourage social media engagement: 
     +  Actively monitor social media to understand customer sentiment. 
     +  Post on social media platforms for public updates and community engagement. 
     +  Prepare templates for consistent and clear social media communication. 

1.  **Coordinate internal communication:** Implement internal protocols using tools like Amazon Q Developer in chat applications for team coordination and communication. Use CloudWatch dashboards to communicate status. 

1.  ** Orchestrate communication with dedicated tools and services: ** 
   +  Use AWS Systems Manager Incident Manager with Amazon Q Developer in chat applications to set up dedicated chat channels for real-time internal communication and coordination during incidents. 
   +  Use AWS Systems Manager Incident Manager runbooks to automate customer notifications through Amazon Pinpoint, Amazon SNS, or third-party tools like social media platforms during incidents. 
   +  Incorporate approval workflows within runbooks to optionally review and authorize all external communications before sending. 

1.  ** Practice and improve: ** 
   +  Conduct training on the use of communication tools and strategies. Empower teams to make timely decisions during incidents. 
   +  Test the communication plan through regular drills or gamedays. Use these tests to refine messaging and evaluate the effectiveness of channels. 
   +  Implement feedback mechanisms to assess communication effectiveness during incidents. Continually evolve the communication plan based on feedback and changing needs. 

 **Level of effort for the implementation plan:** High 

## Resources
Resources

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md) 
+  [OPS10-BP06 Communicate status through dashboards](ops_event_response_dashboards.md) 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) 

 **Related documents:** 
+ [ Atlassian - Incident communication best practices ](https://www.atlassian.com/incident-management/incident-communication)
+ [ Atlassian - How to write a good status update ](https://www.atlassian.com/blog/statuspage/how-to-write-a-good-status-update)
+ [ PagerDuty - A Guide to Incident Communications ](https://www.pagerduty.com/resources/learn/a-guide-to-incident-communications/)

 **Related videos:** 
+ [ Atlassian - Create your own incident communication plan: Incident templates ](https://www.youtube.com/watch?v=ZROVn6-K2qU)

 **Related examples:** 
+  [AWS Health Dashboard ](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/) 
+ [ Example AWS status updates ](https://health.aws.amazon.com/health/status?eventID=arn:aws:health:global::event/IAM/AWS_IAM_OPERATIONAL_ISSUE/AWS_IAM_OPERATIONAL_ISSUE_91687_3F72CAD4902)

# OPS10-BP06 Communicate status through dashboards
OPS10-BP06 Communicate status through dashboards

 Use dashboards as a strategic tool to convey real-time operational status and key metrics to different audiences, including internal technical teams, leadership, and customers. These dashboards offer a centralized, visual representation of system health and business performance, enhancing transparency and decision-making efficiency. 

 **Desired outcome:** 
+  Your dashboards provide a comprehensive view of the system and business metrics relevant to different stakeholders. 
+  Stakeholders can proactively access operational information, reducing the need for frequent status requests. 
+  Real-time decision-making is enhanced during normal operations and incidents. 

 **Common anti-patterns:** 
+ Engineers joining an incident management call require status updates to get up to speed.
+ Relying on manual reporting for management, which leads to delays and potential inaccuracies.
+  Operations teams are frequently interrupted for status updates during incidents. 

 **Benefits of establishing this best practice:** 
+  Empowers stakeholders with immediate access to critical information, promoting informed decision-making. 
+  Reduces operational inefficiencies by minimizing manual reporting and frequent status inquiries. 
+  Increases transparency and trust through real-time visibility into system performance and business metrics. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Dashboards effectively communicate the status of your systems and business metrics and can be tailored to the needs of different audience groups. Tools like Amazon CloudWatch dashboards and Amazon Quick help you create interactive, real-time dashboards for system monitoring and business intelligence. 

### Implementation steps
Implementation steps

1.  **Identify stakeholder needs:** Determine the specific information needs of different audience groups, such as technical teams, leadership, and customers. 

1.  ** Choose the right tools:** Select appropriate tools like [Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) for system monitoring and [Amazon Quick](https://aws.amazon.com/quicksight/) for interactive business intelligence. 

1.  **Design effective dashboards:** 
   +  Design dashboards to clearly present relevant metrics and KPIs, ensuring they are understandable and actionable. 
   +  Incorporate system-level and business-level views as needed. 
   +  Include both high-level (for broad overviews) and low-level (for detailed analysis) dashboards. 
   +  Integrate automated alarms within dashboards to highlight critical issues. 
   +  Annotate dashboards with important metrics thresholds and goals for immediate visibility. 

1.  **Integrate data sources:** 
   +  Use [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to aggregate and display metrics from various AWS services and [query metrics from other data sources](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/MultiDataSourceQuerying.html), creating a unified view of your system's health and business metrics. 
   +  Use features like [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to query and visualize log data from different applications and services. 

1.  **Provide self-service access:** 
   +  Share CloudWatch dashboards with relevant stakeholders for self-service information access using [dashboard sharing features](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html). 
   +  Ensure that dashboards are easily accessible and provide real-time, up-to-date information. 

1.  **Regularly update and refine:** 
   +  Continually update and refine dashboards to align with evolving business needs and stakeholder feedback. 
   +  Regularly review the dashboards to keep them relevant and effective for conveying the necessary information. 

## Resources
Resources

 **Related best practices:** 
+  [OPS08-BP05 Create dashboards](ops_workload_observability_create_dashboards.md) 

 **Related documents:** 
+ [ Building dashboards for operational visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ Using Amazon CloudWatch dashboards ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html)
+ [ Create flexible dashboards with dashboard variables ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_dashboard_variables.html)
+ [ Sharing CloudWatch dashboards ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-dashboard-sharing.html)
+ [ Query metrics from other data sources ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/MultiDataSourceQuerying.html)
+ [ Add a custom widget to a CloudWatch dashboard ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_custom_widget_dashboard.html)

 **Related examples:** 
+ [ One Observability Workshop - Dashboards ](https://catalog.us-east-1.prod.workshops.aws/workshops/31676d37-bbe9-4992-9cd1-ceae13c5116c/en-US/aws-native/dashboards)

# OPS10-BP07 Automate responses to events
OPS10-BP07 Automate responses to events

 Automating event responses is key for fast, consistent, and error-free operational handling. Create streamlined processes and use tools to automatically manage and respond to events, minimizing manual interventions and enhancing operational effectiveness. 

 **Desired outcome:** 
+  Reduced human errors and faster resolution times through automation. 
+  Consistent and reliable operational event handling. 
+  Enhanced operational efficiency and system reliability. 

 **Common anti-patterns:** 
+ Manual event handling leads to delays and errors.
+ Automation is overlooked in repetitive, critical tasks.
+  Repetitive, manual tasks lead to alert fatigue and missing critical issues. 

 **Benefits of establishing this best practice:** 
+  Accelerated event responses, reducing system downtime. 
+  Reliable operations with automated and consistent event handling. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
Implementation guidance

 Incorporate automation to create efficient operational workflows and minimize manual interventions. 

### Implementation steps
Implementation steps

1.  **Identify automation opportunites:** Determine repetitive tasks for automation, such as issue remediation, ticket enrichment, capacity management, scaling, deployments, and testing. 

1.  **Identify automation prompts:** 
   +  Assess and define specific conditions or metrics that initiate automated responses using [Amazon CloudWatch alarm actions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-actions). 
   +  Use [Amazon EventBridge](https://aws.amazon.com/eventbridge/) to respond to events in AWS services, custom workloads, and SaaS applications. 
   +  Consider initiation events such as [specific log entries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html), [performance metrics thresholds](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), or [state changes](https://docs.aws.amazon.com/config/latest/developerguide/remediation.html) in AWS resources. 

1.  **Implement event-driven automation:** 
   +  Use AWS Systems Manager Automation runbooks to simplify maintenance, deployment, and remediation tasks. 
   +  [Creating incidents in Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html) automatically gathers and adds details about the involved AWS resources to the incident. 
   +  Proactively monitor quotas using [Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/). 
   +  Automatically adjust capacity with [AWS Auto Scaling](https://aws.amazon.com/autoscaling/) to maintain availability and performance. 
   +  Automate development pipelines with [Amazon CodeCatalyst](https://codecatalyst.aws/explore). 
   +  Smoke test or continually monitor endpoints and APIs [using synthetic monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html). 

1.  **Perform risk mitigation through automation:** 
   +  Implement [automated security responses](https://aws.amazon.com/solutions/implementations/automated-security-response-on-aws/) to swiftly address risks. 
   +  Use [AWS Systems Manager State Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-state.html) to reduce configuration drift. 
   +  [Remediate noncompliant resources with AWS Config Rules](https://docs.aws.amazon.com/config/latest/developerguide/remediation.html). 

 **Level of effort for the implementation plan:** High 

## Resources
Resources

 **Related best practices:** 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md) 

 **Related documents:** 
+  [Using Systems Manager Automation runbooks with Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/tutorials-runbooks.html) 
+  [Creating incidents in Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html) 
+  [AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [Monitor resource usage and send notifications when approaching quotas](https://docs.aws.amazon.com/solutions/latest/quota-monitor-for-aws/solution-overview.html) 
+  [AWS Auto Scaling](https://aws.amazon.com/autoscaling/) 
+  [What is Amazon CodeCatalyst?](https://docs.aws.amazon.com/codecatalyst/latest/userguide/welcome.html) 
+  [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Using Amazon CloudWatch alarm actions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-actions) 
+  [Remediating Noncompliant Resources with AWS Config Rules](https://docs.aws.amazon.com/config/latest/developerguide/remediation.html) 
+  [Creating metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
+  [AWS Systems Manager State Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-state.html) 

 **Related videos:** 
+ [ Create Automation Runbooks with AWS Systems Manager](https://www.youtube.com/watch?v=fQ_KahCPBeU)
+ [ How to automate IT Operations on AWS](https://www.youtube.com/watch?v=GuWj_mlyTug)
+ [AWS Security Hub CSPM automation rules ](https://www.youtube.com/watch?v=XaMfO_MERH8)
+ [ Start your software project fast with Amazon CodeCatalyst blueprints ](https://www.youtube.com/watch?v=rp7roaoPzFE)

 **Related examples:** 
+ [ Amazon CodeCatalyst Tutorial: Creating a project with the Modern three-tier web application blueprint ](https://docs.aws.amazon.com/codecatalyst/latest/userguide/getting-started-template-project.html)
+ [ One Observability Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/31676d37-bbe9-4992-9cd1-ceae13c5116c/en-US)
+ [ Respond to incidents using Incident Manager ](https://catalog.workshops.aws/getting-started-with-com/en-US/operations-management/incident-manager)