

# OPS 8. How do you understand the health of your workload?
<a name="ops-08"></a>

 Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action. 

**Topics**
+ [OPS08-BP01 Identify key performance indicators](ops_workload_health_define_workload_kpis.md)
+ [OPS08-BP02 Define workload metrics](ops_workload_health_design_workload_metrics.md)
+ [OPS08-BP03 Collect and analyze workload metrics](ops_workload_health_collect_analyze_workload_metrics.md)
+ [OPS08-BP04 Establish workload metrics baselines](ops_workload_health_workload_metric_baselines.md)
+ [OPS08-BP05 Learn expected patterns of activity for workload](ops_workload_health_learn_workload_usage_patterns.md)
+ [OPS08-BP06 Alert when workload outcomes are at risk](ops_workload_health_workload_outcome_alerts.md)
+ [OPS08-BP07 Alert when workload anomalies are detected](ops_workload_health_workload_anomaly_alerts.md)
+ [OPS08-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics](ops_workload_health_biz_level_view_workload.md)

# OPS08-BP01 Identify key performance indicators
<a name="ops_workload_health_define_workload_kpis"></a>

 Identify key performance indicators (KPIs) based on desired business outcomes (for example, order rate, customer retention rate, and profit versus operating expense) and customer outcomes (for example, customer satisfaction). Evaluate KPIs to determine workload success. 

 **Common anti-patterns:** 
+  You are asked by business leadership how successful a workload has been serving business needs but have no frame of reference to determine success. 
+  You are unable to determine if the commercial off-the-shelf application you operate for your organization is cost-effective. 

 **Benefits of establishing this best practice:** By identifying key performance indicators you help achieve business outcomes as the test of the health and success of your workload. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine workload success. 

# OPS08-BP02 Define workload metrics
<a name="ops_workload_health_design_workload_metrics"></a>

Define metrics that measure the health of the workload. Workload health is measured by the achievement of business outcomes (KPIs) and the state of workload components and applications. Examples of KPIs are abandoned shopping carts, orders placed, cost, price, and allocated workload expense. While you may collect telemetry from multiple components, select a subset that provides insight into the overall workload health. Adjust workload metrics over time as business needs change. 

 **Desired outcome:** 
+  You have identified metrics that validate the achievement of KPIs that reflect business outcomes. 
+  You have metrics that show a consistent view of workload health. 
+  Workload metrics are evaluated periodically as business needs change. 

 **Common anti-patterns:** 
+ You are monitoring all the applications in your workload but are unable to determine if your workload is achieving business outcomes.
+ You have defined workload metrics but they are not associated to any business KPIs.

 **Benefits of establishing this best practice:** 
+  You can measure your workload against the achievement of business outcomes. 
+  You know if your workload is in a healthy state or needs intervention. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 The goal of this best practice is that you can answer the following question: is my workload healthy? Workload health is determined by the achievement of business outcomes and the state of applications and components in the workload. Work backwards from business KPIs to identify metrics. Identify key metrics from components and applications. Periodically review workload metrics as business needs change. 

 **Customer example** 

 Workload health is determined at AnyCompany Retail by a collection of application and component metrics. Starting with business KPIs, they identify metrics like order rate that can show they are achieving business outcomes. They also include key application metrics like page response and component metrics like open database connections. On a quarterly basis, they re-evaluate workload metrics to make sure they are still valid in determining workload health. 

 **Implementation steps** 

1.  Starting with business KPIs, identify metrics that show you are achieving business outcomes. If there are KPIs that do not have metrics, instrument your workload with additional metrics for any missing business KPIs. 

   1.  You can publish custom metrics from your applications to [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). 

   1.  The [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) can collect metrics from existing applications and be used to add new metrics. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop will help you build an observability strategy for your workload. 

1.  Identify metrics for applications and components in the workload. What are key metrics that show the health of individual components and applications? Applications and components may emit many different metrics, but choose one to three key metrics that show their overall health. 

1.  Implement a mechanism to evaluate workload metrics periodically. When business KPIs change, work with stakeholders to update workload metrics. As your workload components and applications evolve, adjust your workload metrics. 

 **Level of effort for the implementation plan:** Medium. Adding metrics for business KPIs to applications may require moderate effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - Your application must emit telemetry that supports business outcomes. 
+  [OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md) - You must instrument your workload to emit telemetry before you can define workload metrics that support business outcomes. 
+  [OPS08-BP01 Identify key performance indicators](ops_workload_health_define_workload_kpis.md) - You must identify key performance indicators first before selecting workload metrics. 

 **Related documents:** 
+ [ Adding metrics and traces to your application on Amazon EKS with AWS Distro for OpenTelemetry, AWS X-Ray, and Amazon CloudWatch ](https://aws.amazon.com/blogs/mt/adding-metrics-and-traces-to-your-application-on-amazon-eks-with-aws-distro-for-opentelemetry-aws-x-ray-and-amazon-cloudwatch/)
+ [ Instrumenting distributed systems for operational visibility ](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/)
+ [ Implementing health checks ](https://aws.amazon.com/builders-library/implementing-health-checks/)
+ [ How to Monitor your Applications Effectively ](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/)
+ [ How to better monitor your custom application metrics using Amazon CloudWatch Agent ](https://aws.amazon.com/blogs/devops/new-how-to-better-monitor-your-custom-application-metrics-using-amazon-cloudwatch-agent/)

 **Related videos:** 
+ [AWS re:Invent 2020: Monitoring production services at Amazon ](https://www.youtube.com/watch?v=hnPcf_Czbvw)
+ [AWS re:Invent 2022 - Building observable applications with OpenTelemetry (BOA310) ](https://www.youtube.com/watch?v=efk8XFJrW2c)
+ [ How to Easily Setup Application Monitoring for Your AWS Workloads - AWS Online Tech Talks ](https://www.youtube.com/watch?v=LKCth30RqnA)
+ [ Mastering Observability of Your Serverless Applications - AWS Online Tech Talks ](https://www.youtube.com/watch?v=CtsiXhiAUq8)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

 **Related services:** 
+ [ Amazon CloudWatch ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [AWS Distro for OpenTelemetry ](https://aws-otel.github.io/)

# OPS08-BP03 Collect and analyze workload metrics
<a name="ops_workload_health_collect_analyze_workload_metrics"></a>

Perform regular, proactive reviews of workload metrics to identify trends and determine if a response is necessary and validate the achievement of business outcomes. Aggregate metrics from your workload applications and components to a central location. Use dashboards and analytics tools to analyze telemetry and determine workload health. Implement a mechanism to conduct workload health reviews on periodic basis with stakeholders in your organization. 

 **Desired outcome:** 
+  Workload metrics are collected in a central location. 
+  Dashboards and analytics tools are used to analyze workload health trends. 
+  You conduct periodic workload metric reviews with your organization. 

 **Common anti-patterns:** 
+  Your organization collects metrics from the workload in two different observability platforms. You are unable to determine workload health because the platforms are incompatible. 
+  Error rates for a component of your workload are slowly increasing. You fail to notice this trend because your organization does not conduct periodic workload metric reviews. The component fails after a week, impairing your workload. 

 **Benefits of establishing this best practice:** 
+  You have increased awareness of workload health and the achievement of business outcomes. 
+  Workload health trends can be developed over time. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Collect workload metrics in a central location. Using dashboards and analytics tools, analyze workload metrics to gain insight into workload health, develop workload health trends, and validate the achievement of business outcomes. Implement a mechanism to conduct periodic reviews of workload metrics. 

 **Customer example** 

 AnyCompany Retail conducts workload metric reviews every week on Wednesday. They gather stakeholders from across the company and go through the previous week’s metrics. During the meeting, they highlight trends and insights gleaned from analytics tools. Internal dashboards are published with key workload metrics that any employee can view and search. 

 **Implementation steps** 

1.  Identify the workload metrics that are tied to workload health. Starting with business KPIs, identify the metrics for applications, components, and platforms that provide an overall view of workload health. 

   1.  You can publish custom metrics to [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). You can leverage the [Amazon CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) to collect metrics and logs from Amazon EC2 instances and on-premises servers. 

   1.  The [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) can collect metrics from existing applications and be used to add new metrics. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop helps you build an observability strategy for your workload. 

1.  Collect workload metrics in a central platform. If workload metrics are split between different platform, this can make it difficult to analyze and develop trends. The platform should have dashboards and analytic capabilities. 

   1.  [Amazon CloudWatch](https://docs.aws.amazon.com/) can collect and store workload metrics. In multi-account topologies, it is recommended to have a [central logging and monitoring account](https://docs.aws.amazon.com/prescriptive-guidance/latest/security-reference-architecture/log-archive.html), referred to as a *log archive account*. 

1.  Build a consolidated dashboard of workload metrics. Use this view for metrics reviews and analysis of trends. 

   1.  You can create custom [CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) to collect your workload metrics in a consolidated view. 

1.  Implement a workload metric review process. On a weekly, bi-weekly, or monthly basis, review your workload metrics with stakeholders, including technical and non-technical personnel. Use these review sessions to identify trends and gain insight into workload health. 

 **Level of effort for the implementation plan:** High. If workload metrics are not centrally collected, it could require significant investment to consolidate them in one platform. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS08-BP01 Identify key performance indicators](ops_workload_health_define_workload_kpis.md) - You must identify key performance indicators first before selecting workload metrics. 
+  [OPS08-BP02 Define workload metrics](ops_workload_health_design_workload_metrics.md) - You must define workload metrics before collecting and analyzing them. 

 **Related documents:** 
+ [ Power operational insights with Amazon Quick ](https://aws.amazon.com/blogs/big-data/power-operational-insights-with-amazon-quicksight/)
+ [ Using Amazon CloudWatch dashboards custom widgets ](https://aws.amazon.com/blogs/mt/introducing-amazon-cloudwatch-dashboards-custom-widgets/)

 **Related videos:** 
+ [ Create Cross Account & Cross Region CloudWatch Dashboards ](https://www.youtube.com/watch?v=eIUZdaqColg)
+ [ Monitor AWS Resources Using Amazon CloudWatch Dashboards ](https://www.youtube.com/watch?v=I7EFLChc07M)

 **Related examples:** 
+ [AWS Management and Governance Tools Workshop - CloudWatch Dashboards ](https://mng.workshop.aws/operations-2022/detect/cwdashboard.html)
+ [ Well-Architected Labs - Level 100: Monitoring with CloudWatch Dashboards ](https://www.wellarchitectedlabs.com/performance-efficiency/100_labs/100_monitoring_with_cloudwatch_dashboards/)

 **Related services:** 
+  [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 
+ [AWS Distro for OpenTelemetry](https://aws-otel.github.io/)

# OPS08-BP04 Establish workload metrics baselines
<a name="ops_workload_health_workload_metric_baselines"></a>

Establishing a baseline for workload metrics aids in understanding workload health and performance. Using baselines, you can identify under- and over-performing applications and components. A workload baseline adds to your ability to mitigate issues before they become incidents. Baselines are foundational in developing patterns of activity and implementing anomaly detection when metrics deviate from expected values. 

 **Desired outcome:** 
+  You have a baseline level of metrics for your workload under normal conditions. 
+  You can determine if your workload is functioning normally. 

 **Common anti-patterns:** 
+  After deploying a new feature, there is drop in request latency. A baseline was not established for a composite metric of incoming processed requests and overall latency. You are unable to determine if the change caused an improvement or caused a defect. 
+  A sudden spike in user activity occurs, but you have not established a metric baseline. The activity spike slowly leads to a memory leak in an application. Eventually this takes your workload offline. 

 **Benefits of establishing this best practice:** 
+  You understand the normal pattern of activity for your workload using metrics for key components and applications. 
+  You can determine if your workload, its applications, and components, are behaving normally or may require intervention. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use historical data to establish a baseline of workload metrics for applications and components in your workload. Leverage the metric baseline in metric review meetings and troubleshooting. Periodically review workload performance and adjust the baseline as the architecture evolves. 

 **Customer example** 

 Baselines are established for all components and applications at AnyCompany Retail. Using historical data, AnyCompany Retail developed their workload metric baselines over a two-month metric window. Every two months they re-assess baselines and adjust them based on real-world data. 

 **Implementation steps** 

1.  Working backwards from your workload metrics, establish a metric baseline for key components and applications using historical data. Limit the number of metrics per component or application and avoid monitor fatigue. 

   1.  You can use [Amazon CloudWatch Metrics Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html) to query metrics at scale and identify trends and patterns. 

   1.  [Amazon CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) uses machine learning algorithms to identify patterns of behavior for metrics, determine baselines, and surfaces anomalies. 

   1.  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) provides the ability to detect operational issues with your workload using machine learning. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop will help you build an observability strategy for your workload. 

1.  Put in place a mechanism to periodically review workload metric baselines, especially before significant business events. At least once a quarter, evaluate your workload metric baseline using historical data. Use the baseline in your metric review meetings. 

 **Level of effort for the implementation plan:** Low. Having established workload metrics, establishing baselines may require you to collect enough data to identify normal patterns of behavior. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS08-BP02 Define workload metrics](ops_workload_health_design_workload_metrics.md) - Workload metrics must be established first before determining baselines. 
+  [OPS08-BP03 Collect and analyze workload metrics](ops_workload_health_collect_analyze_workload_metrics.md) - Collecting and analyzing workload metrics is necessary to have in place before establishing metric baselines. 
+  [OPS08-BP05 Learn expected patterns of activity for workload](ops_workload_health_learn_workload_usage_patterns.md) - This best practice builds on top of the baseline to develop usage trends. 
+  [OPS08-BP06 Alert when workload outcomes are at risk](ops_workload_health_workload_outcome_alerts.md) - Metric baselines are necessary to identifying thresholds and developing alerts. 
+  [OPS08-BP07 Alert when workload anomalies are detected](ops_workload_health_workload_anomaly_alerts.md) - Anomaly detection requires the establishment of metric baselines. 

 **Related documents:** 
+ [AWS Observability Best Practices - Alarms ](https://aws-observability.github.io/observability-best-practices/tools/alarms/)
+ [ How to Monitor your Applications Effectively ](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/)
+ [ How to set up CloudWatch Anomaly Detection to set dynamic alarms, automate actions, and drive online sales ](https://aws.amazon.com/blogs/mt/how-to-set-up-cloudwatch-anomaly-detection-to-set-dynamic-alarms-automate-actions-and-drive-online-sales/)
+ [ Operationalizing CloudWatch Anomaly Detection ](https://aws.amazon.com/blogs/mt/operationalizing-cloudwatch-anomaly-detection/)

 **Related videos:** 
+ [AWS re:Invent 2020: Monitoring production services at Amazon ](https://www.youtube.com/watch?v=hnPcf_Czbvw)
+ [AWS re:Invent 2021- Get insights from operational metrics at scale with CloudWatch Metrics Insights ](https://www.youtube.com/watch?v=xKib0xvbIfo)
+ [AWS re:Invent 2022 - Developing an observability strategy (COP302) ](https://www.youtube.com/watch?v=Ub3ATriFapQ)
+ [AWS Summit DC 2022 - Monitoring and observability for modern applications ](https://www.youtube.com/watch?v=AHiuyT0B5Gk)
+ [AWS Summit SF 2022 - Full-stack observability and application monitoring with AWS (COP310) ](https://www.youtube.com/watch?v=or7uFFyHIX0)

 **Related examples:** 
+ [AWS CloudTrail and Amazon CloudWatch Integration Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/2e48b9fc-f721-4417-b811-962b7f31b61c/en-US)

 **Related services:** 
+ [ Amazon CloudWatch ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [ Amazon DevOps Guru ](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html)

# OPS08-BP05 Learn expected patterns of activity for workload
<a name="ops_workload_health_learn_workload_usage_patterns"></a>

 Establish patterns of workload activity to identify anomalous behavior so that you can respond appropriately if required. 

 CloudWatch through the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature applies statistical and machine learning algorithms to generate a range of expected values that represent normal metric behavior. 

 [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) can be used to identify anomalous behavior through event correlation, log analysis, and applying machine learning to analyze your workload telemetry. When unexpected behaviors are detected, it provides the [related metrics and events](https://docs.aws.amazon.com/devops-guru/latest/userguide/understanding-insights-console.html) with recommendations to address the behavior. 

 **Common anti-patterns:** 
+  You are reviewing network utilization logs and see that network utilization increased between 11:30am and 1:30pm and then again at 4:30pm through 6:00pm. You are unaware if this should be considered normal or not. 
+  Your web servers reboot every night at 3:00am. You are unaware if this is an expected behavior. 

 **Benefits of establishing this best practice:** By learning patterns of behavior you can recognize unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Learn expected patterns of activity for workload: Establish patterns of workload activity to determine when behavior is outside of the expected values so that you can respond appropriately if required. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon DevOps Guru](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 

# OPS08-BP06 Alert when workload outcomes are at risk
<a name="ops_workload_health_workload_outcome_alerts"></a>

 Raise an alert when workload outcomes are at risk so that you can respond appropriately if necessary. 

 Ideally, you have previously identified a metric threshold that you are able to alarm upon or an event that you can use to initiate an automated response. 

 On AWS, you can use [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to create canary scripts to monitor your endpoints and APIs by performing the same actions as your customers. The telemetry generated and the [insight gained](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_Details.html) can help you to identify issues before your customers are impacted. 

 You can also use [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to interactively search and analyze your log data using a purpose-built query language. CloudWatch Logs Insights automatically [discovers fields in logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_AnalyzeLogData-discoverable-fields.html) from AWS services, and custom log events in JSON. It scales with your log volume and query complexity and gives you answers in seconds, helping you to search for the contributing factors of an incident. 

 **Common anti-patterns:** 
+  You have no network connectivity. No one is aware. No one is trying to identify why or taking action to restore connectivity. 
+  Following a patch, your persistent instances have become unavailable, disrupting users. Your users have opened support cases. No one has been notified. No one is taking action. 

 **Benefits of establishing this best practice:** By identifying that business outcomes are at risk and alerting for action to be taken you have the opportunity to prevent or mitigate the impact of an incident. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) 
+  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS08-BP07 Alert when workload anomalies are detected
<a name="ops_workload_health_workload_anomaly_alerts"></a>

 Raise an alert when workload anomalies are detected so that you can respond appropriately if necessary. 

 Your analysis of your workload metrics over time may establish patterns of behavior that you can quantify sufficiently to define an event or raise an alarm in response. 

 Once trained, the [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) feature can be used to [alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html) on detected anomalies or can provide overlaid expected values onto a [graph](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph_a_metric.html#create-metric-graph) of metric data for ongoing comparison. 

 **Common anti-patterns:** 
+  Your retail website sales have increased suddenly and dramatically. No one is aware. No one is trying to identify what led to this surge. No one is taking action to ensure quality customer experiences under the additional load. 
+  Following the application of a patch, your persistent servers are rebooting frequently, disrupting users. Your servers typically reboot up to three times but not more. No one is aware. No one is trying to identify why this is happening. 

 **Benefits of establishing this best practice:** By understanding patterns of workload behavior, you can identify unexpected behavior and take action if necessary. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Alert when workload anomalies are detected: Raise an alert when workload anomalies are detected so that you can respond appropriately if required. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
  +  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) 
+  [Invoking Lambda functions using Amazon SNS notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

# OPS08-BP08 Validate the achievement of outcomes and the effectiveness of KPIs and metrics
<a name="ops_workload_health_biz_level_view_workload"></a>

 Create a business-level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 

 AWS also has support for third-party log analysis systems and business intelligence tools through the AWS service APIs and SDKs (for example, Grafana, Kibana, and Logstash). 

 **Common anti-patterns:** 
+  Page response time has never been considered a contributor to customer satisfaction. You have never established a metric or threshold for page response time. Your customers are complaining about slowness. 
+  You have not been achieving your minimum response time goals. In an effort to improve response time, you have scaled up your application servers. You are now exceeding response time goals by a significant margin and also have significant unused capacity you are paying for. 

 **Benefits of establishing this best practice:** By reviewing and revising KPIs and metrics, you understand how your workload supports the achievement of your business outcomes and can identify where improvement is needed to reach business goals. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business level view of your workload operations to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary. 
  +  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
  +  [What is log analytics?](https://aws.amazon.com/log-analytics/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Using Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [What is log analytics?](https://aws.amazon.com/log-analytics/) 