

 This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

# Multi-AZ observability

 To be able to evacuate an Availability Zone during an event that is isolated to a single Availability Zone, you first must be able to detect that the failure is, in fact, isolated to a single Availability Zone. This requires high-fidelity visibility into how the system is behaving in each Availability Zone. Many AWS services provide out-of-the-box metrics that provide operational insights about your resources. For example, Amazon EC2 provides numerous metrics such as CPU utilization, disk reads and writes, and network traffic in and out. 

 However, as you build workloads that use these services, you need more visibility than just those standard metrics. You want visibility into the customer experience being provided by your workload. Additionally, you need your metrics to be aligned to the Availability Zones where they are being produced. This is the insight you need to detect differentially observable gray failures. That level of visibility requires instrumentation. 

 Instrumentation requires writing explicit code. This code should do things such as record how long tasks take, count how many items succeeded or failed, collect metadata about the requests, and so on. You also need to define thresholds ahead of time to define what is considered normal and what isn’t. You should outline objectives and different severity thresholds for latency, availability, and error counts in your workload. The Amazon Builders’ Library article [Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) provides a number of best practices. 
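To make the kind of instrumentation described above concrete, the following is a minimal sketch (not taken from the whitepaper) of recording task duration, success/failure counts, and request metadata in memory; the class and function names are hypothetical. In production, these records would be flushed as structured log entries (such as the EMF format shown later).

```python
import time

class Instrumentation:
    """Minimal in-memory recorder for per-request operational metrics."""

    def __init__(self):
        self.records = []

    def record(self, operation, success, latency_ms, **metadata):
        # Each record captures what happened, whether it succeeded, how long
        # it took, and any request context (path, caller, and so on).
        self.records.append({
            "Operation": operation,
            "Success": success,
            "LatencyMs": latency_ms,
            **metadata,
        })

def timed_call(instr, operation, fn, **metadata):
    """Run fn, measure wall-clock latency, and record success or failure."""
    start = time.monotonic()
    try:
        result = fn()
        instr.record(operation, True, (time.monotonic() - start) * 1000, **metadata)
        return result
    except Exception:
        instr.record(operation, False, (time.monotonic() - start) * 1000, **metadata)
        raise
```

Thresholds for what counts as normal latency or error rates would then be defined separately, as alarms over these recorded values.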

 Metrics should be generated from both the server side and the client side. A best practice for generating client-side metrics and understanding the customer experience is using [canaries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html), software that regularly probes your workload and records metrics. 

 In addition to producing these metrics, you also need to understand their context. One way to do this is by using [dimensions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Dimension). Dimensions give a metric a unique identity, and help explain what the metrics are telling you. For metrics that are used to identify failure in your workload (for example, latency, availability, or error count), you need to use dimensions that align to your [fault isolation boundaries](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/use-fault-isolation-to-protect-your-workload.html). 

 For example, if you are running a web service in one Region, across multiple Availability Zones, using a [Model-view-controller](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller) (MVC) web framework, you should use `Region`, [Availability Zone ID](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html), `Controller`, `Action`, and `InstanceId` as the dimensions for your dimension sets (if you were using microservices, you might use the service name and HTTP method instead of the controller and action names). This is because you expect different types of failures to be isolated by these boundaries. You wouldn’t expect a bug in your web service’s code that affects its ability to list products to also impact the home page. Similarly, you wouldn’t expect a full EBS volume on a single EC2 instance to prevent other EC2 instances from serving your web content. The Availability Zone ID dimension is what enables you to identify Availability Zone-related impacts consistently across AWS accounts. You can find the Availability Zone ID in your workloads in a number of different ways. Refer to [Appendix A – Getting the Availability Zone ID](appendix-a-getting-the-availability-zone-id.md) for some examples. 

 While this document mainly uses Amazon EC2 as the compute resource in its examples, for [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS) and [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (Amazon EKS) compute resources, `InstanceId` could be replaced with a container ID as a component of your dimensions. 

 Your canaries can also use `Controller`, `Action`, `AZ-ID`, and `Region` as dimensions in their metrics if you have zonal endpoints for your workload. In this case, align your canaries to run in the Availability Zone that they are testing. This ensures that if an isolated Availability Zone event is impacting the Availability Zone in which your canary is running, it doesn’t record metrics that make a different Availability Zone it is testing appear unhealthy. For example, your canary can test each zonal endpoint for a service behind a Network Load Balancer (NLB) or Application Load Balancer (ALB) using its [zonal DNS names](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#dns-name). 

![\[Diagram showing a canary running on CloudWatch Synthetics or an AWS Lambda function testing each zonal endpoint of an NLB\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/canary-testing-for-zonal-impact.png)
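A zonal canary probe like the one in the figure could be sketched as follows. This is an illustrative example, not the CloudWatch Synthetics API: the endpoint URLs are hypothetical placeholders (real NLB zonal DNS names follow the pattern `<az-name>.<load-balancer-dns-name>`), and the recorded results would be published as metrics with AZ-aligned dimensions.

```python
import time
import urllib.request

# Hypothetical zonal endpoints keyed by AZ ID; replace with your load
# balancer's real zonal DNS names.
ZONAL_ENDPOINTS = {
    "use1-az1": "http://us-east-1a.my-nlb.example.com/health",
    "use1-az2": "http://us-east-1b.my-nlb.example.com/health",
}

def probe(url, timeout=5):
    """Probe one zonal endpoint and return (success, latency_ms).

    Run the canary inside the Availability Zone it tests, so an impairment in
    the canary's own zone can't make a different zone look unhealthy.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except Exception:
        ok = False
    return ok, (time.monotonic() - start) * 1000
```

Each probe result would then be emitted with `AZ-ID` and `Region` dimensions so canary alarms align to the same fault isolation boundaries as the server-side metrics.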


 By producing metrics with these dimensions, you can establish [Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) that notify you when changes in availability or latency occur within those boundaries. You can also quickly analyze that data using [dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html). To use both metrics and logs efficiently, Amazon CloudWatch offers the [embedded metric format](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html) (EMF) that enables you to embed custom metrics with log data. CloudWatch automatically extracts the custom metrics so you can visualize and alarm on them. AWS provides several [client libraries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Libraries.html) for different programming languages that make it easy to get started with EMF. They can be used with Amazon EC2, Amazon ECS, Amazon EKS, [AWS Lambda](https://aws.amazon.com/lambda/), and on-premises environments. With metrics embedded into your logs, you can also use [Amazon CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to create time series graphs that display contributor data. In this scenario, we could display data grouped by dimensions like `AZ-ID`, `InstanceId`, or `Controller` as well as any other field in the log like `SuccessLatency` or `HttpResponseCode`. 

```
{ 
  "_aws": { 
    "Timestamp": 1634319245221, 
    "CloudWatchMetrics": [ 
      { 
        "Namespace": "workloadname/frontend", 
        "Metrics": [ 
          { "Name": "2xx", "Unit": "Count" }, 
          { "Name": "3xx", "Unit": "Count" }, 
          { "Name": "4xx", "Unit": "Count" }, 
          { "Name": "5xx", "Unit": "Count" }, 
          { "Name": "SuccessLatency", "Unit": "Milliseconds" } 
        ], 
        "Dimensions": [ 
          [ "Controller", "Action", "Region", "AZ-ID", "InstanceId"], 
          [ "Controller", "Action", "Region", "AZ-ID"], 
          [ "Controller", "Action", "Region"] 
        ] 
      }
    ], 
    "LogGroupName": "/loggroupname" 
  }, 
  "CacheRefresh": false, 
  "Host": "use1-az2-name.example.com", 
  "SourceIp": "34.230.82.196", 
  "TraceId": "|e3628548-42e164ee4d1379bf.", 
  "Path": "/home", 
  "OneBox": false, 
  "Controller": "Home", 
  "Action": "Index", 
  "Region": "us-east-1", 
  "AZ-ID": "use1-az2", 
  "InstanceId": "i-01ab0b7241214d494", 
  "LogGroupName": "/loggroupname", 
  "HttpResponseCode": 200,
  "2xx": 1, 
  "3xx": 0, 
  "4xx": 0, 
  "5xx": 0, 
  "SuccessLatency": 20 
}
```

This log has three sets of dimensions. They progress in order of granularity, from instance to Availability Zone to Region (`Controller` and `Action` are always included in this example). They support creating alarms across your workload that indicate when there is impact to a specific controller action in a single instance, in a single Availability Zone, or within a whole AWS Region. These dimensions are used for the count of 2xx, 3xx, 4xx, and 5xx HTTP response metrics, as well as the latency for successful request metrics (if the request failed, it would also record a metric for failed request latency). The log also records other information such as the HTTP path, the source IP of the requestor, and whether this request required the local cache to be refreshed. These data points can then be used to calculate the availability and latency of each API the workload provides. 

**A note on using HTTP response codes for availability metrics**  
Typically, you can consider 2xx and 3xx responses as successful, and 5xx as failures. 4xx response codes fall somewhere in the middle. Usually, they are produced due to a client error. Maybe a parameter is out of range leading to a [400 response](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), or they’re requesting something that doesn’t exist, resulting in a 404 response. You wouldn’t count these responses against your workload’s availability. However, this could also be the result of a bug in the software.  
For example, if you’ve introduced stricter input validation that rejects a request that would have succeeded before, the 400 response might count as a drop in availability. Or maybe you’re throttling the customer and returning a 429 response. While throttling a customer protects your service to maintain its availability, from the customer’s perspective, the service isn’t available to process their request. You’ll need to decide whether or not 4xx response codes are part of your availability calculation. 
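The availability calculation described above can be sketched in a few lines. This is a minimal illustration; the function name and the choice to expose the 4xx decision as a flag are assumptions, reflecting the per-workload decision discussed in the note.

```python
def availability(counts, count_4xx_as_failure=False):
    """Compute availability from HTTP response-code counts.

    counts: a dict like {"2xx": 950, "3xx": 20, "4xx": 25, "5xx": 5}.
    2xx and 3xx responses are successes and 5xx are failures; whether 4xx
    counts against availability is a per-workload decision.
    """
    successes = counts.get("2xx", 0) + counts.get("3xx", 0)
    failures = counts.get("5xx", 0)
    if count_4xx_as_failure:
        failures += counts.get("4xx", 0)
    else:
        successes += counts.get("4xx", 0)
    total = successes + failures
    return successes / total if total else 1.0
```

In practice this ratio would be computed with CloudWatch metric math over the 2xx/3xx/4xx/5xx metrics from the EMF logs, per dimension set.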

While this section has outlined using CloudWatch as a way to collect and analyze metrics, it’s not the only solution you can use. You might choose to also send metrics into Amazon Managed Service for Prometheus and Amazon Managed Grafana, an Amazon DynamoDB table, or use a third-party monitoring solution. The key is that the metrics your workload produces must contain context about the fault isolation boundaries of your workload. 

With workloads that produce metrics with dimensions aligned to fault isolation boundaries, you can create observability that detects Availability Zone isolated failures. The following sections describe three complementary approaches for detecting failures that arise from the impairment of a single Availability Zone. 

**Topics**
+ [Failure detection with CloudWatch composite alarms](failure-detection-with-cloudwatch-composite-alarms.md)
+ [Failure detection using outlier detection](failure-detection-using-outlier-detection.md)
+ [Failure detection of single instance zonal resources](failure-detection-of-single-instance-zonal-resources.md)
+ [Summary](summary.md)

# Failure detection with CloudWatch composite alarms

 In CloudWatch metrics, each dimension set is a unique metric, and you can create a CloudWatch alarm on each one. You can then create [Amazon CloudWatch composite alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html) to aggregate these metrics. 

 To accurately detect impact, the examples in this paper use two different CloudWatch alarm structures for each dimension set they alarm on. Each alarm uses a **Period** of one minute, meaning the metric is evaluated once per minute. The first structure requires three consecutive breaching data points by setting both **Evaluation Periods** and **Datapoints to Alarm** to three, meaning three minutes of sustained impact. The second structure uses an "M out of N" approach that alarms when any three data points in a five-minute window are breaching, by setting **Evaluation Periods** to five and **Datapoints to Alarm** to three. Together, these detect both a constant signal and one that fluctuates over a short time. The time durations and numbers of data points given here are suggestions; use values that make sense for your workload. 
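As an illustration, the two alarm structures could be expressed as `put_metric_alarm` parameters for the CloudWatch API. This is a sketch only: the `workloadname/frontend` namespace matches the EMF example later in this section, but the `SuccessRate` metric name (a pre-computed availability percentage) and the hard-coded Region are assumptions.

```python
def zonal_availability_alarms(az_id, controller, action, threshold=99.0):
    """Build the two alarm definitions described above as keyword arguments
    for CloudWatch put_metric_alarm. No AWS call is made here."""
    base = {
        "Namespace": "workloadname/frontend",
        # Hypothetical pre-computed availability metric (percent successful).
        "MetricName": "SuccessRate",
        "Dimensions": [
            {"Name": "Controller", "Value": controller},
            {"Name": "Action", "Value": action},
            {"Name": "AZ-ID", "Value": az_id},
            {"Name": "Region", "Value": "us-east-1"},
        ],
        "Statistic": "Average",
        "Period": 60,  # one-minute evaluation period
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
    }
    # Structure 1: three consecutive breaching data points.
    in_a_row = {**base,
        "AlarmName": f"{az_id}-{controller}-{action}-availability-3-in-a-row",
        "EvaluationPeriods": 3, "DatapointsToAlarm": 3}
    # Structure 2: any three breaching data points in a five-minute window.
    m_of_n = {**base,
        "AlarmName": f"{az_id}-{controller}-{action}-availability-3-of-5",
        "EvaluationPeriods": 5, "DatapointsToAlarm": 3}
    return in_a_row, m_of_n
```

Each dict could be passed as `cloudwatch_client.put_metric_alarm(**params)` with boto3.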

## Detect impact in a single Availability Zone

 Using this construct, consider a workload that uses `Controller`, `Action`, `InstanceId`, `AZ-ID`, and `Region` as dimensions. The workload has two controllers, Products and Home, and one action per controller, List and Index respectively. It operates in three Availability Zones in the `us-east-1` Region. You would create two alarms for availability for each `Controller` and `Action` combination in each Availability Zone as well as two alarms for latency for each. Then, you can optionally choose to create a composite alarm for availability for each `Controller` and `Action` combination. Finally, you create a composite alarm that aggregates all of the availability alarms for the Availability Zone. This is shown in the following figure for a single Availability Zone, `use1-az1`, using the optional composite alarm for each `Controller` and `Action` combination (similar alarms would exist for the `use1-az2` and `use1-az3` Availability Zones as well, but are not shown for simplicity). 

![\[Diagram showing a composite alarm structure for availability in use1-az1\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/composite-alarm-structure-availability.png)


 You would also build a similar alarm structure for latency as well, shown in the next figure. 

![\[A diagram showing a Composite alarm structure for latency in use1-az1\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/composite-alarm-structure-latency.png)


For the remainder of the figures in this section, only the `az1-availability` and `az1-latency` composite alarms will be shown at the top level. These composite alarms, `az1-availability` and `az1-latency`, will tell you if either availability drops below or latency rises above defined thresholds in a particular Availability Zone for any part of your workload. You might also want to consider measuring throughput to detect impact that prevents your workload in a single Availability Zone from receiving work. You can integrate alarms produced from the metrics emitted by your canaries into these composite alarms as well. That way, if either the server side or the client side sees impact in availability or latency, the alarm will create an alert. 

## Ensure the impact isn’t Regional

Another set of composite alarms can be used to ensure that only an isolated Availability Zone event causes the alarm to be activated. This is performed by ensuring that an Availability Zone composite alarm is in the `ALARM` state while the composite alarms for the other Availability Zones are in the `OK` state. This will result in one composite alarm per Availability Zone that you use. An example is shown in the following figure (remember that there are alarms for latency and availability in `use1-az2` and `use1-az3`, `az2-latency`, `az2-availability`, `az3-latency`, and `az3-availability`, that are not pictured for simplicity). 

![\[A diagram showing a composite alarm structure to detect impact isolated to a single AZ\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/composite-alarm-structure-impact.png)
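The "this zone in `ALARM` while all other zones are `OK`" condition in the figure maps directly to a composite alarm's `AlarmRule` expression. A minimal sketch, assuming per-zone composite alarms named like `az1-availability` as in the figures:

```python
def isolated_az_alarm_rule(target_az, all_azs, metric="availability"):
    """Build the AlarmRule expression for a composite alarm that fires only
    when target_az is impaired and every other Availability Zone is healthy."""
    others = [az for az in all_azs if az != target_az]
    healthy = " AND ".join(f'OK("{az}-{metric}")' for az in others)
    return f'ALARM("{target_az}-{metric}") AND {healthy}'
```

For three zones this yields one rule string per zone, for example `ALARM("az1-availability") AND OK("az2-availability") AND OK("az3-availability")`, which could be supplied as the `AlarmRule` parameter of `put_composite_alarm`.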


## Ensure the impact isn’t caused by a single instance

A single instance (or a small percentage of your overall fleet) can cause disproportionate impact to availability and latency metrics that could make the whole Availability Zone appear to be affected, when in fact it is not. It is faster and just as effective to remove a single problematic instance than to evacuate an Availability Zone. 

Instances and containers are typically treated as ephemeral resources, frequently replaced with services such as [AWS Auto Scaling](https://aws.amazon.com/autoscaling/). It’s difficult to create a new CloudWatch alarm every time a new instance is created (but certainly possible using [Amazon EventBridge](https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html) or [Amazon EC2 Auto Scaling lifecycle hooks](https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html)). Instead, you can use [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to identify the quantity of contributors to availability and latency metrics. 

As an example, for an HTTP web application, you can create a rule to identify top contributors for 5xx HTTP responses in each Availability Zone. This will identify which instances are contributing to a drop in availability (our availability metric defined above is driven by the presence of 5xx errors). Using the EMF log example, create a rule using a key of `InstanceId`. Then, filter the log by the `HttpResponseCode` field. This example is a rule for the `use1-az1` Availability Zone. 

```
{
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [
            {
                "Match": "$.InstanceId",
                "IsPresent": true
            },
            {
                "Match": "$.HttpResponseCode",
                "IsPresent": true
            },
            {
                "Match": "$.HttpResponseCode",
                "GreaterThan": 499
            },
            {
                "Match": "$.HttpResponseCode",
                "LessThan": 600
            },
            {
                "Match": "$.AZ-ID",
                "In": ["use1-az1"]
            }
        ],
        "Keys": [
            "$.InstanceId"
        ]
    },
    "LogFormat": "JSON",
    "LogGroupNames": [
        "/loggroupname"
    ],
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    }
}
```

CloudWatch alarms can be created based on these rules as well. You can create alarms based on Contributor Insights rules using [metric math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) and the `INSIGHT_RULE_METRIC` function with the `UniqueContributors` metric. You can also create additional Contributor Insights rules with CloudWatch alarms for metrics like latency or error counts in addition to ones for availability. These alarms can be used with the isolated Availability Zone impact composite alarms to ensure that single instances don’t activate the alarm. The metric for the insights rule for `use1-az1` might look like the following: 

```
 INSIGHT_RULE_METRIC("5xx-errors-use1-az1", "UniqueContributors") 
```

You can define an alarm that activates when this metric is greater than a threshold; in this example, two. It activates when the number of unique contributors to 5xx responses rises above that threshold, indicating that the impact originates from more than two instances. This alarm uses a greater-than comparison instead of less-than so that a zero value for unique contributors doesn’t set off the alarm. It tells you that the impact is *not* from a single instance. Adjust this threshold for your individual workload. A general guide is to make this number 5% or more of the total resources in the Availability Zone; more than 5% of your resources being affected indicates statistical significance, given a sufficient sample size. 
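Put together, the metric math alarm could look like the following sketch of `put_metric_alarm` parameters. The rule name `5xx-errors-use1-az1` matches the `INSIGHT_RULE_METRIC` expression shown above; the alarm name and the three-in-a-row evaluation settings are illustrative choices.

```python
def not_single_instance_alarm(az_id, threshold=2):
    """put_metric_alarm kwargs for an alarm on the count of unique 5xx
    contributors, via metric math over the Contributor Insights rule."""
    return {
        "AlarmName": f"not-single-instance-{az_id}",
        "Metrics": [{
            "Id": "contributors",
            # Metric math over the Contributor Insights rule shown earlier.
            "Expression": f'INSIGHT_RULE_METRIC("5xx-errors-{az_id}", "UniqueContributors")',
            "Period": 60,
            "ReturnData": True,
        }],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": threshold,
        # Greater-than so that zero contributors never triggers the alarm.
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

This alarm then joins the composite alarm hierarchy as the "not a single instance" check.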

## Putting it all together

The following figure shows the complete composite alarm structure for a single Availability Zone: 

![\[A diagram showing a complete composite alarm structure for determining single-AZ impact\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/composite-alarm-structure-complete.png)


 The final composite alarm, `use1-az1-isolated-impact`, is activated when the composite alarm indicating isolated Availability Zone impact from latency or availability, `use1-az1-aggregate-alarm`, is in `ALARM` state and when the alarm based on the Contributor Insights rule for that same Availability Zone, `not-single-instance-use1-az1`, is also in `ALARM` state (meaning that the impact is more than a single instance). You would create this stack of alarms for each Availability Zone that your workload uses. 

You can attach an [Amazon Simple Notification Service](https://aws.amazon.com/sns/) (Amazon SNS) alert to this final alarm. All of the previous alarms are configured without an action. The alert could notify an operator via email to start manual investigation. It could also initiate automation to evacuate the Availability Zone. However, a word of caution on building automation to respond to these alerts. After an Availability Zone evacuation happens, the result should be that the increased error rates are mitigated and the alarm goes back to an `OK` state. If impact happens in another Availability Zone, it’s possible that the automation could evacuate a second or third Availability Zone, potentially removing all of the workload’s available capacity. The automation should check to see if an evacuation has already been performed before taking any action. You may also need to scale resources in other Availability Zones before an evacuation is successful. 

When you add new controllers or actions to your MVC web app, or a new microservice, or in general, any additional functionality you want to separately monitor, you only need to modify a few alarms in this setup. You will create new availability and latency alarms for that new functionality and then add those to the appropriate Availability Zone aligned availability and latency composite alarms, `az1-latency` and `az1-availability` in the example we’ve been using here. The remaining composite alarms remain static after they have been configured. This makes onboarding new functionality with this approach a simpler process. 

# Failure detection using outlier detection

One gap with the previous approach could arise when you see elevated error rates in multiple Availability Zones that are occurring for an *uncorrelated* reason. Imagine a scenario where you have EC2 instances deployed across three Availability Zones and your availability alarm threshold is 99%. Then, a single Availability Zone impairment occurs, isolating many instances and causing availability in that zone to drop to 55%. At the same time, but in a different Availability Zone, a single EC2 instance exhausts all of the storage on its EBS volume and can no longer write log files. This causes it to start returning errors, but it still passes the load balancer health checks because those don’t trigger a log file to be written. This results in availability dropping to 98% in that Availability Zone. In this case, your single Availability Zone impact alarm wouldn’t activate because you are seeing an availability impact in multiple Availability Zones. However, you could still mitigate almost all of the impact by evacuating the impaired Availability Zone. 

In some types of workloads, you might experience errors consistently across all Availability Zones where the previous availability metric might not be useful. Take AWS Lambda for example. AWS allows customers to create their own code to run in the Lambda function. To use the service, you have to upload your code in a ZIP file, including dependencies, and define the entry point to the function. But sometimes customers get this part wrong, for example, they might forget a critical dependency in the ZIP file, or mistype the method name in the Lambda function definition. This causes the function to fail to be invoked and results in an error. AWS Lambda sees these kinds of errors all the time, but they’re not indicative that anything is necessarily unhealthy. However, something like an Availability Zone impairment might also cause these errors to appear. 

To find signal in this kind of noise, you can use outlier detection to determine if there is a statistically significant skew in the number of errors among Availability Zones. Although we see errors across multiple Availability Zones, if there was truly a failure in one of them, we’d expect to see a much higher error rate in that Availability Zone compared to the other ones, or potentially much lower. But how much higher or lower? 

 One way to do this analysis is by using a [chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) (χ²) test to detect statistically significant differences in error rates between Availability Zones (there are [many different algorithms for performing outlier detection](https://dataprocessing.aixcape.org/DataPreprocessing/DataCleaning/OutlierDetection/index.html)). Let’s look at how the chi-squared test works. 

A chi-squared test evaluates the probability that some distribution of results is likely to occur. In this case, we’re interested in the distribution of errors across some defined set of AZs. For this example, to make the math easier, consider four Availability Zones. 

First, establish the *null hypothesis*, which defines what you believe the default outcome is. In this test, the null hypothesis is that you expect errors to be evenly distributed across each Availability Zone. Then, generate the *alternative hypothesis*, which is that the errors are not evenly distributed indicating an Availability Zone impairment. Now you can test these hypotheses using data from your metrics. For this purpose, you’ll sample your metrics from a five-minute window. Suppose you get 1000 published data points in that window, in which you see 100 total errors. You expect that with an even distribution the errors would occur 25% of the time in each of the four Availability Zones. Assume the following table shows what you expected compared to what you actually saw. 

*Table 1: Expected versus actual errors seen*


|  AZ  |  Expected  |  Actual  | 
| --- | --- | --- | 
| use1-az1 |  25  |  20  | 
| use1-az2 |  25  |  20  | 
| use1-az3 |  25  |  25  | 
| use1-az4 |  25  |  35  | 

So, you see that the distribution in reality isn’t even. However, you might believe that this occurred due to some level of randomness in the data points you sampled. There’s some probability that this type of distribution could occur in the sample set while the null hypothesis is still true. This leads to the following question: What is the probability of getting a result at least this extreme? If that probability is below a defined threshold, you reject the null hypothesis. To be [statistically significant](https://en.wikipedia.org/wiki/Statistical_significance), this probability should be 5% or less.¹ 

¹ Craparo, Robert M. (2007). "Significance level". In Salkind, Neil J. *Encyclopedia of Measurement and Statistics* 3. Thousand Oaks, CA: SAGE Publications. pp. 889–891. ISBN 1-412-91611-9. 

 How do you calculate the probability of this outcome? You use the χ² statistic, whose distributions are very well studied, to determine the probability of getting a result this extreme or more extreme using the following formula. 

![\[Formulas for Ei, Oi, and X2\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/formulas1.png)


 For our example, this results in: 

![\[Formulas for Ei, Oi, and X2 using our example, resulting in an answer of 6.\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/formulas2.png)


 So, what does `6` mean in terms of our probability? You need to look at a chi-squared distribution with the appropriate degrees of freedom. The following figure shows several chi-squared distributions for different degrees of freedom. 

![\[Graph showing chi-squared distributions for different degrees of freedom\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/chi-squared-distributions.png)


 The degrees of freedom are calculated as one less than the number of categories in the test. In this case, because there are four Availability Zones, there are three degrees of freedom. Then, you want to know the area under the curve (the integral) for *x ≥ 6* on the *k = 3* plot. You can also use a pre-calculated table of commonly used values to approximate that value. 

*Table 2: Chi-squared critical values (each column shows the probability of a value less than the critical value)*


| Degrees of freedom | 0.75 | 0.90 | 0.95 | 0.99 | 0.999 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.323 | 2.706 | 3.841 | 6.635 | 10.828 |
| 2 | 2.773 | 4.605 | 5.991 | 9.210 | 13.816 |
| 3 | 4.108 | 6.251 | 7.815 | 11.345 | 16.266 |
| 4 | 5.385 | 7.779 | 9.488 | 13.277 | 18.467 |

For three degrees of freedom, the chi-squared value of six falls between the 0.75 and 0.90 probability columns. This means there is a greater than 10% chance of a distribution at least this extreme occurring by chance, which is not below the 5% threshold. Therefore, you fail to reject the *null hypothesis* and determine there is *not* a statistically significant difference in error rates among the Availability Zones. 
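The arithmetic above can be verified in a few lines of Python, using the values from Table 1 and the df = 3 critical value from Table 2:

```python
def chi_squared(expected, observed):
    """Pearson's chi-squared statistic: sum over i of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for e, o in zip(expected, observed))

# Table 1 values: 100 errors expected evenly across four Availability Zones.
expected = [25, 25, 25, 25]
observed = [20, 20, 25, 35]
stat = chi_squared(expected, observed)  # (25 + 25 + 0 + 100) / 25 = 6.0

# Critical value for 3 degrees of freedom at the 0.95 column of Table 2.
CRITICAL_95_DF3 = 7.815
reject_null = stat > CRITICAL_95_DF3  # False: not statistically significant
```

Because 6.0 does not exceed 7.815, the test agrees with the conclusion above.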

Performing a chi-squared statistical test isn’t natively supported in CloudWatch metric math, so you’ll need to collect the applicable error metrics from CloudWatch and run the test in a compute environment like Lambda. You can decide to perform this test at something like an MVC Controller/Action or individual microservice level, or at the Availability Zone level. You’ll need to consider whether an Availability Zone impairment would affect each Controller/Action or microservice equally, or whether something like a DNS failure might cause impact in a low throughput service and not in a high throughput service, which could mask the impact when aggregated. In either case, select the appropriate dimensions to create the query. The level of granularity will also affect the resulting CloudWatch alarms you create. 

Collect the error count metric for each Availability Zone and Controller/Action in a specified time window. First, calculate the result of the chi-squared test as either true (there was a statistically significant skew) or false (there wasn’t; that is, the null hypothesis holds). If the result is false, publish a 0 data point to your metric stream for chi-squared results for each Availability Zone. If the result is true, publish a 1 data point for the Availability Zone with the error count farthest from the expected value and a 0 for the others (refer to [Appendix B – Example chi-squared calculation](appendix-b-example-chi-squared-calculation.md) for sample code that can be used in a Lambda function). You can follow the same approach as the previous availability alarms by creating a three-in-a-row CloudWatch metric alarm and a three-out-of-five CloudWatch metric alarm based on the data points produced by the Lambda function. As in the previous examples, this approach can be modified to use more or fewer data points in a shorter or longer window. 
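The publishing logic described in this paragraph could be sketched as follows (the function name is hypothetical; the appendix contains the whitepaper's own sample code). It runs the test and returns the 0/1 data point to publish for each Availability Zone, flagging only the outlier when the skew is significant.

```python
def chi_squared_datapoints(errors_by_az, critical_value=7.815):
    """Return an {az: 0 or 1} map of data points to publish: 1 for the outlier
    AZ when the chi-squared test is significant, otherwise 0 for every AZ.

    critical_value defaults to the 0.95-level value for 3 degrees of freedom
    (four AZs); choose the value matching your AZ count.
    """
    azs = sorted(errors_by_az)
    points = {az: 0 for az in azs}
    total = sum(errors_by_az.values())
    if total == 0:
        return points  # no errors at all; nothing to test
    expected = total / len(azs)  # null hypothesis: errors evenly distributed
    stat = sum((errors_by_az[az] - expected) ** 2 / expected for az in azs)
    if stat > critical_value:
        # Flag the AZ whose error count is farthest from the expected value.
        outlier = max(azs, key=lambda az: abs(errors_by_az[az] - expected))
        points[outlier] = 1
    return points
```

In a Lambda function, each entry of the returned map would be published with `put_metric_data` under the corresponding `AZ-ID` dimension.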

Then, add these alarms to your existing Availability Zone availability alarm for the Controller and Action combination, shown in the following figure.

![\[Diagram showing integrating the chi-squared statistics test with composite alarms\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/statistics-test-with-composite-alarms.png)


As mentioned previously, when you onboard new functionality in your workload, you only need to create the appropriate CloudWatch metric alarms that are specific to that new functionality and update the next tier in the composite alarm hierarchy to include those alarms. The rest of the alarm structure remains static. 

# Failure detection of single instance zonal resources

In some cases, you might have a single active instance of a zonal resource, most commonly systems that require a single-writer component such as a relational database (such as Amazon RDS) or a distributed cache (such as [Amazon ElastiCache (Redis OSS)](https://aws.amazon.com/elasticache/redis/)). If a single Availability Zone impairment affects the Availability Zone that the primary resource is in, it can cause impact to every Availability Zone that accesses the resource. This could cause availability thresholds to be crossed in every Availability Zone, meaning the first approach wouldn’t correctly identify the single Availability Zone source of impact. Additionally, you would likely see similar error rates in each Availability Zone, meaning the outlier analysis also wouldn’t detect the problem. What this means is that you need to implement additional observability to specifically detect this scenario. 

It's likely that the resource you’re concerned about will produce its own metrics about its health, but during an Availability Zone impairment that resource might not be able to deliver those metrics. In this scenario, you should create or update alarms to know when you are *flying blind*. If there are important metrics that you already monitor and alarm on, you can configure the alarm to treat the [missing data](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data) as breaching. This will help you know if the resource stops reporting data, and can be included in the same *in a row* and *m out of n* alarms used previously. 
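As a minimal sketch, the following builds the `put_metric_alarm` parameters for such an alarm, with `TreatMissingData` set to `breaching` so the alarm also fires when the resource stops reporting the metric entirely. The metric, dimension, threshold, and alarm name are illustrative assumptions, not prescribed values.

```python
def missing_data_alarm_params(metric_name, namespace, dimensions,
                              threshold, alarm_name):
    """Parameters for cloudwatch.put_metric_alarm(). TreatMissingData
    is 'breaching', so missing data counts as a breach and you know
    when you are flying blind. Names and threshold are illustrative."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,  # 3 in a row; use 5 and 3 for 3 out of 5
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
    }

params = missing_data_alarm_params(
    "FreeableMemory", "AWS/RDS",
    [{"Name": "DBInstanceIdentifier", "Value": "primary-db"}],
    threshold=100_000_000, alarm_name="primary-db-freeable-memory")
# cloudwatch.put_metric_alarm(**params)
```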

It’s also possible that some of the metrics indicating the health of the resource publish a zero-valued data point when there is no activity. If the impairment is preventing interactions with the resource, you can’t use the missing data approach for these kinds of metrics. You also probably don’t want to alarm on the value being zero, since there could be legitimate scenarios where that is within normal thresholds. The best approach to detecting this type of problem is with metrics produced by the resources that use this dependency. In this case, you want to detect impact in *multiple* Availability Zones using composite alarms. These alarms should use a handful of critical metric categories related to the resource. A few examples are listed below: 
+  **Throughput** – The rate of incoming units of work. This could be transactions, reads, writes, and so on. 
+  **Availability** – Measure the number of successful vs failed units of work. 
+  **Latency** – Measure multiple percentiles of latency for successful work performed across critical operations. 

Once again, you can create the *in a row* and *m out of n* metric alarms for each metric in each metric category that you want to measure. As before, these can be combined into a composite alarm to determine that this shared resource is the source of impact across Availability Zones. You want to be able to identify impact to more than one Availability Zone with the composite alarms, but the impact does not necessarily need to be *all* Availability Zones. The high-level composite alarm structure for this kind of approach is shown in the following figure.
![\[Diagram showing an example of creating alarms to detect impact to multiple Availability Zones caused by a single resource\]](http://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/images/creating-alarms-to-detect-impact.png)

You will notice that this diagram is less prescriptive about what type of metric alarms should be used and the hierarchy of the composite alarms. This is because discovering this kind of problem can be difficult and will require careful attention to the right signals for the shared resource. Those signals may also need to be evaluated in specific ways. 

Additionally, you should notice that the `primary-database-impact` alarm is not associated with a specific Availability Zone. This is because the primary database instance can be located in any Availability Zone that it is configured to use, and there’s not a CloudWatch metric that specifies where it is. When you see this alarm activate, you should use it as a signal that there may be a problem with the resource and initiate a failover to another Availability Zone, if it hasn’t been done automatically. After moving the resource to another Availability Zone, you can wait and see if your isolated Availability Zone alarm is activated, or you can choose to preemptively invoke your Availability Zone evacuation plan. 
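As a sketch, the "more than one Availability Zone" condition for a `primary-database-impact` composite alarm can be expressed as a CloudWatch `AlarmRule` that requires at least two of the per-AZ alarms to be in the `ALARM` state. The per-AZ alarm names below are hypothetical placeholders.

```python
from itertools import combinations

def multi_az_impact_rule(per_az_alarms):
    """Build a composite AlarmRule that fires when at least two of the
    per-AZ alarms are in ALARM, suggesting a shared zonal resource
    rather than an isolated AZ. Alarm names are illustrative."""
    pairs = [
        f'(ALARM("{a}") AND ALARM("{b}"))'
        for a, b in combinations(per_az_alarms, 2)
    ]
    return " OR ".join(pairs)

rule = multi_az_impact_rule([
    "use1-az1-availability",
    "use1-az2-availability",
    "use1-az4-availability",
])
# cloudwatch.put_composite_alarm(AlarmName="primary-database-impact",
#                                AlarmRule=rule)
```

Because the rule enumerates pairs rather than requiring all AZs, it still fires when only two of three Availability Zones see impact from the shared resource.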

# Summary


This section described three approaches to help identify single Availability Zone impairments. These approaches should be used together to provide a holistic view of your workload’s health.

The CloudWatch composite alarm approach allows you to find problems where the skew in availability isn’t statistically significant (say, availabilities of 98% in the impaired Availability Zone, 100%, and 99.99%) and the impact isn’t caused by a single, shared resource.

Outlier detection will help detect single Availability Zone impairments where you have uncorrelated errors happening in multiple Availability Zones that all surpass your alarm threshold.

Finally, identifying degradation of a single instance zonal resource helps discover when an Availability Zone impairment affects a resource that is shared across Availability Zones.

The resulting alarms from each one of these patterns can be combined into a CloudWatch composite alarm hierarchy to discover when single Availability Zone impairments occur and have impact to the availability or latency of your workload. 