

# Mitigating potential failures
<a name="mitigating-failures"></a>

Now that you have identified potential failures for the components in a user story, you can focus on mitigations. First, review the trade-offs of each mitigation in relation to the impact and likelihood of the failure it addresses. Then determine the required level of observability and select a mitigation strategy. The trade-offs should include the effort to instrument the right level of observability and to implement the mitigation strategy. Finally, determine the right cadence for conducting regular resilience analysis reviews.

**Sections**
+ [Understanding trade-offs and risks](tradeoffs.md)
+ [Failure mode observability](observability.md)
+ [Common mitigation strategies](mitigation-strategies.md)
+ [Continuous improvement](continuous-improvement.md)

# Understanding trade-offs and risks
<a name="tradeoffs"></a>

Resilient architectures should use a handful of well-tested, simple, and reliable mechanisms to respond to failures. To achieve the highest levels of resilience, workloads should automatically detect and recover from as many failure modes as possible. Doing so requires extensive investment in performing resilience analysis. This means that achieving higher levels of resilience involves making trade-offs. However, as you continue to make trade-offs, you reach a point of diminishing returns relative to your resilience objectives. Here are the most typical trade-offs:
+ **Cost** – Redundant components, enhanced observability, additional tools, or increased resource utilization will result in increased costs.
+ **System complexity** – Detecting and responding to failure modes, building mitigation solutions, and potentially forgoing managed services all increase system complexity.
+ **Engineering effort** – Additional developer hours are required to build solutions to detect and respond to failure modes.
+ **Operational overhead** – Monitoring and operating a system that handles more failure modes can add operational overhead, particularly when you can't use managed services to mitigate specific failure modes.
+ **Latency and consistency** – Building distributed systems that favor availability requires trade-offs in consistency and latency, as described in the [PACELC theorem](http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf).



![The probability of achieving resilience objectives based on the trade-offs being made, where you reach a point of diminishing returns](http://docs.aws.amazon.com/prescriptive-guidance/latest/resilience-analysis-framework/images/tradeoffs.png)


As you consider the mitigations for the identified failure modes in the user story, consider the trade-offs you need to make. As with security, resilience is an optimization problem. You have to make a decision on whether to avoid, mitigate, transfer, or accept the risks posed by the identified failure. There might be some failure modes you can avoid, a set that you accept, and a few that you can transfer. You might choose to mitigate many of the failure modes you identify. To determine which approach to take, perform an assessment by asking two questions: What is the likelihood that the failure will occur? What is the impact to the workload if it does occur?

**Likelihood** is how likely it is that an event will occur. For example, if the user story has a component that operates on a single Amazon Elastic Compute Cloud (Amazon EC2) instance, the component might be disrupted at some point during the system's operation, perhaps due to patching procedures or operating system errors. Alternatively, a database that's managed by Amazon Relational Database Service (Amazon RDS) and that synchronizes data between its primary and secondary instances has a low likelihood of becoming completely unavailable.

**Impact** is an estimate of the harm that an event can cause. It should be assessed from both a financial and a reputational perspective, and is relative to the value of the user stories it impacts. For example, an overwhelmed database could have a significant impact on an e-commerce system's ability to accept new orders. However, the loss of a single instance out of a fleet of 20 instances behind a load balancer would likely have very little impact.

You can compare the answers to these questions against the cost of the trade-offs you have to make to mitigate the risk. When you consider this information in view of your risk threshold and your resilience objectives, it informs your decision on which failure modes you plan to actively mitigate.
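The likelihood and impact assessment can be sketched as a simple scoring function. This is a minimal illustration, not part of the framework itself; the 1–5 scales, score thresholds, and cost cutoff are all assumed values that you would tune to your own risk threshold and resilience objectives.

```python
# Hypothetical likelihood x impact scoring to suggest a risk response.
# Scales (1-5), thresholds, and the decision labels are illustrative
# assumptions, not prescribed by the resilience analysis framework.

def assess_risk(likelihood: int, impact: int, mitigation_cost: int) -> str:
    """Score a failure mode and suggest avoid/mitigate/transfer/accept."""
    score = likelihood * impact
    if score >= 15:
        return "mitigate"  # high risk: actively mitigate
    if score >= 8:
        # moderate risk: mitigate unless the trade-off cost is prohibitive
        return "mitigate" if mitigation_cost <= 3 else "transfer"
    return "accept"        # low risk: accept and monitor

# Example: a component on a single EC2 instance (disruption is likely,
# impact is high) versus a full Multi-AZ RDS outage (very unlikely).
print(assess_risk(likelihood=4, impact=4, mitigation_cost=2))  # mitigate
print(assess_risk(likelihood=1, impact=5, mitigation_cost=5))  # accept
```

The point of making the scoring explicit is that the decision becomes repeatable across user stories instead of being re-litigated for each failure mode.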

# Failure mode observability
<a name="observability"></a>

To mitigate a failure mode, you first have to detect that it is currently impacting, or is about to impact, your workload. A mitigation is effective only if there is a signal that an action has to be taken. This means that part of creating any mitigation includes, at the very least, verifying that you have or are building the observability that's necessary to detect the impact of the failure.

You should consider the observable symptoms of the failure mode in two dimensions:
+ What are the *leading indicators* that inform you that the system is approaching a condition where an impact might be seen soon?
+ What are the *lagging indicators* that can show the failure mode's impact as quickly as possible after it has occurred?

For example, for an excessive load failure on a database, connection count could serve as a leading indicator. A steady increase in connections warns you that the database might soon exceed its connection limit, so you can take action, such as terminating the least recently used connections, to reduce the count. The lagging indicator appears when the connection limit has been exceeded and database connection errors rise. In addition to collecting application and infrastructure metrics, consider gathering [key performance indicators (KPIs)](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_process_culture_establish_key_performance_indicators.html) to detect when failures impact your customer experience.
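The database connection example can be sketched as two simple indicator checks. This is an illustrative sketch only: the connection limit and the 80 percent warning threshold are assumed values, and in practice these checks would be CloudWatch alarms on metrics rather than inline code.

```python
# Hypothetical indicator checks for the database connection example.
# CONNECTION_LIMIT and the 80% warning threshold are assumptions.

CONNECTION_LIMIT = 100
LEADING_THRESHOLD = 0.8  # warn at 80% of the limit (assumed value)

def leading_indicator(connection_count: int) -> bool:
    """True while the database is approaching its connection limit."""
    return connection_count >= CONNECTION_LIMIT * LEADING_THRESHOLD

def lagging_indicator(connection_errors: int) -> bool:
    """True once connection errors show that the limit was exceeded."""
    return connection_errors > 0

# A steadily rising connection count trips the leading indicator first,
# leaving time to act (for example, close least recently used connections)
# before the lagging indicator ever fires.
print(leading_indicator(85))  # True: act before the limit is hit
print(lagging_indicator(0))   # False: no customer impact yet
```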

When possible, we recommend that you include both types of indicators in your observability strategy. In some cases, you might not be able to create leading indicators, but you should always plan to have a lagging indicator for each failure that you want to mitigate. To choose the right mitigation, you also should consider whether a leading or a lagging indicator detected the failure. For example, consider a sudden spike in traffic to your website. You would likely see only a lagging indicator. In this case, automatic scaling alone might not be the best mitigation because it takes time to deploy new resources, whereas throttling could prevent the overload almost immediately and give your application time to scale or reduce the load. Conversely, for a gradual increase in traffic, you would see a leading indicator. In this case, throttling wouldn't be appropriate because you have time to respond by automatically scaling your system.
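The sudden-spike scenario above can be sketched with a token bucket, a common way to implement the throttling mitigation. This is a minimal single-threaded sketch under assumed capacity and refill values; a production rate limiter would also need thread safety and per-client fairness.

```python
# A minimal token bucket sketch of throttling: excess requests are
# rejected immediately while slower automatic scaling catches up.
# Capacity and refill rate are illustrative assumptions.

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Admit a request if a token is available; otherwise throttle."""
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
# A burst of 8 requests at t=0: the first 5 pass, the rest are shed.
results = [bucket.allow(now=0.0) for _ in range(8)]
print(results.count(True))  # 5
```

Because the bucket sheds load immediately, the workload stays within capacity during the minutes it takes automatic scaling to bring new resources online.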

# Common mitigation strategies
<a name="mitigation-strategies"></a>

To start, think about *preventative* mitigations, which stop the failure mode from impacting the user story. Then think about *corrective* mitigations, which help the system self-heal or adapt to changing conditions. Here's a list of common mitigations for each failure category, aligned to the desired resilience properties.


| **Failure category** | **Desired resilience properties** | **Mitigations** | 
| --- |--- |--- |
| Single points of failure (SPOFs) | Redundancy and fault tolerance |   Implement [redundancy](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html), for example by using multiple EC2 instances behind Elastic Load Balancing (ELB).   Remove dependencies on the [AWS global service control plane](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/aws-service-types.html#global-services) and take dependencies only on global service data planes.   Use [graceful degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html) when a resource isn't available, so that your system is statically stable in the face of a single point of failure.   | 
| Excessive load | Sufficient capacity |   Key mitigation strategies are [rate limiting](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems), [load shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload) and work prioritization, [constant work](https://aws.amazon.com/builders-library/reliability-and-constant-work), [exponential backoff and retry with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) or not retrying at all, [putting the smaller service in control](https://aws.amazon.com/builders-library/avoiding-overload-in-distributed-systems-by-putting-the-smaller-service-in-control), [managing queue depth](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs), [automatic scaling](https://aws.amazon.com/autoscaling/), [avoiding cold caches](https://aws.amazon.com/builders-library/caching-challenges-and-strategies), and [circuit breakers](https://brooker.co.za/blog/2022/02/16/circuit-breakers.html).   You should also consider your capacity plan and think about future capacity and scaling limits, both related to AWS resources and limits within your system, that you might hit.   | 
| Excessive latency | Timely output |   Implement appropriately configured [timeouts](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) or adaptive timeouts (changing timeout values based on current and predicted latency conditions to potentially allow a slow dependency to make progress instead of giving up on slow requests).   Implement [exponential backoff and retry with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/), hedging, using technologies such as [multipath TCP](https://en.wikipedia.org/wiki/Multipath_TCP) when connecting to cloud services from on-premises environments and experiencing latency over specific routes, using [asynchronous interactions with loosely coupled systems](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_prevent_interaction_failure_loosely_coupled_system.html), [caching](https://aws.amazon.com/builders-library/caching-challenges-and-strategies), and [not throwing away work](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/).   | 
| Misconfiguration and bugs | Correct output |   The primary way to catch repeatable, functional errors in software is rigorous testing through mechanisms such as [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis), [unit tests](https://en.wikipedia.org/wiki/Unit_testing), [integration tests](https://en.wikipedia.org/wiki/Integration_testing), [regression tests](https://en.wikipedia.org/wiki/Regression_testing), [load tests](https://docs.aws.amazon.com/prescriptive-guidance/latest/load-testing/welcome.html), and [resilience testing](https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/).   Implement strategies such as [infrastructure as code (IaC)](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/infrastructure-as-code.html) and [continuous integration and continuous delivery (CI/CD) automation](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments) to help mitigate misconfiguration threats.   Use deployment techniques such as [one-box](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/), [canary deployments](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/canary-deployments.html), fractional deployments that are aligned to fault isolation boundaries, or [blue/green deployments](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/blue-green-deployments.html) to reduce misconfigurations and bugs.   | 
| Shared fate | Fault isolation |   Implement [fault tolerance](https://aws.amazon.com/builders-library/minimizing-correlated-failures-in-distributed-systems) in your system and use logical and physical fault isolation boundaries such as multiple compute or container clusters, multiple AWS accounts, multiple AWS Identity and Access Management (IAM) principals, multiple Availability Zones, and perhaps multiple AWS Regions.   Techniques such as [cell-based architectures](https://youtu.be/swQbA4zub20) and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding) can also improve fault isolation.   Consider patterns such as [loose coupling](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_prevent_interaction_failure_loosely_coupled_system.html) and [graceful degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html) to prevent cascading failure. When you prioritize user stories, you can also use that prioritization to distinguish between user stories that are essential to the primary business function and user stories that can be gracefully degraded. For example, in an e-commerce site, you wouldn't want an impairment of the promotions widget on the website to impact the ability to process new orders.   | 
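Exponential backoff and retry with jitter appears in both the excessive load and excessive latency rows above. A minimal sketch of the "full jitter" variant, with assumed base delay and cap values, looks like this:

```python
import random

# A sketch of exponential backoff with "full jitter": each retry waits a
# random amount up to an exponentially growing cap, so a fleet of clients
# spreads its retries out instead of retrying in synchronized waves.
# base and cap are illustrative assumptions.

def backoff_with_jitter(attempt: int, base: float = 0.1,
                        cap: float = 5.0) -> float:
    """Return a randomized sleep (seconds) before retry `attempt` (0-based)."""
    exp_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp_delay)  # full jitter

for attempt in range(4):
    delay = backoff_with_jitter(attempt)
    # In a real client you would time.sleep(delay) and retry the request.
    assert 0 <= delay <= min(5.0, 0.1 * 2 ** attempt)
```

The jitter matters as much as the exponential growth: without it, clients that failed together retry together, which can re-trigger the overload they are backing off from.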

Although some of these mitigations require minimal effort to implement, others (such as adopting a cell-based architecture for predictable fault isolation and minimal shared fate failures) could require a redesign of the entire workload and not just the components of a particular user story. As discussed earlier, it's important to weigh the likelihood and impact of the failure mode against the trade-offs that you make to mitigate it.
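Graceful degradation, listed in the shared fate row, is one of the lower-effort mitigations. A minimal sketch of the e-commerce example, with entirely hypothetical function and field names, shows the idea: a failing non-essential dependency is absorbed rather than propagated.

```python
# A sketch of graceful degradation for the e-commerce example: if the
# non-essential promotions widget fails, the page still renders and
# order processing is unaffected. All names here are hypothetical.

def fetch_promotions():
    # Stand-in for a call to a promotions service that is currently down.
    raise TimeoutError("promotions service unavailable")

def render_page():
    """Render the page, degrading gracefully if promotions fail."""
    try:
        promos = fetch_promotions()
    except Exception:
        promos = []  # degrade: show the page without promotions
    return {"order_form": "available", "promotions": promos}

page = render_page()
print(page["order_form"])  # the essential user story still works
```

The prioritization of user stories tells you where this pattern is acceptable: degrade the promotions widget, never the order form.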

In addition to mitigation techniques that apply to each failure mode category, you should think about mitigations that are required for the recovery of the user story or the entire system. For example, a failure might halt a workflow and prevent data from being written to intended destinations. In this case, you might need operational tooling to redrive the workflow or manually fix the data. You might also have to build a checkpointing mechanism into your workload to help prevent data loss when failures occur. Or you might have to build an andon cord to pause the workflow and stop accepting new work to prevent further harm. In these cases, you should think about the operational tools and guardrails you need.
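The checkpointing idea can be sketched as follows. This is an illustrative in-memory version under assumed names; a real workflow would persist the checkpoint to a durable store such as a database or object storage so that a redrive survives the failure of the worker itself.

```python
# A sketch of workflow checkpointing: the workflow records the last item
# it completed, so an operator redrive resumes from the checkpoint instead
# of losing or reprocessing work. The dict stands in for a durable store.

checkpoint_store = {"workflow-1": 0}  # durable storage in a real system

def run_workflow(items, fail_at=None):
    """Process items from the saved checkpoint; persist progress per item."""
    processed = []
    start = checkpoint_store["workflow-1"]
    for i in range(start, len(items)):
        if i == fail_at:
            raise RuntimeError("simulated failure")  # workflow halts here
        processed.append(items[i])
        checkpoint_store["workflow-1"] = i + 1  # checkpoint after each item
    return processed

items = ["a", "b", "c", "d"]
try:
    run_workflow(items, fail_at=2)  # fails after completing "a" and "b"
except RuntimeError:
    pass
print(run_workflow(items))  # ['c', 'd']: the redrive skips completed work
```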

Finally, you should always assume that humans are going to make mistakes as you develop your mitigation strategy. Although modern DevOps practices seek to automate operations, humans still have to interact with your workloads for various reasons. Incorrect human action can introduce a failure in any of the SEEMS categories, such as removing too many nodes during maintenance and causing an overload, or incorrectly setting a feature flag. These scenarios really represent a failure of preventative guardrails. A root cause analysis should never end with the conclusion that "a human made a mistake." Instead, it should address the reasons why the mistake was possible in the first place. Therefore, your mitigation strategy should consider how human operators interact with workload components and how safety guardrails can prevent or minimize the impact of operator mistakes.

# Continuous improvement
<a name="continuous-improvement"></a>

Resilience is a [continuous process](https://medium.com/the-cloud-architect/towards-continuous-resilience-3c7fbc5d232b). Over your system's lifecycle, the environment in which it operates will change. To ensure that your system remains resilient, you should integrate the framework into your periodic operational and architectural reviews. You might find new failure modes that you didn't identify the first time through, or there might be new mitigations, or mitigations that you hadn't previously considered, that you can put in place. Resilience analysis should be an iterative process, not a one-time exercise.

You should empirically test your mitigation strategies with processes such as [chaos engineering](https://aws.amazon.com/solutions/resilience/chaos-engineering/) or [game days](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.gameday.en.html) to validate that they work as expected. If you don't have a rigorous testing mechanism, you won't be confident that the mitigation will work as expected when you need it. During resilience analysis, you might determine that a failure mode is already handled by a specific mitigation, but it's important to test those assumptions as well. You should test for both existing mitigations and new mitigations that were created by using the resilience analysis framework.

You should also evaluate how well you performed the analysis through team retrospectives. Did everyone know what they were working on during the analysis? Did the number of failure modes you found through resilience analysis align with the team's expectations? Could you identify mitigations for all the failure modes you discovered? Did the team find the process useful? Do you believe it will lead to improvements in the resilience of your workload?

When real failure events happen that impact your workload's availability, record the specific failure mode, the components that were part of the failure, and the mitigation pattern that was used. Make this metadata searchable in your post-incident analysis tool so you can determine which failure modes and components to focus on in the future. Throughout this process, you can engage your AWS account team and solutions architects.