

# Common mitigation strategies


To start, use *preventative* mitigations, which stop a failure mode from impacting the user story. Then consider *corrective* mitigations, which help the system self-heal or adapt to changing conditions. The following table lists common mitigations for each failure category, aligned to the desired resilience properties.


| **Failure category** | **Desired resilience properties** | **Mitigations** |
| --- |--- |--- |
| Single points of failure (SPOFs) | Redundancy and fault tolerance | • Implement [redundancy](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html), for example by using multiple EC2 instances behind Elastic Load Balancing (ELB).<br>• Remove dependencies on the [AWS global service control plane](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/aws-service-types.html#global-services) and take dependencies only on global service data planes.<br>• Use [graceful degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html) when a resource isn't available, so your system is statically stable to a single point of failure. |
| Excessive load | Sufficient capacity | • Key mitigation strategies are [rate limiting](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems), [load shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload) and work prioritization, [constant work](https://aws.amazon.com/builders-library/reliability-and-constant-work), [exponential backoff and retry with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) (or not retrying at all), [putting the smaller service in control](https://aws.amazon.com/builders-library/avoiding-overload-in-distributed-systems-by-putting-the-smaller-service-in-control), [managing queue depth](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs), [automatic scaling](https://aws.amazon.com/autoscaling/), [avoiding cold caches](https://aws.amazon.com/builders-library/caching-challenges-and-strategies), and [circuit breakers](https://brooker.co.za/blog/2022/02/16/circuit-breakers.html).<br>• Also review your capacity plan, and think about the future capacity and scaling limits you might hit, both for AWS resources and within your own system. |
| Excessive latency | Timely output | • Implement appropriately configured [timeouts](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) or adaptive timeouts, which change timeout values based on current and predicted latency conditions to potentially allow a slow dependency to make progress instead of giving up on slow requests.<br>• Other mitigations include [exponential backoff and retry with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/), request hedging, technologies such as [multipath TCP](https://en.wikipedia.org/wiki/Multipath_TCP) when on-premises connections to cloud services experience latency over specific routes, [asynchronous interactions with loosely coupled systems](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_prevent_interaction_failure_loosely_coupled_system.html), [caching](https://aws.amazon.com/builders-library/caching-challenges-and-strategies), and [not throwing away work](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). |
| Misconfiguration and bugs | Correct output | • The primary way to catch repeatable, functional errors in software is rigorous testing through mechanisms such as [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis), [unit tests](https://en.wikipedia.org/wiki/Unit_testing), [integration tests](https://en.wikipedia.org/wiki/Integration_testing), [regression tests](https://en.wikipedia.org/wiki/Regression_testing), [load tests](https://docs.aws.amazon.com/prescriptive-guidance/latest/load-testing/welcome.html), and [resilience testing](https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/).<br>• Implement strategies such as [infrastructure as code (IaC)](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/infrastructure-as-code.html) and [continuous integration and continuous delivery (CI/CD) automation](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments) to help mitigate misconfiguration threats.<br>• Use deployment techniques such as [one-box](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/), [canary deployments](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/canary-deployments.html), fractional deployments that are aligned to fault isolation boundaries, or [blue/green deployments](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/blue-green-deployments.html) to reduce misconfigurations and bugs. |
| Shared fate | Fault isolation | • Implement [fault tolerance](https://aws.amazon.com/builders-library/minimizing-correlated-failures-in-distributed-systems) in your system, and use logical and physical fault isolation boundaries such as multiple compute or container clusters, multiple AWS accounts, multiple AWS Identity and Access Management (IAM) principals, multiple Availability Zones, and perhaps multiple AWS Regions.<br>• Techniques such as [cell-based architectures](https://youtu.be/swQbA4zub20) and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding) can also improve fault isolation.<br>• Consider patterns such as [loose coupling](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_prevent_interaction_failure_loosely_coupled_system.html) and [graceful degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html) to prevent cascading failures. When you prioritize user stories, use that prioritization to distinguish stories that are essential to the primary business function from stories that can be gracefully degraded. For example, on an e-commerce site, you wouldn't want an impairment of the promotions widget to impact the ability to process new orders. |
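
Several of the excessive-load and excessive-latency mitigations build on the same retry primitive. The sketch below (illustrative only, not AWS library code; the function name and default values are assumptions) shows exponential backoff with "full jitter": the backoff is capped, and the client sleeps a uniformly random fraction of it so retries from many clients don't synchronize into waves.

```python
import random
import time

def retry_with_full_jitter(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call operation(), retrying transient failures with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Cap the exponential backoff, then sleep for a uniformly random
            # fraction of it ("full jitter") so retries from many clients
            # spread out instead of arriving at the dependency simultaneously.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

In practice you would retry only errors you know to be transient, and pair retries with timeouts and a retry budget so that retries don't amplify an overload.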

Although some of these mitigations require minimal effort to implement, others (such as adopting a cell-based architecture for predictable fault isolation and minimal shared fate failures) could require a redesign of the entire workload and not just the components of a particular user story. As discussed earlier, it's important to weigh the likelihood and impact of the failure mode against the trade-offs that you make to mitigate it.

In addition to mitigation techniques that apply to each failure mode category, think about the mitigations that are required to recover the user story or the entire system. For example, a failure might halt a workflow and prevent data from being written to its intended destinations. In this case, you might need operational tooling to redrive the workflow or manually fix the data. You might also have to build a checkpointing mechanism into your workload to help prevent data loss when failures occur, or an andon cord mechanism that pauses the workflow and stops accepting new work to prevent further harm. In these cases, think about the operational tools and guardrails you need.
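The checkpointing idea can be sketched as follows. The file name and JSON layout here are illustrative assumptions, not a prescribed format: the workflow records the index of the next unprocessed item after each success, so a redrive resumes where the failure halted it instead of reprocessing or losing work.

```python
import json
import os

CHECKPOINT_FILE = "workflow.checkpoint"  # hypothetical path, for illustration

def load_checkpoint():
    # Resume from the last recorded position, or start from the beginning.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index):
    # Write to a temp file and rename atomically, so a crash mid-write
    # can't leave a corrupted checkpoint behind.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, CHECKPOINT_FILE)

def run_workflow(items, process):
    # On a redrive, items that were already processed are skipped.
    for i in range(load_checkpoint(), len(items)):
        process(items[i])
        save_checkpoint(i + 1)
```

A durable store such as a database or object store would replace the local file in a real workload; the resume-from-checkpoint logic stays the same.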

Finally, always assume that humans will make mistakes as you develop your mitigation strategy. Although modern DevOps practices seek to automate operations, humans still have to interact with your workloads for various reasons. Incorrect human action can introduce a failure in any of the SEEMS categories, such as removing too many nodes during maintenance and causing an overload, or setting a feature flag incorrectly. These scenarios really represent a failure of preventative guardrails. A root cause analysis should never end with the conclusion that "a human made a mistake." Instead, it should address the reasons why the mistake was possible in the first place. Your mitigation strategy should therefore consider how human operators interact with workload components and how to prevent or minimize the impact of operator mistakes through safety guardrails.
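As one hedged illustration of such a guardrail (the function name and thresholds are assumptions, not AWS guidance), an operations tool could refuse a maintenance action that would remove too much capacity at once, turning "don't remove too many nodes" from a runbook instruction into an enforced check:

```python
def safe_to_remove(current_nodes, nodes_to_remove, min_remaining=2, max_fraction=0.2):
    """Guardrail: reject a maintenance action that removes too much capacity at once.

    min_remaining and max_fraction are illustrative thresholds; tune them to
    your workload's capacity plan.
    """
    remaining = current_nodes - nodes_to_remove
    if remaining < min_remaining:
        return False  # would drop below the minimum fleet size
    if nodes_to_remove > max(1, int(current_nodes * max_fraction)):
        return False  # would remove more than the allowed fraction in one step
    return True
```

A tool built around this check would require an explicit, audited override for actions the guardrail rejects, rather than relying on the operator to notice the risk.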