The role of goals in chaos engineering adoption
It's common for the initial goals to emerge organically from grassroots chaos engineering efforts within an organization. Driven by a need to address their own recurring issues, individual teams or groups often explore chaos engineering practices without explicit approval or prioritization from higher levels.
Teams can use the results of these early efforts to build a compelling case for broader organizational adoption, effectively becoming a proving ground for other teams.
When the benefits of grassroots efforts become too significant to ignore, these teams can bring their efforts and knowledge to leadership's attention and set goals. This increased visibility can facilitate the adoption of organization-wide resilience objectives and lead to the support and resources that chaos engineering implementation requires.
Goals, particularly those driven by leadership and established in response to significant outages, play a crucial role in catalyzing the adoption of chaos engineering practices. Common types of goals include the following:
- Availability goals to identify and reduce single points of failure (SPOF)
- Service-recovery goals to improve the ability to recover from disruptions or failures
- User-experience goals to meet specific service-level objectives (SLOs)
- Metric-driven goals to track progress in mitigating known availability risks and implementing recommended resilience measures
- Regulatory and compliance goals to demonstrate operational resilience
For more information about some of these goal types and how Amazon and other organizations have used goals during chaos engineering adoption, see Appendix A.
These goals serve as a compelling justification and provide a targeted, actionable approach for driving chaos engineering adoption. In the beginning, goals serve as a proxy for traditional return on investment (ROI) metrics: they offer a clear rationale when quantifiable resilience ROI calculations are difficult to obtain. Without such goals early in the adoption, the chaos engineering practice risks failing to demonstrate its effectiveness and to gain broader organizational buy-in.