# Operational excellence
<a name="a-operational-excellence"></a>

The Operational Excellence pillar includes the ability to support development and run workloads effectively, gain insight into your operations, and to continuously improve supporting processes and procedures to deliver business value. You can find prescriptive guidance on implementation in the [Operational Excellence Pillar whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html). 

**Topics**
+ [Organization](a-organization.md)
+ [Prepare](a-prepare.md)
+ [Operate](a-operate.md)
+ [Evolve](a-evolve.md)

# Organization
<a name="a-organization"></a>

**Topics**
+ [OPS 1. How do you determine what your priorities are?](ops-01.md)
+ [OPS 2. How do you structure your organization to support your business outcomes?](ops-02.md)
+ [OPS 3. How does your organizational culture support your business outcomes?](ops-03.md)

# OPS 1. How do you determine what your priorities are?
<a name="ops-01"></a>

 Everyone should understand their part in enabling business success. Have shared goals in order to set priorities for resources. This will maximize the benefits of your efforts. 

**Topics**
+ [OPS01-BP01 Evaluate external customer needs](ops_priorities_ext_cust_needs.md)
+ [OPS01-BP02 Evaluate internal customer needs](ops_priorities_int_cust_needs.md)
+ [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md)
+ [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md)
+ [OPS01-BP05 Evaluate threat landscape](ops_priorities_eval_threat_landscape.md)
+ [OPS01-BP06 Evaluate tradeoffs](ops_priorities_eval_tradeoffs.md)
+ [OPS01-BP07 Manage benefits and risks](ops_priorities_manage_risk_benefit.md)

# OPS01-BP01 Evaluate external customer needs
<a name="ops_priorities_ext_cust_needs"></a>

 Involve key stakeholders, including business, development, and operations teams, to determine where to focus efforts on external customer needs. This will ensure that you have a thorough understanding of the operations support that is required to achieve your desired business outcomes. 

 **Common anti-patterns:** 
+  You have decided not to have customer support outside of core business hours, but you haven't reviewed historical support request data. You do not know whether this will have an impact on your customers. 
+  You are developing a new feature but have not engaged your customers to find out if it is desired, if desired in what form, and without experimentation to validate the need and method of delivery. 

 **Benefits of establishing this best practice:** Customers whose needs are satisfied are much more likely to remain customers. Evaluating and understanding external customer needs will inform how you prioritize your efforts to deliver business value. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Understand business needs: Business success is created by shared goals and understanding across stakeholders, including business, development, and operations teams. 
  +  Review business goals, needs, and priorities of external customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of external customers. This ensures that you have a thorough understanding of the operational support that is required to achieve business and customer outcomes. 
  +  Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support your shared business goals across internal and external customers. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Well-Architected Framework Concepts – Feedback loop](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.feedback-loop.en.html) 

# OPS01-BP02 Evaluate internal customer needs
<a name="ops_priorities_int_cust_needs"></a>

 Involve key stakeholders, including business, development, and operations teams, when determining where to focus efforts on internal customer needs. This will ensure that you have a thorough understanding of the operations support that is required to achieve business outcomes. 

 Use your established priorities to focus your improvement efforts where they will have the greatest impact (for example, developing team skills, improving workload performance, reducing costs, automating runbooks, or enhancing monitoring). Update your priorities as needs change. 

 **Common anti-patterns:** 
+  You have decided to change IP address allocations for your product teams, without consulting them, to make managing your network easier. You do not know the impact this will have on your product teams. 
+  You are implementing a new development tool but have not engaged your internal customers to find out if it is needed or if it is compatible with their existing practices. 
+  You are implementing a new monitoring system but have not contacted your internal customers to find out if they have monitoring or reporting needs that should be considered. 

 **Benefits of establishing this best practice:** Evaluating and understanding internal customer needs will inform how you prioritize your efforts to deliver business value. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Understand business needs: Business success is created by shared goals and understanding across stakeholders including business, development, and operations teams. 
  +  Review business goals, needs, and priorities of internal customers: Engage key stakeholders, including business, development, and operations teams, to discuss goals, needs, and priorities of internal customers. This ensures that you have a thorough understanding of the operational support that is required to achieve business and customer outcomes. 
  +  Establish shared understanding: Establish shared understanding of the business functions of the workload, the roles of each of the teams in operating the workload, and how these factors support shared business goals across internal and external customers. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Well-Architected Framework Concepts – Feedback loop](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.feedback-loop.en.html) 

# OPS01-BP03 Evaluate governance requirements
<a name="ops_priorities_governance_reqs"></a>

 Governance is the set of policies, rules, or frameworks that a company uses to achieve its business goals. Governance requirements are generated from within your organization. They can affect the types of technologies you choose or influence the way you operate your workload. Incorporate organizational governance requirements into your workload. Conformance is the ability to demonstrate that you have implemented governance requirements. 

 **Desired outcome:** 
+  Governance requirements are incorporated into the architectural design and operation of your workload. 
+  You can provide proof that you have followed governance requirements. 
+  Governance requirements are regularly reviewed and updated. 

 **Common anti-patterns:** 
+ Your organization mandates that the root account has multi-factor authentication. You failed to implement this requirement and the root account is compromised.
+ During the design of your workload, you choose an instance type that is not approved by the IT department. You are unable to launch your workload and must conduct a redesign.
+ You are required to have a disaster recovery plan. You did not create one and your workload suffers an extended outage.
+  Your team wants to use new instances but your governance requirements have not been updated to allow them. 

 **Benefits of establishing this best practice:** 
+  Following governance requirements aligns your workload with larger organization policies. 
+  Governance requirements reflect industry standards and best practices for your organization. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

Identify governance requirement by working with stakeholders and governance organizations. Include governance requirements into your workload. Be able to demonstrate proof that you’ve followed governance requirements.

 **Customer example** 

 At AnyCompany Retail, the cloud operations team works with stakeholders across the organization to develop governance requirements. For example, they prohibit SSH access into Amazon EC2 instances. If teams need system access, they are required to use AWS Systems Manager Session Manager. The cloud operations team regularly updates governance requirements as new services become available. 

 **Implementation steps** 

1.  Identify the stakeholders for your workload, including any centralized teams. 

1.  Work with stakeholders to identify governance requirements. 

1.  Once you’ve generated a list, prioritize the improvement items, and begin implementing them into your workload. 

   1.  Use services like [AWS Config](https://aws.amazon.com/blogs/industries/best-practices-for-aws-organizations-service-control-policies-in-a-multi-account-environment/) to create governance-as-code and validate that governance requirements are followed. 

   1.  If you use [AWS Organizations](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html), you can leverage Service Control Policies to implement governance requirements. 

1.  Provide documentation that validates the implementation. 

 **Level of effort for the implementation plan:** Medium. Implementing missing governance requirements may result in rework of your workload. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) - Compliance is like governance but comes from outside an organization. 

 **Related documents:** 
+ [AWS Management and Governance Cloud Environment Guide ](https://docs.aws.amazon.com/wellarchitected/latest/management-and-governance-guide/management-and-governance-cloud-environment-guide.html)
+ [ Best Practices for AWS Organizations Service Control Policies in a Multi-Account Environment ](https://aws.amazon.com/blogs/industries/best-practices-for-aws-organizations-service-control-policies-in-a-multi-account-environment/)
+ [ Governance in the AWS Cloud: The Right Balance Between Agility and Safety ](https://aws.amazon.com/blogs/apn/governance-in-the-aws-cloud-the-right-balance-between-agility-and-safety/)
+ [ What is Governance, Risk, And Compliance (GRC)? ](https://aws.amazon.com/what-is/grc/)

 **Related videos:** 
+ [AWS Management and Governance: Configuration, Compliance, and Audit - AWS Online Tech Talks ](https://www.youtube.com/watch?v=79ud1ZAaoj0)
+ [AWS re:Inforce 2019: Governance for the Cloud Age (DEM12-R1) ](https://www.youtube.com/watch?v=y3WmHnavuN8)
+ [AWS re:Invent 2020: Achieve compliance as code using AWS Config](https://www.youtube.com/watch?v=m8vTwvbzOfw)
+ [AWS re:Invent 2020: Agile governance on AWS GovCloud (US)](https://www.youtube.com/watch?v=hv6B17eriHQ)

 **Related examples:** 
+ [AWS Config Conformance Pack Samples ](https://docs.aws.amazon.com/config/latest/developerguide/conformancepack-sample-templates.html)

 **Related services:** 
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Organizations - Service Control Policies ](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html)

# OPS01-BP04 Evaluate compliance requirements
<a name="ops_priorities_compliance_reqs"></a>

Regulatory, industry, and internal compliance requirements are an important driver for defining your organization’s priorities. Your compliance framework may preclude you from using specific technologies or geographic locations. Apply due diligence if no external compliance frameworks are identified. Generate audits or reports that validate compliance.

 If you advertise that your product meets specific compliance standards, you must have an internal process for ensuring continuous compliance. Examples of compliance standards include PCI DSS, FedRAMP, and HIPAA. Applicable compliance standards are determined by various factors, such as what types of data the solution stores or transmits and which geographic regions the solution supports. 

 **Desired outcome:** 
+  Regulatory, industry, and internal compliance requirements are incorporated into architectural selection. 
+  You can validate compliance and generate audit reports. 

 **Common anti-patterns:** 
+ Parts of your workload fall under the Payment Card Industry Data Security Standard (PCI-DSS) framework but your workload stores credit cards data unencrypted.
+ Your software developers and architects are unaware of the compliance framework that your organization must adhere to.
+  The yearly Systems and Organizations Control (SOC2) Type II audit is happening soon and you are unable to verify that controls are in place. 

 **Benefits of establishing this best practice:** 
+  Evaluating and understanding the compliance requirements that apply to your workload will inform how you prioritize your efforts to deliver business value. 
+  You choose the right locations and technologies that are congruent with your compliance framework. 
+  Designing your workload for auditability helps you to prove you are adhering to your compliance framework. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing this best practice means that you incorporate compliance requirements into your architecture design process. Your team members are aware of the required compliance framework. You validate compliance in line with the framework. 

 **Customer example** 

 AnyCompany Retail stores credit card information for customers. Developers on the card storage team understand that they need to comply with the PCI-DSS framework. They’ve taken steps to verify that credit card information is stored and accessed securely in line with the PCI-DSS framework. Every year they work with their security team to validate compliance. 

 **Implementation steps** 

1.  Work with your security and governance teams to determine what industry, regulatory, or internal compliance frameworks that your workload must adhere to. Incorporate the compliance frameworks into your workload. 

   1.  Validate continual compliance of AWS resources with services like [AWS Compute Optimizer](https://docs.aws.amazon.com/compute-optimizer/latest/ug/what-is-compute-optimizer.html) and [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html). 

1.  Educate your team members on the compliance requirements so they can operate and evolve the workload in line with them. Compliance requirements should be included in architectural and technological choices. 

1.  Depending on the compliance framework, you may be required to generate an audit or compliance report. Work with your organization to automate this process as much as possible. 

   1.  Use services like [AWS Audit Manager](https://docs.aws.amazon.com/audit-manager/latest/userguide/what-is.html) to generate validate compliance and generate audit reports. 

   1.  You can download AWS security and compliance documents with [AWS Artifact](https://docs.aws.amazon.com/artifact/latest/ug/what-is-aws-artifact.html). 

 **Level of effort for the implementation plan:** Medium. Implementing compliance frameworks can be challenging. Generating audit reports or compliance documents adds additional complexity. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [SEC01-BP03 Identify and validate control objectives](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_securely_operate_control_objectives.html) - Security control objectives are an important part of overall compliance. 
+  [SEC01-BP06 Automate testing and validation of security controls in pipelines](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_securely_operate_test_validate_pipeline.html) - As part of your pipelines, validate security controls. You can also generate compliance documentation for new changes. 
+  [SEC07-BP02 Define data protection controls](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_data_classification_define_protection.html) - Many compliance frameworks have data handling and storage policies based. 
+  [SEC10-BP03 Prepare forensic capabilities](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_incident_response_prepare_forensic.html) - Forensic capabilities can sometimes be used in auditing compliance. 

 **Related documents:** 
+ [AWS Compliance Center ](https://aws.amazon.com/financial-services/security-compliance/compliance-center/)
+ [AWS Compliance Resources ](https://aws.amazon.com/compliance/resources/)
+ [AWS Risk and Compliance Whitepaper ](https://docs.aws.amazon.com/whitepapers/latest/aws-risk-and-compliance/welcome.html)
+ [AWS Shared Responsibility Model ](https://aws.amazon.com/compliance/shared-responsibility-model/)
+ [AWS services in scope by compliance programs ](https://aws.amazon.com/compliance/services-in-scope/)

 **Related videos:** 
+ [AWS re:Invent 2020: Achieve compliance as code using AWS Compute Optimizer](https://www.youtube.com/watch?v=m8vTwvbzOfw)
+ [AWS re:Invent 2021 - Cloud compliance, assurance, and auditing ](https://www.youtube.com/watch?v=pdrYGVgb08Y)
+ [AWS Summit ATL 2022 - Implementing compliance, assurance, and auditing on AWS (COP202) ](https://www.youtube.com/watch?v=i7XrWimhqew)

 **Related examples:** 
+ [ PCI DSS and AWS Foundational Security Best Practices on AWS](https://aws.amazon.com/solutions/partners/compliance-pci-fsbp-remediation/)

 **Related services:** 
+ [AWS Artifact](https://docs.aws.amazon.com/artifact/latest/ug/what-is-aws-artifact.html)
+ [AWS Audit Manager](https://docs.aws.amazon.com/audit-manager/latest/userguide/what-is.html)
+ [AWS Compute Optimizer](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html)
+ [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html)

# OPS01-BP05 Evaluate threat landscape
<a name="ops_priorities_eval_threat_landscape"></a>

 Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats) and maintain current information in a risk registry. Include the impact of risks when determining where to focus efforts. 

 The [Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) emphasizes learning, measuring, and improving. It provides a consistent approach for you to evaluate architectures, and implement designs that will scale over time. AWS provides the [AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/) to help you review your approach prior to development, the state of your workloads prior to production, and the state of your workloads in production. You can compare them to the latest AWS architectural best practices, monitor the overall status of your workloads, and gain insight to potential risks. 

 AWS customers are eligible for a guided Well-Architected Review of their mission-critical workloads to [measure their architectures](https://aws.amazon.com/premiumsupport/programs/) against AWS best practices. Enterprise Support customers are eligible for an [Operations Review](https://aws.amazon.com/premiumsupport/programs/), designed to help them to identify gaps in their approach to operating in the cloud. 

 The cross-team engagement of these reviews helps to establish common understanding of your workloads and how team roles contribute to success. The needs identified through the review can help shape your priorities. 

 [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/) is a tool that provides access to a core set of checks that recommend optimizations that may help shape your priorities. [Business and Enterprise Support customers](https://aws.amazon.com/premiumsupport/plans/) receive access to additional checks focusing on security, reliability, performance, and cost-optimization that can further help shape their priorities. 

 **Common anti-patterns:** 
+  You are using an old version of a software library in your product. You are unaware of security updates to the library for issues that may have unintended impact on your workload. 
+  Your competitor just released a version of their product that addresses many of your customers' complaints about your product. You have not prioritized addressing any of these known issues. 
+  Regulators have been pursuing companies like yours that are not compliant with legal regulatory compliance requirements. You have not prioritized addressing any of your outstanding compliance requirements. 

 **Benefits of establishing this best practice:** Identifying and understanding the threats to your organization and workload helps your determination of which threats to address, their priority, and the resources necessary to do so. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Evaluate threat landscape: Evaluate threats to the business (for example, competition, business risk and liabilities, operational risks, and information security threats), so that you can include their impact when determining where to focus efforts. 
  +  [AWS Latest Security Bulletins](https://aws.amazon.com/security/security-bulletins/) 
  +  [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/trustedadvisor/) 
  +  Maintain a threat model: Establish and maintain a threat model identifying potential threats, planned and in place mitigations, and their priority. Review the probability of threats manifesting as incidents, the cost to recover from those incidents and the expected harm caused, and the cost to prevent those incidents. Revise priorities as the contents of the threat model change. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Cloud Compliance](https://aws.amazon.com/compliance/) 
+  [AWS Latest Security Bulletins](https://aws.amazon.com/security/security-bulletins/) 
+  [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/trustedadvisor/) 

# OPS01-BP06 Evaluate tradeoffs
<a name="ops_priorities_eval_tradeoffs"></a>

 Evaluate the impact of tradeoffs between competing interests or alternative approaches, to help make informed decisions when determining where to focus efforts or choosing a course of action. For example, accelerating speed to market for new features may be emphasized over cost optimization, or you may choose a relational database for non-relational data to simplify the effort to migrate a system, rather than migrating to a database optimized for your data type and updating your application. 

 AWS can help you educate your teams about AWS and its services to increase their understanding of how their choices can have an impact on your workload. You should use the resources provided by [AWS Support](https://aws.amazon.com/premiumsupport/programs/) ([AWS Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/), [AWS Discussion Forums](https://forums.aws.amazon.com/index.jspa), and [AWS Support Center](https://console.aws.amazon.com/support/home/)) and [AWS Documentation](https://docs.aws.amazon.com/) to educate your teams. Reach out to AWS Support through AWS Support Center for help with your AWS questions. 

 AWS also shares best practices and patterns that we have learned through the operation of AWS in [The Amazon Builders' Library](https://aws.amazon.com/builders-library/). A wide variety of other useful information is available through the [AWS Blog](https://aws.amazon.com/blogs/) and [The Official AWS Podcast](https://aws.amazon.com/podcasts/aws-podcast/). 

 **Common anti-patterns:** 
+  You are using a relational database to manage time series and non-relational data. There are database options that are optimized to support the data types you are using but you are unaware of the benefits because you have not evaluated the tradeoffs between solutions. 
+  Your investors request that you demonstrate compliance with Payment Card Industry Data Security Standards (PCI DSS). You do not consider the tradeoffs between satisfying their request and continuing with your current development efforts. Instead you proceed with your development efforts without demonstrating compliance. Your investors stop their support of your company over concerns about the security of your platform and their investments. 

 **Benefits of establishing this best practice:** Understanding the implications and consequences of your choices helps you to prioritize your options. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Evaluate tradeoffs: Evaluate the impact of tradeoffs between competing interests, to help make informed decisions when determining where to focus efforts. For example, accelerating speed to market for new features might be emphasized over cost optimization. 
+  AWS can help you educate your teams about AWS and its services to increase their understanding of how their choices can have an impact on your workload. You should use the resources provided by AWS Support (AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center) and AWS Documentation to educate your teams. Reach out to AWS Support through AWS Support Center for help with your AWS questions. 
+  AWS also shares best practices and patterns that we have learned through the operation of AWS in The Amazon Builders' Library. A wide variety of other useful information is available through the AWS Blog and The Official AWS Podcast. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Blog](https://aws.amazon.com/blogs/) 
+  [AWS Cloud Compliance](https://aws.amazon.com/compliance/) 
+  [AWS Discussion Forums](https://forums.aws.amazon.com/index.jspa) 
+  [AWS Documentation](https://docs.aws.amazon.com/) 
+  [AWS Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/) 
+  [AWS Support](https://aws.amazon.com/premiumsupport/) 
+  [AWS Support Center](https://console.aws.amazon.com/support/home/) 
+  [The Amazon Builders' Library](https://aws.amazon.com/builders-library/) 
+  [The Official AWS Podcast](https://aws.amazon.com/podcasts/aws-podcast/) 

# OPS01-BP07 Manage benefits and risks
<a name="ops_priorities_manage_risk_benefit"></a>

 Manage benefits and risks to make informed decisions when determining where to focus efforts. For example, it may be beneficial to deploy a workload with unresolved issues so that significant new features can be made available to customers. It may be possible to mitigate associated risks, or it may become unacceptable to allow a risk to remain, in which case you will take action to address the risk. 

 You might find that you want to emphasize a small subset of your priorities at some point in time. Use a balanced approach over the long term to ensure the development of needed capabilities and management of risk. Update your priorities as needs change 

 **Common anti-patterns:** 
+  You have decided to include a library that does everything you need that one of your developers found on the internet. You have not evaluated the risks of adopting this library from an unknown source and do not know if it contains vulnerabilities or malicious code. 
+  You have decided to develop and deploy a new feature instead of fixing an existing issue. You have not evaluated the risks of leaving the issue in place until the feature is deployed and do not know what the impact will be on your customers. 
+  You have decided to not deploy a feature frequently requested by customers because of unspecified concerns from your compliance team. 

 **Benefits of establishing this best practice:** Identifying the available benefits of your choices, and being aware of the risks to your organization, helps you to make informed decisions. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Manage benefits and risks: Balance the benefits of decisions against the risks involved. 
  +  Identify benefits: Identify benefits based on business goals, needs, and priorities. Examples include time-to-market, security, reliability, performance, and cost. 
  +  Identify risks: Identify risks based on business goals, needs, and priorities. Examples include time-to-market, security, reliability, performance, and cost. 
  +  Assess benefits against risks and make informed decisions: Determine the impact of benefits and risks based on goals, needs, and priorities of your key stakeholders, including business, development, and operations. Evaluate the value of the benefit against the probability of the risk being realized and the cost of its impact. For example, emphasizing speed-to-market over reliability might provide competitive advantage. However, it may result in reduced uptime if there are reliability issues. 

# OPS 2. How do you structure your organization to support your business outcomes?
<a name="ops-02"></a>

 Your teams must understand their part in achieving business outcomes. Teams should understand their roles in the success of other teams, the role of other teams in their success, and have shared goals. Understanding responsibility, ownership, how decisions are made, and who has authority to make decisions will help focus efforts and maximize the benefits from your teams. 

**Topics**
+ [OPS02-BP01 Resources have identified owners](ops_ops_model_def_resource_owners.md)
+ [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md)
+ [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md)
+ [OPS02-BP04 Team members know what they are responsible for](ops_ops_model_know_my_job.md)
+ [OPS02-BP05 Mechanisms exist to identify responsibility and ownership](ops_ops_model_find_owner.md)
+ [OPS02-BP06 Mechanisms exist to request additions, changes, and exceptions](ops_ops_model_req_add_chg_exception.md)
+ [OPS02-BP07 Responsibilities between teams are predefined or negotiated](ops_ops_model_def_neg_team_agreements.md)

# OPS02-BP01 Resources have identified owners
<a name="ops_ops_model_def_resource_owners"></a>

Resources for your workload must have identified owners for change control, troubleshooting, and other functions. Owners are assigned for workloads, accounts, infrastructure, platforms, and applications. Ownership is recorded using tools like a central register or metadata attached to resources. The business value of components informs the processes and procedures applied to them.

 **Desired outcome:** 
+  Resources have identified owners using metadata or a central register. 
+  Team members can identify who owns resources. 
+  Accounts have a single owner where possible. 

 **Common anti-patterns:** 
+  The alternate contacts for your AWS accounts are not populated. 
+  Resources lack tags that identify what teams own them. 
+  You have an ITSM queue without an email mapping. 
+  Two teams have overlapping ownership of a critical piece of infrastructure. 

 **Benefits of establishing this best practice:** 
+  Change control for resources is straightforward with assigned ownership. 
+  You can involve the right owners when troubleshooting issues. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Define what ownership means for the resource use cases in your environment. Ownership can mean who oversees changes to the resource, supports the resource during troubleshooting, or who is financially accountable. Specify and record owners for resources, including name, contact information, organization, and team. 

 **Customer example** 

 AnyCompany Retail defines ownership as the team or individual that owns changes and support for resources. They leverage AWS Organizations to manage their AWS accounts. Alternate account contacts are configuring using group inboxes. Each ITSM queue maps to an email alias. Tags identify who own AWS resources. For other platforms and infrastructure, they have a wiki page that identifies ownership and contact information. 

 **Implementation steps** 

1.  Start by defining ownership for your organization. Ownership can imply who owns the risk for the resource, who owns changes to the resource, or who supports the resource when troubleshooting. Ownership could also imply financial or administrative ownership of the resource. 

1.  Use [AWS Organizations](https://aws.amazon.com/organizations/) to manage accounts. You can manage the alternate contacts for your accounts centrally. 

   1.  Using company owned email addresses and phone numbers for contact information helps you to access them even if the individuals whom they belong to are no longer with your organization. For example, create separate email distribution lists for billing, operations, and security and configure these as Billing, Security, and Operations contacts in each active AWS account. Multiple people will receive AWS notifications and be able to respond, even if someone is on vacation, changes roles, or leaves the company. 

   1.  If an account is not managed by [AWS Organizations](https://aws.amazon.com/organizations/), alternate account contacts help AWS get in contact with the appropriate personnel if needed. Configure the account’s alternate contacts to point to a group rather than an individual. 

1.  Use tags to identify owners for AWS resources. You can specify both owners and their contact information in separate tags. 

   1.  You can use [AWS Config](https://aws.amazon.com/config/) rules to enforce that resources have the required ownership tags. 

   1.  For in-depth guidance on how to build a tagging strategy for your organization, see [AWS Tagging Best Practices whitepaper](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/tagging-best-practices.html). 

1.  For other resources, platforms, and infrastructure, create documentation that identifies ownership. This should be accessible to all team members. 

 **Level of effort for the implementation plan:** Low. Leverage account contact information and tags to assign ownership of AWS resources. For other resources you can use something as simple as a table in a wiki to record ownership and contact information, or use an ITSM tool to map ownership. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) - The processes and procedures to support resources depends on resource ownership. 
+  [OPS02-BP04 Team members know what they are responsible for](ops_ops_model_know_my_job.md) - Team members should understand what resources they are owners of. 
+  [OPS02-BP05 Mechanisms exist to identify responsibility and ownership](ops_ops_model_find_owner.md) - Ownership needs to be discoverable using mechanisms like tags or account contacts. 

 **Related documents:** 
+ [AWS Account Management - Updating contact information ](https://docs.aws.amazon.com/accounts/latest/reference/manage-acct-update-contact.html#manage-acct-update-contact-alternate-edit.html)
+ [AWS Config Rules - required-tags ](https://docs.aws.amazon.com/config/latest/developerguide/required-tags.html)
+ [AWS Organizations - Updating alternative contacts in your organization ](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_accounts_update_contacts.html)
+  [AWS Tagging Best Practices whitepaper](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/tagging-best-practices.html) 

 **Related examples:** 
+ [AWS Config Rules - Amazon EC2 with required tags and valid values ](https://github.com/awslabs/aws-config-rules/blob/master/python/ec2_require_tags_with_valid_values.py)

 **Related services:** 
+  [AWS Config](https://aws.amazon.com/config/) 
+  [AWS Organizations](https://aws.amazon.com/organizations/) 

# OPS02-BP02 Processes and procedures have identified owners
<a name="ops_ops_model_def_proc_owners"></a>

 Understand who has ownership of the definition of individual processes and procedures, why those specific process and procedures are used, and why that ownership exists. Understanding the reasons that specific processes and procedures are used aids in identification of improvement opportunities. 

 **Benefits of establishing this best practice:** Understanding ownership identifies who can approve improvements, implement those improvements, or both. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Process and procedures have identified owners responsible for their definition: Capture the processes and procedures used in your environment and the individual or team responsible for their definition. 
  +  Identify process and procedures: Identify the operations activities conducted in support of your workloads. Document these activities in a discoverable location. 
  +  Define who owns the definition of a process or procedure: Uniquely identify the individual or team responsible for the specification of an activity. They are responsible to ensure it can be successfully performed by an adequately skilled team member with the correct permissions, access, and tools. If there are issues with performing that activity, the team members performing it are responsible to provide the detailed feedback necessary for the activitiy to be improved. 
  +  Capture ownership in the metadata of the activity artifact: Procedures automated in services like AWS Systems Manager, through documents, and AWS Lambda, as functions, support capturing metadata information as tags. Capture resource ownership using tags or resource groups, specifying ownership and contact information. Use AWS Organizations to create tagging polices and ensure ownership and contact information are captured. 

# OPS02-BP03 Operations activities have identified owners responsible for their performance
<a name="ops_ops_model_def_activity_owners"></a>

 Understand who has responsibility to perform specific activities on defined workloads and why that responsibility exists. Understanding who has responsibility to perform activities informs who will conduct the activity, validate the result, and provide feedback to the owner of the activity. 

 **Benefits of establishing this best practice:** Understanding who is responsible to perform an activity informs whom to notify when action is needed and who will perform the action, validate the result, and provide feedback to the owner of the activity. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Operations activities have identified owners responsible for their performance: Capture the responsibility for performing processes and procedures used in your environment 
  +  Identify process and procedures: Identify the operations activities conducted in support of your workloads. Document these activities in a discoverable location. 
  +  Define who is responsible to perform each activity: Identify the team responsible for an activity. Ensure they have the details of the activity, and the necessary skills and correct permissions, access, and tools to perform the activity. They must understand the condition under which it is to be performed (for example, on an event or schedule). Make this information discoverable so that members of your organization can identify who they need to contact, team or individual, for specific needs. 

# OPS02-BP04 Team members know what they are responsible for
<a name="ops_ops_model_know_my_job"></a>

 Understanding the responsibilities of your role and how you contribute to business outcomes informs the prioritization of your tasks and why your role is important. This helps team members to recognize needs and respond appropriately. 

 **Benefits of establishing this best practice:** Understanding your responsibilities informs the decisions you make, the actions you take, and your hand off activities to their proper owners. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Ensure team members understand their roles and responsibilities: Identify team members roles and responsibilities and ensure they understand the expectations of their role. Make this information discoverable so that members of your organization can identify who they need to contact, team or individual, for specific needs. 

# OPS02-BP05 Mechanisms exist to identify responsibility and ownership
<a name="ops_ops_model_find_owner"></a>

 Where no individual or team is identified, there are defined escalation paths to someone with the authority to assign ownership or plan for that need to be addressed. 

 **Benefits of establishing this best practice:** Understanding who has responsbility or ownership allows you to reach out to the proper team or team member to make a request or transition a task. Having an identified person who has the authority to assign responsbility or ownership or plan to address needs reduces the risk of inaction and needs not being addressed. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Mechanisms exist to identify responsibility and ownership: Provide accessible mechanisms for members of your organization to discover and identify ownership and responsibility. These mechanisms will help them to identify who to contact, team or individual, for specific needs. 

# OPS02-BP06 Mechanisms exist to request additions, changes, and exceptions
<a name="ops_ops_model_req_add_chg_exception"></a>

You can make requests to owners of processes, procedures, and resources. Requests include additions, changes, and exceptions. These requests go through a change management process. Make informed decisions to approve requests where viable and determined to be appropriate after an evaluation of benefits and risks. 

 **Desired outcome:** 
+  You can make requests to change processes, procedures, and resources based on assigned ownership. 
+  Changes are made in a deliberate manner, weighing benefits and risks. 

 **Common anti-patterns:** 
+  You must update the way you deploy your application, but there is no way to request a change to the deployment process from the operations team. 
+  The disaster recovery plan must be updated, but there is no identified owner to request changes to. 

 **Benefits of establishing this best practice:** 
+  Processes, procedures, and resources can evolve as requirements change. 
+  Owners can make informed decisions when to make changes. 
+  Changes are made in a deliberate manner. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 To implement this best practice, you need to be able to request changes to processes, procedures, and resources. The change management process can be lightweight. Document the change management process. 

 **Customer example** 

 AnyCompany Retail uses a responsibility assignment (RACI) matrix to identify who owns changes for processes, procedures, and resources. They have a documented change management process that’s lightweight and easy to follow. Using the RACI matrix and the process, anyone can submit change requests. 

 **Implementation steps** 

1.  Identify the processes, procedures, and resources for your workload and the owners for each. Document them in your knowledge management system. 

   1.  If you have not implemented [OPS02-BP01 Resources have identified owners](ops_ops_model_def_resource_owners.md), [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md), or [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md), start with those first. 

1.  Work with stakeholders in your organization to develop a change management process. The process should cover additions, changes, and exceptions for resources, processes, and procedures. 

   1.  You can use [AWS Systems Manager Change Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/change-manager.html) as a change management platform for workload resources. 

1.  Document the change management process in your knowledge management system. 

 **Level of effort for the implementation plan:** Medium. Developing a change management process requires alignment with multiple stakeholders across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP01 Resources have identified owners](ops_ops_model_def_resource_owners.md) - Resources need identified owners before you build a change management process. 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) - Processes need identified owners before you build a change management process. 
+  [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md) - Operations activities need identified owners before you build a change management process. 

 **Related documents:** 
+ [AWS Prescriptive Guidance - Foundation palybook for AWS large migrations: Creating RACI matrices ](https://docs.aws.amazon.com/prescriptive-guidance/latest/large-migration-foundation-playbook/team-org.html#raci)
+ [ Change Management in the Cloud Whitepaper ](https://docs.aws.amazon.com/whitepapers/latest/change-management-in-the-cloud/change-management-in-the-cloud.html)

 **Related services:** 
+ [AWS Systems Manager Change Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/change-manager.html)

# OPS02-BP07 Responsibilities between teams are predefined or negotiated
<a name="ops_ops_model_def_neg_team_agreements"></a>

Have defined or negotiated agreements between teams describing how they work with and support each other (for example, response times, service level objectives, or service-level agreements). Inter-team communications channels are documented. Understanding the impact of the teams’ work on business outcomes and the outcomes of other teams and organizations informs the prioritization of their tasks and helps them respond appropriately. 

 When responsibility and ownership are undefined or unknown, you are at risk of both not addressing necessary activities in a timely fashion and of redundant and potentially conflicting efforts emerging to address those needs. 

 **Desired outcome:** 
+  Inter-team working or support agreements are agreed to and documented. 
+  Teams that support or work with each other have defined communication channels and response expectations. 

 **Common anti-patterns:** 
+  An issue occurs in production and two separate teams start troubleshooting independent of each other. Their siloed efforts extend the outage. 
+  The operations team needs assistance from the development team but there is no agreed to response time. The request is stuck in the backlog. 

 **Benefits of establishing this best practice:** 
+  Teams know how to interact and support each other. 
+  Expectations for responsiveness are known. 
+  Communications channels are clearly defined. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing this best practice means that there is no ambiguity about how teams work with each other. Formal agreements codify how teams work together or support each other. Inter-team communication channels are documented. 

 **Customer example** 

 AnyCompany Retail’s SRE team has a service level agreement with their development team. Whenever the development team makes a request in their ticketing system, they can expect a response within fifteen minutes. If there is a site outage, the SRE team takes lead in the investigation with support from the development team. 

 **Implementation steps** 

1.  Working with stakeholders across your organization, develop agreements between teams based on processes and procedures. 

   1.  If a process or procedure is shared between two teams, develop a runbook on how the teams will work together. 

   1.  If there are dependencies between teams, agree to a response SLA for requests. 

1.  Document responsibilities in your knowledge management system. 

 **Level of effort for the implementation plan:** Medium. If there are no existing agreements between teams, it can take effort to come to agreement with stakeholders across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) - Process ownership must be identified before setting agreements between teams. 
+  [OPS02-BP03 Operations activities have identified owners responsible for their performance](ops_ops_model_def_activity_owners.md) - Operations activities ownership must be identified before setting agreements between teams. 

 **Related documents:** 
+ [AWS Executive Insights - Empowering Innovation with the Two-Pizza Team ](https://aws.amazon.com/executive-insights/content/amazon-two-pizza-team/)
+ [ Introduction to DevOps on AWS - Two-Pizza Teams ](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/two-pizza-teams.html)

# OPS 3. How does your organizational culture support your business outcomes?
<a name="ops-03"></a>

 Provide support for your team members so that they can be more effective in taking action and supporting your business outcome. 

**Topics**
+ [OPS03-BP01 Executive Sponsorship](ops_org_culture_executive_sponsor.md)
+ [OPS03-BP02 Team members are empowered to take action when outcomes are at risk](ops_org_culture_team_emp_take_action.md)
+ [OPS03-BP03 Escalation is encouraged](ops_org_culture_team_enc_escalation.md)
+ [OPS03-BP04 Communications are timely, clear, and actionable](ops_org_culture_effective_comms.md)
+ [OPS03-BP05 Experimentation is encouraged](ops_org_culture_team_enc_experiment.md)
+ [OPS03-BP06 Team members are encouraged to maintain and grow their skill sets](ops_org_culture_team_enc_learn.md)
+ [OPS03-BP07 Resource teams appropriately](ops_org_culture_team_res_appro.md)
+ [OPS03-BP08 Diverse opinions are encouraged and sought within and across teams](ops_org_culture_diverse_inc_access.md)

# OPS03-BP01 Executive Sponsorship
<a name="ops_org_culture_executive_sponsor"></a>

 Senior leadership clearly sets expectations for the organization and evaluates success. Senior leadership is the sponsor, advocate, and driver for the adoption of best practices and evolution of the organization 

 **Benefits of establishing this best practice:** Engaged leadership, clearly communicated expectations, and shared goals ensures that team members know what is expected of them. Evaluating success aids in identification of barriers to success so that they can be addressed through intervention by the sponsor advocate or their delegates. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Executive Sponsorship: Senior leadership clearly sets expectations for the organization and evaluates success. Senior leadership is the sponsor, advocate, and driver for the adoption of best practices and evolution of the organization 
  +  Set expectations: Define and publish goals for your organizations including how they will be measured. 
  +  Track achievement of goals: Measure the incremental achievement of goals regularly and share the results so that appropriate action can be taken if outcomes are at risk. 
  +  Provide the resources necessary to achieve your goals: Regularly review if resources are still appropriate, of if additional resources are needed based on: new information, changes to goals, responsibilities, or your business environment. 
  +  Advocate for your teams: Remain engaged with your teams so that you understand how they are doing and if there are external factors affecting them. When your teams are impacted by external factors, reevaluate goals and adjust targets as appropriate. Identify obstacles that are impeding your teams progress. Act on behalf of your teams to help address obstacles and remove unnecessary burdens. 
  +  Be a driver for adoption of best practices: Acknowledge best practices that provide quantifiable benefits and recognize the creators and adopters. Encourage further adoption to magnify the benefits achieved. 
  +  Be a driver for evolution of for your teams: Create a culture of continual improvement. Encourage both personal and organizational growth and development. Provide long term targets to strive for that will require incremental achievement over time. Adjust this vision to compliment your needs, business goals, and business environment as they change. 

# OPS03-BP02 Team members are empowered to take action when outcomes are at risk
<a name="ops_org_culture_team_emp_take_action"></a>

 The workload owner has defined guidance and scope empowering team members to respond when outcomes are at risk. Escalation mechanisms are used to get direction when events are outside of the defined scope. 

 **Benefits of establishing this best practice:** By testing and validating changes early, you are able to address issues with minimized costs and limit the impact on your customers. By testing prior to deployment you minimize the introduction of errors. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Team members are empowered to take action when outcomes are at risk: Provide your team members the permissions, tools, and opportunity to practice the skills necessary to respond effectively. 
  +  Give your team members opportunity to practice the skills necessary to respond: Provide alternative safe environments where processes and procedures can be tested and trained upon safely. Perform game days to allow team members to gain experience responding to real world incidents in simulated and safe environments. 
  +  Define and acknowledge team members' authority to take action: Specifically define team members authority to take action by assigning permissions and access to the workloads and components they support. Acknowledge that they are empowered to take action when outcomes are at risk. 

# OPS03-BP03 Escalation is encouraged
<a name="ops_org_culture_team_enc_escalation"></a>

 Team members have mechanisms and are encouraged to escalate concerns to decision makers and stakeholders if they believe outcomes are at risk. Escalation should be performed early and often so that risks can be identified, and prevented from causing incidents. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Encourage early and frequent escalation: Organizationally acknowledge that escalation early and often is the best practice. Organizationally acknowledge and accept that escalations may prove to be unfounded, and that it is better to have the opportunity to prevent an incident then to miss that opportunity by not escalating. 
  +  Have a mechanism for escalation: Have documented procedures defining when and how escalation should occur. Document the series of people with increasing authority to take action or approve action and their contact information. Escalation should continue until the team member is satisfied that they have handed off the risk to a person able to address it, or they have contacted the person who owns the risk and liability for the operation of the workload. It is that person who ultimately owns all decisions with respect to their workload. Escalations should include the nature of the risk, the criticality of the workload, who is impacted, what the impact is, and the urgency, that is, when is the impact expected. 
  +  Protect employees who escalate: Have policy that protects team members from retribution if they escalate around a non-responsive decision maker or stakeholder. Have mechanisms in place to identify if this is occurring and respond appropriately. 

# OPS03-BP04 Communications are timely, clear, and actionable
<a name="ops_org_culture_effective_comms"></a>

 Mechanisms exist and are used to provide timely notice to team members of known risks and planned events. Necessary context, details, and time (when possible) are provided to support determining if action is necessary, what action is required, and to take action in a timely manner. For example, providing notice of software vulnerabilities so that patching can be expedited, or providing notice of planned sales promotions so that a change freeze can be implemented to avoid the risk of service disruption. Planned events can be recorded in a change calendar or maintenance schedule so that team members can identify what activities are pending. 

 **Desired outcome:** 
+  Communications provide context, details, and time expectations. 
+  Team members have a clear understanding of when and how to act in response to communications. 
+  Leverage change calendars to provide notice of expected changes. 

 **Common anti-patterns:** 
+  An alert happens several times per week that is a false positive. You mute the notification each time it happens. 
+  You are asked to make a change to your security groups but are not given an expectation of when it should happen. 
+  You receive constant notifications in chat when systems scale up but no action is necessary. You avoid the chat channel and miss an important notification. 
+  A change is made to production without informing the operations team. The change creates an alert and the on-call team is activated. 

 **Benefits of establishing this best practice:** 
+  Your organization avoids alert fatigue. 
+  Team members can act with the necessary context and expectations. 
+  Changes can be made during change windows, reducing risk. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To implement this best practice, you must work with stakeholders across your organization to agree to communication standards. Publicize those standards to your organization. Identify and remove alerts that are false-positive or always on. Utilize change calendars so team members know when actions can be taken and what activities are pending. Verify that communications lead to clear actions with necessary context. 

 **Customer example** 

 AnyCompany Retail uses chat as their main communication medium. Alerts and other information populate specific channels. When someone must act, the desired outcome is clearly stated, and in many cases, they are given a runbook or playbook to use. They use a change calendar to schedule major changes to production systems. 

 **Implementation steps** 

1.  Analyze your alerts for false-positives or alerts that are constantly created. Remove or change them so that they start when human intervention is required. If an alert is initiated, provide a runbook or playbook. 

   1.  You can use [AWS Systems Manager Documents](https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-ssm-docs.html) to build playbooks and runbooks for alerts. 

1.  Mechanisms are in place to provide notification of risks or planned events in a clear and actionable way with enough notice to allow appropriate responses. Use email lists or chat channels to send notifications ahead of planned events. 

   1.  [Amazon Q Developer in chat applications](https://docs.aws.amazon.com/chatbot/latest/adminguide/what-is.html) can be used to send alerts and respond to events within your organizations messaging platform. 

1.  Provide an accessible source of information where planned events can be discovered. Provide notifications of planned events from the same system. 

   1.  [AWS Systems Manager Change Calendar](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-change-calendar.html) can be used to create change windows when changes can occur. This provides team members notice when they can make changes safely. 

1.  Monitor vulnerability notifications and patch information to understand vulnerabilities in the wild and potential risks associated to your workload components. Provide notification to team members so that they can act. 

   1.  You can subscribe to [AWS Security Bulletins](https://aws.amazon.com/security/security-bulletins/) to receive notifications of vulnerabilities on AWS. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md) - Make communications actionable by supplying a runbook when the outcome is known. 
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md) - In the case where the outcome is unknown, playbooks can make communications actionable. 

 **Related documents:** 
+ [AWS Security Bulletins ](https://aws.amazon.com/security/security-bulletins)
+ [ Open CVE ](https://www.opencve.io/welcome)

 **Related examples:** 
+ [ Well-Architected Labs: Inventory and Patch Management (Level 100) ](https://wellarchitectedlabs.com/operational-excellence/100_labs/100_inventory_patch_management/)

 **Related services:** 
+ [Amazon Q Developer in chat applications](https://docs.aws.amazon.com/chatbot/latest/adminguide/what-is.html)
+ [AWS Systems Manager Change Calendar](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-change-calendar.html)
+ [AWS Systems Manager Documents](https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-ssm-docs.html)

# OPS03-BP05 Experimentation is encouraged
<a name="ops_org_culture_team_enc_experiment"></a>

Experimentation is a catalyst for turning new ideas into products and features. It accelerates learning and keeps team members interested and engaged. Team members are encouraged to experiment often to drive innovation. Even when an undesired result occurs, there is value in knowing what not to do. Team members are not punished for successful experiments with undesired results. 

 **Desired outcome:** 
+  Your organization encourages experimentation to foster innovation. 
+  Experiments are used as an opportunity to learn. 

 **Common anti-patterns:** 
+  You want to run an A/B test but there is no mechanism to run the experiment. You deploy a UI change without the ability to test it. It results in a negative customer experience. 
+  Your company only has a stage and production environment. There is no sandbox environment to experiment with new features or products so you must experiment within the production environment. 

 **Benefits of establishing this best practice:** 
+  Experimentation drives innovation. 
+  You can react faster to feedback from users through experimentation. 
+  Your organization develops a culture of learning. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Experiments should be run in a safe manner. Leverage multiple environments to experiment without jeopardizing production resources. Use A/B testing and feature flags to test experiments. Provide team members the ability to conduct experiments in a sandbox environment. 

 **Customer example** 

 AnyCompany Retail encourages experimentation. Team members can use 20% of their work week to experiment or learn new technologies. They have a sandbox environment where they can innovate. A/B testing is used for new features to validate them with real user feedback. 

 **Implementation steps** 

1.  Work with leadership across your organization to support experimentation. Team members should be encouraged to conduct experiments in a safe manner. 

1.  Provide your team members with an environment where they can safely experiment. They must have access to an environment that is like production. 

   1.  You can use a separate AWS account to create a sandbox environment for experimentation. [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html) can be used to provision these accounts. 

1.  Use feature flags and A/B testing to experiment safely and gather user feedback. 

   1.  [AWS AppConfig Feature Flags](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) provides the ability to create feature flags. 

   1.  [Amazon CloudWatch Evidently](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Evidently.html) can be used to run A/B tests over a limited deployment. 

   1.  You can use [AWS Lambda versions](https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html) to deploy a new version of a function for beta testing. 

 **Level of effort for the implementation plan:** High. Providing team members with an environment to experiment in and a safe way to conduct experiments can require significant investment. You may also need to modify application code to use feature flags or support A/B testing. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) - Learning from incidents is an important driver for innovation along with experimentation. 
+  [OPS11-BP03 Implement feedback loops](ops_evolve_ops_feedback_loops.md) - Feedback loops are an important part of experimentation. 

 **Related documents:** 
+ [ An Inside Look at the Amazon Culture: Experimentation, Failure, and Customer Obsession ](https://aws.amazon.com/blogs/industries/an-inside-look-at-the-amazon-culture-experimentation-failure-and-customer-obsession/)
+ [ Best practices for creating and managing sandbox accounts in AWS](https://aws.amazon.com/blogs/mt/best-practices-creating-managing-sandbox-accounts-aws/)
+ [ Create a Culture of Experimentation Enabled by the Cloud ](https://aws.amazon.com/blogs/enterprise-strategy/create-a-culture-of-experimentation-enabled-by-the-cloud/)
+ [ Enabling experimentation and innovation in the cloud at SulAmérica Seguros ](https://aws.amazon.com/blogs/mt/enabling-experimentation-and-innovation-in-the-cloud-at-sulamerica-seguros/)
+ [ Experiment More, Fail Less ](https://aws.amazon.com/blogs/enterprise-strategy/experiment-more-fail-less/)
+ [ Organizing Your AWS Environment Using Multiple Accounts - Sandbox OU ](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/sandbox-ou.html)
+ [ Using AWS AppConfig Feature Flags ](https://aws.amazon.com/blogs/mt/using-aws-appconfig-feature-flags/)

 **Related videos:** 
+ [AWS On Air ft. Amazon CloudWatch Evidently \$1 AWS Events ](https://www.youtube.com/watch?v=ydX7lRNKAOo)
+ [AWS On Air San Fran Summit 2022 ft. AWS AppConfig Feature Flags integration with Jira ](https://www.youtube.com/watch?v=miAkZPtjqHg)
+ [AWS re:Invent 2022 - A deployment is not a release: Control your launches w/feature flags (BOA305-R) ](https://www.youtube.com/watch?v=uouw9QxVrE8)
+ [ Programmatically Create an AWS account with AWS Control Tower](https://www.youtube.com/watch?v=LxxQTPdSFgw)
+ [ Set Up a Multi-Account AWS Environment that Uses Best Practices for AWS Organizations](https://www.youtube.com/watch?v=uOrq8ZUuaAQ)

 **Related examples:** 
+ [AWS Innovation Sandbox ](https://aws.amazon.com/solutions/implementations/aws-innovation-sandbox/)
+ [ End-to-end Personalization 101 for E-Commerce ](https://catalog.workshops.aws/personalize-101-ecommerce/en-US/labs/ab-testing)

 **Related services:** 
+  [Amazon CloudWatch Evidently](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Evidently.html) 
+  [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) 
+  [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html) 

# OPS03-BP06 Team members are encouraged to maintain and grow their skill sets
<a name="ops_org_culture_team_enc_learn"></a>

 Teams must grow their skill sets to adopt new technologies, and to support changes in demand and responsibilities in support of your workloads. Growth of skills in new technologies is frequently a source of team member satisfaction and supports innovation. Support your team members’ pursuit and maintenance of industry certifications that validate and acknowledge their growing skills. Cross train to promote knowledge transfer and reduce the risk of significant impact when you lose skilled and experienced team members with institutional knowledge. Provide dedicated structured time for learning. 

 AWS provides resources, including the [AWS Getting Started Resource Center](https://aws.amazon.com/getting-started/), [AWS Blogs](https://aws.amazon.com/blogs/), [AWS Online Tech Talks](https://aws.amazon.com/getting-started/), [AWS Events and Webinars](https://aws.amazon.com/events/), and the [AWS Well-Architected Labs](https://wellarchitectedlabs.com/), that provide guidance, examples, and detailed walkthroughs to educate your teams. 

 AWS also shares best practices and patterns that we have learned through the operation of AWS in [The Amazon Builders' Library](https://aws.amazon.com/builders-library/) and a wide variety of other useful educational material through the [AWS Blog](https://aws.amazon.com/blogs/) and [The Official AWS Podcast](https://aws.amazon.com/podcasts/aws-podcast/). 

 You should take advantage of the education resources provided by AWS such as the Well-Architected labs, [AWS Support](https://aws.amazon.com/premiumsupport/programs/) ([AWS Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/), [AWS Discussion Forms](https://forums.aws.amazon.com/index.jspa), and [AWS Support Center](https://console.aws.amazon.com/support/home/)) and [AWS Documentation](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) to educate your teams. Reach out to AWS Support through AWS Support Center for help with your AWS questions. 

 [AWS Training and Certification](https://aws.amazon.com/training/) provides some free training through self-paced digital courses on AWS fundamentals. You can also register for instructor-led training to further support the development of your teams’ AWS skills. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Team members are encouraged to maintain and grow their skill sets: To adopt new technologies, support innovation, and to support changes in demand and responsibilities in support of your workloads continuing education is necessary. 
  +  Provide resources for education: Provided dedicated structured time, access to training materials, lab resources, and support participation in conferences and professional organizations that provide opportunities for learning from both educators and peers. Provide junior team members' access to senior team members as mentors or allow them to shadow their work and be exposed to their methods and skills. Encourage learning about content not directly related to work in order to have a broader perspective. 
  +  Team education and cross-team engagement: Plan for the continuing education needs of your team members. Provide opportunities for team members to join other teams (temporarily or permanently) to share skills and best practices benefiting your entire organization 
  +  Support pursuit and maintenance of industry certifications: Support your team members acquiring and maintaining industry certifications that validate what they have learned, and acknowledge their accomplishments. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Getting Started Resource Center](https://aws.amazon.com/getting-started/) 
+  [AWS Blogs](https://aws.amazon.com/blogs/) 
+  [AWS Cloud Compliance](https://aws.amazon.com/compliance/) 
+  [AWS Discussion Forms](https://forums.aws.amazon.com/index.jspa) 
+  [AWS Documentation](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+  [AWS Online Tech Talks](https://aws.amazon.com/getting-started/) 
+  [AWS Events and Webinars](https://aws.amazon.com/events/) 
+  [AWS Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/) 
+  [AWS Support](https://aws.amazon.com/premiumsupport/programs/) 
+  [AWS Training and Certification](https://aws.amazon.com/training/) 
+  [AWS Well-Architected Labs](https://wellarchitectedlabs.com/), 
+  [The Amazon Builders' Library](https://aws.amazon.com/builders-library/) 
+  [The Official AWS Podcast](https://aws.amazon.com/podcasts/aws-podcast/). 

# OPS03-BP07 Resource teams appropriately
<a name="ops_org_culture_team_res_appro"></a>

 Maintain team member capacity, and provide tools and resources to support your workload needs. Overtasking team members increases the risk of incidents resulting from human error. Investments in tools and resources (for example, providing automation for frequently performed activities) can scale the effectiveness of your team, helping them to support additional activities. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Resource teams appropriately: Ensure you have an understanding of the success of your teams and the factors that contribute to their success or lack of success. Act to support teams with appropriate resources. 
  +  Understand team performance: Measure the achievement of operational outcomes and the development of assets by your teams. Track changes in output and error rate over time. Engage with teams to understand the work related challenges that impact them (for example, increasing responsibilities, changes in technology, loss of personnel, or increase in customers supported). 
  +  Understand impacts on team performance: Remain engaged with your teams so that you understand how they are doing and if there are external factors affecting them. When your teams are impacted by external factors, reevaluate goals and adjust targets as appropriate. Identify obstacles that are impeding your teams progress. Act on behalf of your teams to help address obstacles and remove unnecessary burdens. 
  +  Provide the resources necessary for teams to be successful: Regularly review if resources are still appropriate, of if additional resources are needed, and make appropriate adjustments to support teams. 

# OPS03-BP08 Diverse opinions are encouraged and sought within and across teams
<a name="ops_org_culture_diverse_inc_access"></a>

 Leverage cross-organizational diversity to seek multiple unique perspectives. Use this perspective to increase innovation, challenge your assumptions, and reduce the risk of confirmation bias. Grow inclusion, diversity, and accessibility within your teams to gain beneficial perspectives. 

 Organizational culture has a direct impact on team member job satisfaction and retention. Foster the engagement and capabilities of your team members to create the success of your business. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Seek diverse opinions and perspectives: Encourage contributions from everyone. Give voice to under-represented groups. Rotate roles and responsibilities in meetings. 
  +  Expand roles and responsibilities: Provide opportunity for team members to take on roles that they might not otherwise. They will gain experience and perspective from the role, and from interactions with new team members with whom they might not otherwise interact. They will bring their experience and perspective to the new role and team members they interact with. As perspective increases, additional business opportunities may emerge, or new opportunities for improvement may be identified. Have members within a team take turns at common tasks that others typically perform to understand the demands and impact of performing them. 
  +  Provide a safe and welcoming environment: Have policy and controls that protect team members' mental and physical safety within your organization. Team members should be able to interact without fear of reprisal. When team members feel safe and welcome they are more likely to be engaged and productive. The more diverse your organization the better your understanding can be of the people you support including your customers. When your team members are comfortable, feel free to speak, and are confident they will be heard, they are more likely to share valuable insights (for example, marketing opportunities, accessibility needs, unserved market segments, unacknowledged risks in your environment). 
  +  Enable team members to participate fully: Provide the resources necessary for your employees to participate fully in all work related activities. Team members that face daily challenges have developed skills for working around them. These uniquely developed skills can provide significant benefit to your organization. Supporting team members with necessary accommodations will increase the benefits you can receive from their contributions. 

# Prepare
<a name="a-prepare"></a>

**Topics**
+ [OPS 4. How do you implement observability in your workload?](ops-04.md)
+ [OPS 5. How do you reduce defects, ease remediation, and improve flow into production?](ops-05.md)
+ [OPS 6. How do you mitigate deployment risks?](ops-06.md)
+ [OPS 7. How do you know that you are ready to support a workload?](ops-07.md)

# OPS 4. How do you implement observability in your workload?
<a name="ops-04"></a>

Implement observability in your workload so that you can understand its state and make data-driven decisions based on business requirements.

**Topics**
+ [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md)
+ [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md)
+ [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md)
+ [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md)
+ [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md)

# OPS04-BP01 Identify key performance indicators
<a name="ops_observability_identify_kpis"></a>

 Implementing observability in your workload starts with understanding its state and making data-driven decisions based on business requirements. One of the most effective ways to ensure alignment between monitoring activities and business objectives is by defining and monitoring key performance indicators (KPIs). 

 **Desired outcome:** Efficient observability practices that are tightly aligned with business objectives, ensuring that monitoring efforts are always in service of tangible business outcomes. 

 **Common anti-patterns:** 
+  Undefined KPIs: Working without clear KPIs can lead to monitoring too much or too little, missing vital signals. 
+  Static KPIs: Not revisiting or refining KPIs as the workload or business objectives evolve. 
+  Misalignment: Focusing on technical metrics that don’t correlate directly with business outcomes or are harder to correlate with real-world issues. 

 **Benefits of establishing this best practice:** 
+  Ease of issue identification: Business KPIs often surface issues more clearly than technical metrics. A dip in a business KPI can pinpoint a problem more effectively than sifting through numerous technical metrics. 
+  Business alignment: Ensures that monitoring activities directly support business objectives. 
+  Efficiency: Prioritize monitoring resources and attention on metrics that matter. 
+  Proactivity: Recognize and address issues before they have broader business implications. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To effectively define workload KPIs: 

1.  **Start with business outcomes:** Before diving into metrics, understand the desired business outcomes. Is it increased sales, higher user engagement, or faster response times? 

1.  **Correlate technical metrics with business objectives:** Not all technical metrics have a direct impact on business outcomes. Identify those that do, but it's often more straightforward to identify an issue using a business KPI. 

1.  **Use [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html):** Employ CloudWatch to define and monitor metrics that represent your KPIs. 

1.  **Regularly review and update KPIs:** As your workload and business evolve, keep your KPIs relevant. 

1.  **Involve stakeholders:** Involve both technical and business teams in defining and reviewing KPIs. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md)
+ [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md)
+ [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md)
+ [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md)

 **Related documents:** 
+ [AWS Observability Best Practices ](https://aws-observability.github.io/observability-best-practices/)
+ [ CloudWatch User Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [AWS Observability Skill Builder Course ](https://explore.skillbuilder.aws/learn/course/external/view/elearning/14688/aws-observability)

 **Related videos:** 
+ [ Developing an observability strategy ](https://www.youtube.com/watch?v=Ub3ATriFapQ)

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) 

# OPS04-BP02 Implement application telemetry
<a name="ops_observability_application_telemetry"></a>

 Application telemetry serves as the foundation for observability of your workload. It's crucial to emit telemetry that offers actionable insights into the state of your application and the achievement of both technical and business outcomes. From troubleshooting to measuring the impact of a new feature or ensuring alignment with business key performance indicators (KPIs), application telemetry informs the way you build, operate, and evolve your workload. 

 Metrics, logs, and traces form the three primary pillars of observability. These serve as diagnostic tools that describe the state of your application. Over time, they assist in creating baselines and identifying anomalies. However, to ensure alignment between monitoring activities and business objectives, it's pivotal to define and monitor KPIs. Business KPIs often make it easier to identify issues compared to technical metrics alone. 

 Other telemetry types, like real user monitoring (RUM) and synthetic transactions, complement these primary data sources. RUM offers insights into real-time user interactions, whereas synthetic transactions simulate potential user behaviors, helping detect bottlenecks before real users encounter them. 

 **Desired outcome:** Derive actionable insights into the performance of your workload. These insights allow you to make proactive decisions about performance optimization, achieve increased workload stability, streamline CI/CD processes, and utilize resources effectively. 

 **Common anti-patterns:** 
+  Incomplete observability: Neglecting to incorporate observability at every layer of the workload, resulting in blind spots that can obscure vital system performance and behavior insights. 
+  Fragmented data view: When data is scattered across multiple tools and systems, it becomes challenging to maintain a holistic view of your workload's health and performance. 
+  User-reported issues: A sign that proactive issue detection through telemetry and business KPI monitoring is lacking. 

 **Benefits of establishing this best practice:** 
+  Informed decision-making: With insights from telemetry and business KPIs, you can make data-driven decisions. 
+  Improved operational efficiency: Data-driven resource utilization leads to cost-effectiveness. 
+  Enhanced workload stability: Faster detection and resolution of issues leading to improved uptime. 
+  Streamlined CI/CD processes: Insights from telemetry data facilitate refinement of processes and reliable code delivery. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To implement application telemetry for your workload, use AWS services like [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) and [AWS X-Ray](https://aws.amazon.com/xray/). Amazon CloudWatch provides a comprehensive suite of monitoring tools, allowing you to observe your resources and applications in AWS and on-premises environments. It collects, tracks, and analyzes metrics, consolidates and monitors log data, and responds to changes in your resources, enhancing your understanding of how your workload operates. In tandem, AWS X-Ray lets you trace, analyze, and debug your applications, giving you a deep understanding of your workload's behavior. With features like service maps, latency distributions, and trace timelines, X-Ray provides insights into your workload's performance and the bottlenecks affecting it. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Identify what data to collect:** Ascertain the essential metrics, logs, and traces that would offer substantial insights into your workload's health, performance, and behavior. 

1.  **Deploy the [CloudWatch](https://aws.amazon.com/cloudwatch/) agent:** The CloudWatch agent is instrumental in procuring system and application metrics and logs from your workload and its underlying infrastructure. The CloudWatch agent can also be used to collect OpenTelemetry or X-Ray traces and send them to X-Ray. 

1.  **Define and monitor business KPIs:** Establish [custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) that align with your [business outcomes](https://aws-observability.github.io/observability-best-practices/guides/operational/business/monitoring-for-business-outcomes/). 

1.  **Instrument your application with AWS X-Ray:** In addition to deploying the CloudWatch agent, it's crucial to [instrument your application](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html) to emit trace data. This process can provide further insights into your workload's behavior and performance. 

1.  **Standardize data collection across your application:** Standardize data collection practices across your entire application. Uniformity aids in correlating and analyzing data, providing a comprehensive view of your application's behavior. 

1.  **Analyze and act on the data:** Once data collection and normalization are in place, use [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/features/) for metrics and logs analysis, and [AWS X-Ray](https://aws.amazon.com/xray/features/) for trace analysis. Such analysis can yield crucial insights into your workload's health, performance, and behavior, guiding your decision-making process. 

 **Level of effort for the implementation plan:** High 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 

 **Related documents:** 
+ [AWS Observability Best Practices ](https://aws-observability.github.io/observability-best-practices/)
+ [ CloudWatch User Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [AWS X-Ray Developer Guide ](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)
+ [ Instrumenting distributed systems for operational visibility ](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility)
+ [AWS Observability Skill Builder Course ](https://explore.skillbuilder.aws/learn/course/external/view/elearning/14688/aws-observability)
+ [ What's New with Amazon CloudWatch ](https://aws.amazon.com/about-aws/whats-new/management-and-governance/?whats-new-content.sort-by=item.additionalFields.postDateTime&whats-new-content.sort-order=desc&awsf.whats-new-products=general-products%23amazon-cloudwatch)
+ [ What's New with AWS X-Ray](https://aws.amazon.com/about-aws/whats-new/developer-tools/?whats-new-content.sort-by=item.additionalFields.postDateTime&whats-new-content.sort-order=desc&awsf.whats-new-products=general-products%23aws-x-ray)

 **Related videos:** 
+ [AWS re:Invent 2022 - Observability best practices at Amazon ](https://youtu.be/zZPzXEBW4P8)
+ [AWS re:Invent 2022 - Developing an observability strategy ](https://youtu.be/Ub3ATriFapQ)

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) 
+ [AWS Solutions Library: Application Monitoring with Amazon CloudWatch ](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch)

# OPS04-BP03 Implement user experience telemetry
<a name="ops_observability_customer_telemetry"></a>

 Gaining deep insights into customer experiences and interactions with your application is crucial. Real user monitoring (RUM) and synthetic transactions serve as powerful tools for this purpose. RUM provides data about real user interactions granting an unfiltered perspective of user satisfaction, while synthetic transactions simulate user interactions, helping in detecting potential issues even before they impact real users. 

 **Desired outcome:** A holistic view of the customer experience, proactive detection of issues, and optimization of user interactions to deliver seamless digital experiences. 

 **Common anti-patterns:** 
+  Applications without real user monitoring (RUM): 
  +  Delayed issue detection: Without RUM, you might not become aware of performance bottlenecks or issues until users complain. This reactive approach can lead to customer dissatisfaction. 
  +  Lack of user experience insights: Not using RUM means you lose out on crucial data that shows how real users interact with your application, limiting your ability to optimize the user experience. 
+  Applications without synthetic transactions: 
  +  Missed edge cases: Synthetic transactions help you test paths and functions that might not be frequently used by typical users but are critical to certain business functions. Without them, these paths could malfunction and go unnoticed. 
  +  Checking for issues when the application is not being used: Regular synthetic testing can simulate times when real users aren't actively interacting with your application, ensuring the system always functions correctly. 

 **Benefits of establishing this best practice:** 
+  Proactive issue detection: Identify and address potential issues before they impact real users. 
+  Optimized user experience: Continuous feedback from RUM aids in refining and enhancing the overall user experience. 
+  Insights on device and browser performance: Understand how your application performs across various devices and browsers, enabling further optimization. 
+  Validated business workflows: Regular synthetic transactions ensure that core functionalities and critical paths remain operational and efficient. 
+  Enhanced application performance: Leverage insights gathered from real user data to improve application responsiveness and reliability. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To leverage RUM and synthetic transactions for user activity telemetry, AWS offers services like [Amazon CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) and [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html). Metrics, logs, and traces, coupled with user activity data, provide a comprehensive view of both the application's operational state and the user experience. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Deploy Amazon CloudWatch RUM:** Integrate your application with CloudWatch RUM to collect, analyze, and present real user data. 

   1.  Use the [CloudWatch RUM JavaScript library](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) to integrate RUM with your application. 

   1.  Set up dashboards to visualize and monitor real user data. 

1.  **Configure CloudWatch Synthetics:** Create canaries, or scripted routines, that simulate user interactions with your application. 

   1.  Define critical application workflows and paths. 

   1.  Design canaries using [CloudWatch Synthetics scripts](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to simulate user interactions for these paths. 

   1.  Schedule and monitor canaries to run at specified intervals, ensuring consistent performance checks. 

1.  **Analyze and act on data:** Utilize data from RUM and synthetic transactions to gain insights and take corrective measures when anomalies are detected. Use CloudWatch dashboards and alarms to stay informed. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 

 **Related documents:** 
+ [ Amazon CloudWatch RUM Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html)
+ [ Amazon CloudWatch Synthetics Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)

 **Related videos:** 
+ [ Optimize applications through end user insights with Amazon CloudWatch RUM ](https://www.youtube.com/watch?v=NMaeujY9A9Y)
+ [AWS on Air ft. Real-User Monitoring for Amazon CloudWatch ](https://www.youtube.com/watch?v=r6wFtozsiVE)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Git Repository for Amazon CloudWatch RUM Web Client ](https://github.com/aws-observability/aws-rum-web)
+ [ Using Amazon CloudWatch Synthetics to measure page load time ](https://github.com/aws-samples/amazon-cloudwatch-synthetics-page-performance)

# OPS04-BP04 Implement dependency telemetry
<a name="ops_observability_dependency_telemetry"></a>

 Dependency telemetry is essential for monitoring the health and performance of the external services and components your workload relies on. It provides valuable insights into reachability, timeouts, and other critical events related to dependencies such as DNS, databases, or third-party APIs. By instrumenting your application to emit metrics, logs and traces about these dependencies, you gain a clearer understanding of potential bottlenecks, performance issues, or failures that might impact your workload. 

 **Desired outcome:** The dependencies your workload relies on are performing as expected, allowing you to proactively address issues and ensure optimal workload performance. 

 **Common anti-patterns:** 
+  Overlooking external dependencies: Focusing only on internal application metrics while neglecting metrics related to external dependencies. 
+  Lack of proactive monitoring: Waiting for issues to arise instead of continuously monitoring dependency health and performance. 
+  Siloed monitoring: Using multiple, disparate monitoring tools which can result in fragmented and inconsistent views of dependency health. 

 **Benefits of establishing this best practice:** 
+  Improved workload reliability: By ensuring that external dependencies are consistently available and performing optimally. 
+  Faster issue detection and resolution: Proactively identifying and addressing issues with dependencies before they impact the workload. 
+  Comprehensive view: Gaining a holistic view of both internal and external components that influence workload health. 
+  Enhanced workload scalability: By understanding the scalability limits and performance characteristics of external dependencies. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement dependency telemetry by starting with identifying the services, infrastructure, and processes that your workload depends on. Quantify what good conditions look like when those dependencies are functioning as expected, and then determine what data is needed to measure those. With that information, you can craft dashboards and alerts that provide insights to your operations teams on the state of those dependencies. Use AWS tools to discover and quantify the impacts when dependencies cannot deliver as needed. Continually revisit your strategy to account for changes in priorities, goals, and gained insights. 

### Implementation steps
<a name="implementation-steps"></a>

 To implement dependency telemetry effectively: 

1.  **Identify external dependencies:** Collaborate with stakeholders to pinpoint the external dependencies your workload relies on. External dependencies can encompass services like external databases, third-party APIs, network connectivity routes to other environments, and DNS services. The first step towards effective dependency telemetry is being comprehensive in understanding what those dependencies are. 

1.  **Develop a monitoring strategy:** Once you have a clear picture of your external dependencies, architect a monitoring strategy tailored to them. This involves understanding the criticality of each dependency, its expected behavior, and any associated service-level agreements or targets (SLA or SLTs). Set up proactive alerts to notify you of status changes or performance deviations. 

1.  **Leverage [Amazon CloudWatch Internet Monitor](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html):** It offers insights into the global internet, helping to understand outages or disruptions that might impact your external dependencies. 

1.  **Stay informed with [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/):** It provides alerts and remediation guidance when AWS is experiencing events that may impact your services. 

1.  **Instrument your application with [AWS X-Ray](https://aws.amazon.com/xray/):** AWS X-Ray provides insights into how applications and their underlying dependencies are performing. By tracing requests from start to end, you can identify bottlenecks or failures in the external services or components your application relies on. 

1.  **Use [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/):** This machine learning-driven service identifies operational issues, predicts when critical issues might occur, and recommends specific actions to take. It's invaluable for gaining insights into dependencies and determining that they're not the source of operational problems. 

1.  **Monitor regularly:** Continually monitor metrics and logs related to external dependencies. Set up alerts for unexpected behavior or degraded performance. 

1.  **Validate after changes:** Whenever there's an update or change in any of the external dependencies, validate their performance and check their alignment with your application's requirements. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 

 **Related documents:** 
+ [ What is AWS Health? ](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html)
+ [ Using Amazon CloudWatch Internet Monitor ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html)
+ [AWS X-Ray Developer Guide](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)
+ [ Amazon DevOps Guru User Guide ](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html)

 **Related videos:** 
+ [ Visibility into how internet issues impact app performance ](https://www.youtube.com/watch?v=Kuc_SG_aBgQ)
+ [ Introduction to Amazon DevOps Guru ](https://www.youtube.com/watch?v=2uA8q-8mTZY)

 **Related examples:** 
+ [ Gaining operational insights with AIOps using Amazon DevOps Guru ](https://catalog.us-east-1.prod.workshops.aws/workshops/f92df379-6add-4101-8b4b-38b788e1222b/en-US)
+ [AWS Health Aware ](https://github.com/aws-samples/aws-health-aware/)

# OPS04-BP05 Implement distributed tracing
<a name="ops_observability_dist_trace"></a>

 Distributed tracing offers a way to monitor and visualize requests as they traverse through various components of a distributed system. By capturing trace data from multiple sources and analyzing it in a unified view, teams can better understand how requests flow, where bottlenecks exist, and where optimization efforts should focus. 

 **Desired outcome:** Achieve a holistic view of requests flowing through your distributed system, allowing for precise debugging, optimized performance, and improved user experiences. 

 **Common anti-patterns:** 
+  Inconsistent instrumentation: Not all services in a distributed system are instrumented for tracing. 
+  Ignoring latency: Only focusing on errors and not considering the latency or gradual performance degradations. 

 **Benefits of establishing this best practice:** 
+ Comprehensive system overview: Visualizing the entire path of requests, from entry to exit.
+  Enhanced debugging: Quickly identifying where failures or performance issues occur. 
+  Improved user experience: Monitoring and optimizing based on actual user data, ensuring the system meets real-world demands. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Begin by identifying all of the elements of your workload that require instrumentation. Once all components are accounted for, leverage tools such as AWS X-Ray and OpenTelemetry to gather trace data for analysis with tools like X-Ray and Amazon CloudWatch ServiceLens Map. Engage in regular reviews with developers, and supplement these discussions with tools like Amazon DevOps Guru, X-Ray Analytics and X-Ray Insights to help uncover deeper findings. Establish alerts from trace data to notify when outcomes, as defined in the workload monitoring plan, are at risk. 

### Implementation steps
<a name="implementation-steps"></a>

 To implement distributed tracing effectively: 

1.  **Adopt [AWS X-Ray](https://aws.amazon.com/xray/):** Integrate X-Ray into your application to gain insights into its behavior, understand its performance, and pinpoint bottlenecks. Utilize X-Ray Insights for automatic trace analysis. 

1.  **Instrument your services:** Verify that every service, from an [AWS Lambda](https://aws.amazon.com/lambda/) function to an [EC2 instance](https://aws.amazon.com/ec2/), sends trace data. The more services you instrument, the clearer the end-to-end view. 

1.  **Incorporate [CloudWatch Real User Monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) and [synthetic monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html):** Integrate Real User Monitoring (RUM) and synthetic monitoring with X-Ray. This allows for capturing real-world user experiences and simulating user interactions to identify potential issues. 

1.  **Use the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html):** The agent can send traces from either X-Ray or OpenTelemetry, enhancing the depth of insights obtained. 

1.  **Use [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/):** DevOps Guru uses data from X-Ray, CloudWatch, AWS Config, and AWS CloudTrail to provide actionable recommendations. 

1.  **Analyze traces:** Regularly review the trace data to discern patterns, anomalies, or bottlenecks that might impact your application's performance. 

1.  **Set up alerts:** Configure alarms in [CloudWatch](https://aws.amazon.com/cloudwatch/) for unusual patterns or extended latencies, allowing proactive issue addressing. 

1.  **Continuous improvement:** Revisit your tracing strategy as services are added or modified to capture all relevant data points. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 

 **Related documents:** 
+ [AWS X-Ray Developer Guide ](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)
+ [ Amazon CloudWatch agent User Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html)
+ [ Amazon DevOps Guru User Guide ](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html)

 **Related videos:** 
+ [ Use AWS X-Ray Insights ](https://www.youtube.com/watch?v=tl8OWHl6jxw)
+ [AWS on Air ft. Observability: Amazon CloudWatch and AWS X-Ray](https://www.youtube.com/watch?v=qBDBnPkZ-KI)

 **Related examples:** 
+ [ Instrumenting your Application with AWS X-Ray](https://aws.amazon.com/getting-started/hands-on/distributed-tracing-with-xray/)

# OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
<a name="ops-05"></a>

 Adopt approaches that improve flow of changes into production, that activate refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and achieve rapid identification and remediation of issues introduced through deployment activities. 

**Topics**
+ [OPS05-BP01 Use version control](ops_dev_integ_version_control.md)
+ [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md)
+ [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md)
+ [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md)
+ [OPS05-BP05 Perform patch management](ops_dev_integ_patch_mgmt.md)
+ [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md)
+ [OPS05-BP07 Implement practices to improve code quality](ops_dev_integ_code_quality.md)
+ [OPS05-BP08 Use multiple environments](ops_dev_integ_multi_env.md)
+ [OPS05-BP09 Make frequent, small, reversible changes](ops_dev_integ_freq_sm_rev_chg.md)
+ [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md)

# OPS05-BP01 Use version control
<a name="ops_dev_integ_version_control"></a>

 Use version control to activate tracking of changes and releases. 

 Many AWS services offer version control capabilities. Use a revision or source control system such as [AWS CodeCommit](https://aws.amazon.com/codecommit/) to manage code and other artifacts, such as version-controlled [AWS CloudFormation](https://aws.amazon.com/cloudformation/) templates of your infrastructure. 

 **Desired outcome:** Your teams collaborate on code. When merged, the code is consistent and no changes are lost. Errors are easily reverted through correct versioning. 

 **Common anti-patterns:** 
+  You have been developing and storing your code on your workstation. You have had an unrecoverable storage failure on the workstation and your code is lost. 
+  After overwriting the existing code with your changes, you restart your application and it is no longer operable. You are unable to revert the change. 
+  You have a write lock on a report file that someone else needs to edit. They contact you asking that you stop work on it so that they can complete their tasks. 
+  Your research team has been working on a detailed analysis that shapes your future work. Someone has accidentally saved their shopping list over the final report. You are unable to revert the change and have to recreate the report. 

 **Benefits of establishing this best practice:** By using version control capabilities you can easily revert to known good states and previous versions, and limit the risk of assets being lost. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Maintain assets in version controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure). Integrate the version control capabilities of your configuration management systems into your procedures. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 

 **Related documents:** 
+  [What is AWS CodeCommit?](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Introduction to AWS CodeCommit](https://youtu.be/46PRLMW8otg) 

# OPS05-BP02 Test and validate changes
<a name="ops_dev_integ_test_val_chg"></a>

 Every change deployed must be tested to avoid errors in production. This best practice is focused on testing changes from version control to artifact build. Besides application code changes, testing should include infrastructure, configuration, security controls, and operations procedures. Testing takes many forms, from unit tests to software component analysis (SCA). Move tests further to the left in the software integration and delivery process results in higher certainty of artifact quality. 

 Your organization must develop testing standards for all software artifacts. Automated tests reduce toil and avoid manual test errors. Manual tests may be necessary in some cases. Developers must have access to automated test results to create feedback loops that improve software quality. 

 **Desired outcome:** Your software changes are tested before they are delivered. Developers have access to test results and validations. Your organization has a testing standard that applies to all software changes. 

 **Common anti-patterns:** 
+ You deploy a new software change without any tests. It fails to run in production, which leads to an outage.
+ New security groups are deployed with CloudFormation without being tested in a pre-production environment. The security groups make your app unreachable for your customers.
+ A method is modified but there are no unit tests. The software fails when it is deployed to production.

 **Benefits of establishing this best practice:** Change fail rate of software deployments are reduced. Software quality is improved. Developers have increased awareness on the viability of their code. Security policies can be rolled out with confidence to support organization's compliance. Infrastructure changes such as automatic scaling policy updates are tested in advance to meet traffic needs. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Testing is done on all changes, from application code to infrastructure, as part of your continuous integration practice. Test results are published so that developers have fast feedback. Your organization has a testing standard that all changes must pass. 

 **Customer example** 

 As part of their continuous integration pipeline, AnyCompany Retail conducts several types of tests on all software artifacts. They practice test driven development so all software has unit tests. Once the artifact is built, they run end-to-end tests. After this first round of tests is complete, they run a static application security scan, which looks for known vulnerabilities. Developers receive messages as each testing gate is passed. Once all tests are complete, the software artifact is stored in an artifact repository. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Work with stakeholders in your organization to develop a testing standard for software artifacts. What standard tests should all artifacts pass? Are there compliance or governance requirements that must be included in the test coverage? Do you need to conduct code quality tests? When tests complete, who needs to know? 

   1.  The [AWS Deployment Pipeline Reference Architecture](https://pipelines.devops.aws.dev/) contains an authoritative list of types of tests that can be conducted on software artifacts as part of an integration pipeline. 

1.  Instrument your application with the necessary tests based on your software testing standard. Each set of tests should complete in under ten minutes. Tests should run as part of an integration pipeline. 

   1.  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) can test your application code for defects. 

   1.  You can use [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) to conduct tests on software artifacts. 

   1.  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) can orchestrate your software tests into a pipeline. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP01 Use version control](ops_dev_integ_version_control.md) 
+  [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md) 
+  [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md) 

 **Related documents:** 
+ [ Adopt a test-driven development approach ](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html)
+ [ Automated CloudFormation Testing Pipeline with TaskCat and CodePipeline ](https://aws.amazon.com/blogs/devops/automated-cloudformation-testing-pipeline-with-taskcat-and-codepipeline/)
+ [ Building end-to-end AWS DevSecOps CI/CD pipeline with open source SCA, SAST, and DAST tools ](https://aws.amazon.com/blogs/devops/building-end-to-end-aws-devsecops-ci-cd-pipeline-with-open-source-sca-sast-and-dast-tools/)
+ [ Getting started with testing serverless applications ](https://aws.amazon.com/blogs/compute/getting-started-with-testing-serverless-applications/)
+ [ My CI/CD pipeline is my release captain ](https://aws.amazon.com/builders-library/cicd-pipeline/)
+ [ Practicing Continuous Integration and Continuous Delivery on AWS Whitepaper ](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/welcome.html)

 **Related videos:** 
+ [AWS re:Invent 2020: Testable infrastructure: Integration testing on AWS](https://www.youtube.com/watch?v=KJC380Juo2w)
+ [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development ](https://www.youtube.com/watch?v=1R7G_wcyd3s)
+ [ Testing Your Infrastructure as Code with AWS CDK ](https://www.youtube.com/watch?v=fWtuwGSoSOU)

 **Related resources:** 
+ [AWS Deployment Pipeline Reference Architecture - Application ](https://pipelines.devops.aws.dev/application-pipeline/index.html)
+ [AWS Kubernetes DevSecOps Pipeline ](https://github.com/aws-samples/devsecops-cicd-containers)
+ [ Policy as Code Workshop – Test Driven Development ](https://catalog.us-east-1.prod.workshops.aws/workshops/9da471a0-266a-4d36-8596-e5934aeedd1f/en-US/pac-tools/cfn-guard/tdd)
+ [ Run unit tests for a Node.js application from GitHub by using AWS CodeBuild](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-unit-tests-for-a-node-js-application-from-github-by-using-aws-codebuild.html)
+ [ Use Serverspec for test-driven development of infrastructure code ](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/use-serverspec-for-test-driven-development-of-infrastructure-code.html)

 **Related services:** 
+  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) 
+  [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 

# OPS05-BP03 Use configuration management systems
<a name="ops_dev_integ_conf_mgmt_sys"></a>

 Use configuration management systems to make and track configuration changes. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

Static configuration management sets values when initializing a resource that are expected to remain consistent throughout the resource’s lifetime. Dynamic configuration management sets values at initialization that can or are expected to change during the lifetime of a resource. For example, you could set a feature toggle to activate functionality in your code through a configuration change, or change the level of log detail during an incident.

Configurations should be deployed in a known and consistent state. You should use automated inspection to continually monitor resource configurations across environments and regions. These controls should be defined as code and management automated to ensure rules are consistently appplied across environments. Changes to configurations should be updated through agreed change control procedures and applied consistently, honoring version control. Application configuration should be managed independantly of application and infrastructure code. This allows for consistent deployment across multiple environments. Configuration changes do not result in rebuilding or redeploying the application. 

 **Desired outcome:** You configure, validate, and deploy as part of your continuous integration, continuous delivery (CI/CD) pipeline. You monitor to validate configurations are correct. This minimizes any impact to end users and customers. 

 **Common anti-patterns:** 
+  You manually update the web server configuration across your fleet and a number of servers become unresponsive due to update errors. 
+  You manually update your application server fleet over the course of many hours. The inconsistency in configuration during the change causes unexpected behaviors. 
+  Someone has updated your security groups and your web servers are no longer accessible. Without knowledge of what was changed you spend significant time investigating the issue extending your time to recovery. 
+  You push a pre-production configuration into production through CI/CD without validation. You expose users and customers to incorrect data and services. 

 **Benefits of establishing this best practice:** Adopting configuration management systems reduces the level of effort to make and track changes, and the frequency of errors caused by manual procedures. Configuration management systems provide assurances with regards to governance, compliance, and regulatory requirements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Configuration management systems are used to track and implement changes to application and environment configurations. Configuration management systems are also used to reduce errors caused by manual processes, make configuration changes repeatable and auditable, and reduce the level of effort. 

 On AWS, you can use [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to continually monitor your AWS resource configurations [across accounts and Regions](https://docs.aws.amazon.com/config/latest/developerguide/aggregate-data.html). It helps you to track their configuration history, understand how a configuration change would affect other resources, and audit them against expected or desired configurations using [AWS Config Rules](https://docs.aws.amazon.com/config/latest/developerguide/evaluate-config.html) and [AWS Config Conformance Packs](https://docs.aws.amazon.com/config/latest/developerguide/conformance-packs.html). 

 For dynamic configurations in your applications running on Amazon EC2 instances, AWS Lambda, containers, mobile applications, or IoT devices, you can use [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) to configure, validate, deploy, and monitor them across your environments. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Identify configuration owners. 

   1.  Make configurations owners aware of any compliance, governance, or regulatory needs. 

1.  Identify configuration items and deliverables. 

   1.  Configuration items are all application and environmental configurations affected by a deployment within your CI/CD pipeline. 

   1.  Deliverables include success criteria, validation, and what to monitor. 

1.  Select tools for configuration management based on your business requirements and delivery pipeline. 

1.  Consider a weighted deployments such are canary deployments for significant configuration changes to minimise the impact of incorrect configurations. 

1.  Integrate your configuration management into your CI/CD pipeline. 

1.  Validate all changes pushed. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) 
+  [OPS06-BP03 Employ safe deployment strategies](ops_mit_deploy_risks_deploy_mgmt_sys.md) 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html)
+ [AWS Landing Zone Accelerator ](https://aws.amazon.com/solutions/implementations/landing-zone-accelerator-on-aws/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [ What is AWS Config? ](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html)
+  [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) 
+ [ What is AWS CloudFormation? ](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html)
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 

 **Related videos:** 
+ [AWS re:Invent 2022 - Proactive governance and compliance for AWS workloads ](https://youtu.be/PpUnH9Y52X0?si=82wff87KHXcc6nbT)
+ [AWS re:Invent 2020: Achieve compliance as code using AWS Config](https://youtu.be/m8vTwvbzOfw?si=my4DP0FLq1zwKjho)
+ [ Manage and Deploy Application Configurations with AWS AppConfig](https://youtu.be/ztIxMY3IIu0?si=ovYGsxWOBysyQrg0)

# OPS05-BP04 Use build and deployment management systems
<a name="ops_dev_integ_build_mgmt_sys"></a>

 Use build and deployment management systems. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

 In AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using services such as [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) (for example, AWS CodeCommit, [AWS CodeBuild](https://aws.amazon.com/codebuild/), [AWS CodePipeline](https://aws.amazon.com/codepipeline/), [AWS CodeDeploy](https://aws.amazon.com/codedeploy/), and [AWS CodeStar](https://aws.amazon.com/codestar/)). 

 **Desired outcome:** Your build and deployment management systems support your organization's continuous integration continuous delivery (CI/CD) system that provide capabilities for automating safe rollouts with the correct configurations. 

 **Common anti-patterns:** 
+  After compiling your code on your development system, you copy the executable onto your production systems and it fails to start. The local log files indicates that it has failed due to missing dependencies. 
+  You successfully build your application with new features in your development environment and provide the code to quality assurance (QA). It fails QA because it is missing static assets. 
+  On Friday, after much effort, you successfully built your application manually in your development environment including your newly coded features. On Monday, you are unable to repeat the steps that allowed you to successfully build your application. 
+  You perform the tests you have created for your new release. Then you spend the next week setting up a test environment and performing all the existing integration tests followed by the performance tests. The new code has an unacceptable performance impact and must be redeveloped and then retested. 

 **Benefits of establishing this best practice:** By providing mechanisms to manage build and deployment activities you reduce the level of effort to perform repetitive tasks, free your team members to focus on their high value creative tasks, and limit the introduction of error from manual procedures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Build and deployment management systems are used to track and implement change, reduce errors caused by manual processes, and reduce the level of effort required for safe deployments. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, decreases cost, encourages increased frequency of change, reduces the level of effort, and increases collaboration. 

### Implementation steps
<a name="implementation-steps"></a>

![\[Diagram showing a CI/CD pipeline using AWS CodePipeline and related services\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/deployment-pipeline-tooling.png)


1.  Use AWS CodeCommit to version control, store, and manage assets (such as documents, source code, and binary files). 

1.  Use CodeBuild to compile your source code, runs unit tests, and produces artifacts that are ready to deploy. 

1.  Use CodeDeploy as a deployment service that automates application deployments to [Amazon EC2](https://aws.amazon.com/ec2/) instances, on-premises instances, [serverless AWS Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html), or [Amazon ECS](https://aws.amazon.com/ecs/). 

1.  Monitor your deployments. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+ [ What is AWS CodeCommit? ](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html)
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+ [AWS CodeBuild](https://aws.amazon.com/codebuild/)
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+ [AWS re:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS](https://youtu.be/hfXokRAyorA)

# OPS05-BP05 Perform patch management
<a name="ops_dev_integ_patch_mgmt"></a>

 Perform patch management to gain features, address issues, and remain compliant with governance. Automate patch management to reduce errors caused by manual processes, scale, and reduce the level of effort to patch. 

 Patch and vulnerability management are part of your benefit and risk management activities. It is preferable to have immutable infrastructures and deploy workloads in verified known good states. Where that is not viable, patching in place is the remaining option. 

 [Amazon EC2 Image Builder](https://aws.amazon.com/image-builder/) provides pipelines to update machine images. As a part of patch management, consider [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html       ) (AMIs) using an [AMI image pipeline](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-image-pipeline.html) or container images with a [Docker image pipeline](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-container-pipeline.html), while AWS Lambda provides patterns for [custom runtimes and additional libraries](https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html) to remove vulnerabilities. 

 You should manage updates to [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) for Linux or Windows Server images using [Amazon EC2 Image Builder](https://aws.amazon.com/image-builder/). You can use [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) with your existing pipeline to manage Amazon ECS images and manage Amazon EKS images. Lambda includes [version management features](https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html). 

 Patching should not be performed on production systems without first testing in a safe environment. Patches should only be applied if they support an operational or business outcome. On AWS, you can use [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) to automate the process of patching managed systems and schedule the activity using [Systems Manager Maintenance Windows](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html). 

 **Desired outcome:** Your AMI and container images are patched, up-to-date, and ready for launch. You are able to track the status of all deployed images and know patch compliance. You are able to report on current status and have a process to meet your compliance needs. 

 **Common anti-patterns:** 
+  You are given a mandate to apply all new security patches within two hours resulting in multiple outages due to application incompatibility with patches. 
+  An unpatched library results in unintended consequences as unknown parties use vulnerabilities within it to access your workload. 
+  You patch the developer environments automatically without notifying the developers. You receive multiple complaints from the developers that their environment cease to operate as expected. 
+  You have not patched the commercial off-the-shelf software on a persistent instance. When you have an issue with the software and contact the vendor, they notify you that version is not supported and you have to patch to a specific level to receive any assistance. 
+  A recently released patch for the encryption software you used has significant performance improvements. Your unpatched system has performance issues that remain in place as a result of not patching. 
+  You are notified of a zero-day vulnerability requiring an emergency fix and you have to patch all your environments manually. 

 **Benefits of establishing this best practice:** By establishing a patch management process, including your criteria for patching and methodology for distribution across your environments, you can scale and report on patch levels. This provides assurances around security patching and ensure clear visibility on the status of known fixes being in place. This encourages adoption of desired features and capabilities, the rapid removal of issues, and sustained compliance with governance. Implement patch management systems and automation to reduce the level of effort to deploy patches and limit errors caused by manual processes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant with governance policy and vendor support requirements. In immutable systems, deploy with the appropriate patch set to achieve the desired result. Automate the patch management mechanism to reduce the elapsed time to patch, to avoid errors caused by manual processes, and lower the level of effort to patch. 

### Implementation steps
<a name="implementation-steps"></a>

 For Amazon EC2 Image Builder: 

1.  Using Amazon EC2 Image Builder, specify pipeline details: 

   1.  Create an image pipeline and name it 

   1.  Define pipeline schedule and time zone 

   1.  Configure any dependencies 

1.  Choose a recipe: 

   1.  Select existing recipe or create a new one 

   1.  Select image type 

   1.  Name and version your recipe 

   1.  Select your base image 

   1.  Add build components and add to target registry 

1.  Optional - define your infrastructure configuration. 

1.  Optional - define configuration settings. 

1.  Review settings. 

1.  Maintain recipe hygiene regularly. 

 For Systems Manager Patch Manager: 

1.  Create a patch baseline. 

1.  Select a pathing operations method. 

1.  Enable compliance reporting and scanning. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [ What is Amazon EC2 Image Builder ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html)
+ [ Create an image pipeline using the Amazon EC2 Image Builder ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-image-pipeline.html)
+ [ Create a container image pipeline ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-container-pipeline.html)
+  [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 
+ [ Working with Patch Manager ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-console.html)
+ [ Working with patch compliance reports ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-compliance-reports.html)
+ [AWS Developer Tools ](https://aws.amazon.com/products/developer-tools)

 **Related videos:** 
+  [CI/CD for Serverless Applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
+  [Design with Ops in Mind](https://youtu.be/uh19jfW7hw4) 

   **Related examples:** 
+ [ Well-Architected Labs - Inventory and Patch Management ](https://wellarchitectedlabs.com/operational-excellence/100_labs/100_inventory_patch_management)
+ [AWS Systems Manager Patch Manager tutorials ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-tutorials.html)

# OPS05-BP06 Share design standards
<a name="ops_dev_integ_share_design_stds"></a>

 Share best practices across teams to increase awareness and maximize the benefits of development efforts. Document them and keep them up to date as your architecture evolves. If shared standards are enforced in your organization, it’s critical that mechanisms exist to request additions, changes, and exceptions to standards. Without this option, standards become a constraint on innovation. 

 **Desired outcome:** Design standards are shared across teams in your organizations. They are documented and kept up-to-date as best practices evolve. 

 **Common anti-patterns:** 
+ Two development teams have each created a user authentication service. Your users must maintain a separate set of credentials for each part of the system they want to access. 
+ Each team manages their own infrastructure. A new compliance requirement forces a change to your infrastructure and each team implements it in a different way.

 **Benefits of establishing this best practice:** Using shared standards supports the adoption of best practices and maximizes the benefits of development efforts. Documenting and updating design standards keeps your organization up-to-date with best practices and security and compliance requirements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Share existing best practices, design standards, checklists, operating procedures, guidance, and governance requirements across teams. Have procedures to request changes, additions, and exceptions to design standards to support improvement and innovation. Make teams are aware of published content. Have a mechanism to keep design standards up-to-date as new best practices emerge. 

 **Customer example** 

 AnyCompany Retail has a cross-functional architecture team that creates software architecture patterns. This team builds the architecture with compliance and governance built in. Teams that adopt these shared standards get the benefits of having compliance and governance built in. They can quickly build on top of the design standard. The architecture team meets quarterly to evaluate architecture patterns and update them if necessary. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Identify a cross-functional team that owns developing and updating design standards. This team should work with stakeholders across your organization to develop design standards, operating procedures, checklists, guidance, and governance requirements. Document the design standards and share them within your organization. 

   1.  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) can be used to create portfolios representing design standards using infrastructure as code. You can share portfolios across accounts. 

1.  Have a mechanism in place to keep design standards up-to-date as new best practices are identified. 

1.  If design standards are centrally enforced, have a process to request changes, updates, and exemptions. 

 **Level of effort for the implementation plan:** Medium. Developing a process to create and share design standards can take coordination and cooperation with stakeholders across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements influence design standards. 
+  [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) - Compliance is a vital input in creating design standards. 
+  [OPS07-BP02 Ensure a consistent review of operational readiness](ops_ready_to_support_const_orr.md) - Operational readiness checklists are a mechanism to implement design standards when designing your workload. 
+  [OPS11-BP01 Have a process for continuous improvement](ops_evolve_ops_process_cont_imp.md) - Updating design standards is a part of continuous improvement. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - As part of your knowledge management practice, document and share design standards. 

 **Related documents:** 
+ [ Automate AWS Backups with AWS Service Catalog](https://aws.amazon.com/blogs/mt/automate-aws-backups-with-aws-service-catalog/)
+ [AWS Service Catalog Account Factory-Enhanced ](https://aws.amazon.com/blogs/mt/aws-service-catalog-account-factory-enhanced/)
+ [ How Expedia Group built Database as a Service (DBaaS) offering using AWS Service Catalog](https://aws.amazon.com/blogs/mt/how-expedia-group-built-database-as-a-service-dbaas-offering-using-aws-service-catalog/)
+ [ Maintain visibility over the use of cloud architecture patterns ](https://aws.amazon.com/blogs/architecture/maintain-visibility-over-the-use-of-cloud-architecture-patterns/)
+ [ Simplify sharing your AWS Service Catalog portfolios in an AWS Organizations setup ](https://aws.amazon.com/blogs/mt/simplify-sharing-your-aws-service-catalog-portfolios-in-an-aws-organizations-setup/)

 **Related videos:** 
+ [AWS Service Catalog – Getting Started ](https://www.youtube.com/watch?v=A9kKy6WhqVA)
+ [AWS re:Invent 2020: Manage your AWS Service Catalog portfolios like an expert ](https://www.youtube.com/watch?v=lVfXkWHAtR8)

 **Related examples:** 
+ [AWS Service Catalog Reference Architecture ](https://github.com/aws-samples/aws-service-catalog-reference-architectures)
+ [AWS Service Catalog Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/d40750d7-a330-49be-9945-cde864610de9/en-US)

 **Related services:** 
+  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) 

# OPS05-BP07 Implement practices to improve code quality
<a name="ops_dev_integ_code_quality"></a>

Implement practices to improve code quality and minimize defects. Some examples include test-driven development, code reviews, standards adoption, and pair programming. Incorporate these practices into your continuous integration and delivery process. 

 **Desired outcome:** Your organization uses best practices like code reviews or pair programming to improve code quality. Developers and operators adopt code quality best practices as part of the software development lifecycle. 

 **Common anti-patterns:** 
+ You commit code to the main branch of your application without a code review. The change automatically deploys to production and causes an outage.
+  A new application is developed without any unit, end-to-end, or integration tests. There is no way to test the application before deployment. 
+  Your teams make manual changes in production to address defects. Changes do not go through testing or code reviews and are not captured or logged through continuous integration and delivery processes. 

 **Benefits of establishing this best practice:** By adopting practices to improve code quality, you can help minimize issues introduced to production. Code quality increases using best practices like pair programming and code reviews. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement practices to improve code quality to minimize defects before they are deployed. Use practices like test-driven development, code reviews, and pair programming to increase the quality of your development. 

 **Customer example** 

 AnyCompany Retail adopts several practices to improve code quality. They have adopted test-driven development as the standard for writing applications. For some new features, they have developers pair program together during a sprint. Every pull request goes through a code review by a senior developer before being integrated and deployed. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Adopt code quality practices like test-driven development, code reviews, and pair programming into your continuous integration and delivery process. Use these techniques to improve software quality. 

   1.  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) can provide programming recommendations for Java and Python code using machine learning. 

   1.  You can create shared development environments with [AWS Cloud9](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) where you can collaborate on developing code. 

 **Level of effort for the implementation plan:** Medium. There are many ways of implementing this best practice, but getting organizational adoption may be challenging. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md) - You can share design standards as part of your code quality practice. 

 **Related documents:** 
+ [ Agile Software Guide ](https://martinfowler.com/agile.html)
+ [My CI/CD pipeline is my release captain](https://aws.amazon.com/builders-library/cicd-pipeline/)
+ [ Automate code reviews with Amazon CodeGuru Reviewer ](https://aws.amazon.com/blogs/devops/automate-code-reviews-with-amazon-codeguru-reviewer/)
+ [ Adopt a test-driven development approach ](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html)
+ [ How DevFactory builds better applications with Amazon CodeGuru ](https://aws.amazon.com/blogs/machine-learning/how-devfactory-builds-better-applications-with-amazon-codeguru/)
+ [ On Pair Programming ](https://martinfowler.com/articles/on-pair-programming.html)
+ [ RENGA Inc. automates code reviews with Amazon CodeGuru ](https://aws.amazon.com/blogs/machine-learning/renga-inc-automates-code-reviews-with-amazon-codeguru/)
+ [ The Art of Agile Development: Test-Driven Development ](http://www.jamesshore.com/v2/books/aoad1/test_driven_development)
+ [ Why code reviews matter (and actually save time\$1) ](https://www.atlassian.com/agile/software-development/code-reviews)

 **Related videos:** 
+ [AWS re:Invent 2020: Continuous improvement of code quality with Amazon CodeGuru ](https://www.youtube.com/watch?v=iX1i35H1OVw)
+ [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development ](https://www.youtube.com/watch?v=1R7G_wcyd3s)

 **Related services:** 
+ [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html)
+ [ Amazon CodeGuru Profiler ](https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html)
+  [AWS Cloud9](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) 

# OPS05-BP08 Use multiple environments
<a name="ops_dev_integ_multi_env"></a>

 Use multiple environments to experiment, develop, and test your workload. Use increasing levels of controls as environments approach production to gain confidence your workload operates as intended when deployed. 

 **Desired outcome:** You have multiple environments that reflect your compliance and governance needs. You test and promote code through environments on your path to production. 

 **Common anti-patterns:** 
+  You are performing development in a shared development environment and another developer overwrites your code changes. 
+  The restrictive security controls on your shared development environment are preventing you from experimenting with new services and features. 
+  You perform load testing on your production systems and cause an outage for your users. 
+  A critical error resulting in data loss has occurred in production. In your production environment, you attempt to recreate the conditions that lead to the data loss so that you can identify how it happened and prevent it from happening again. To prevent further data loss during testing, you are forced to make the application unavailable to your users. 
+  You are operating a multi-tenant service and are unable to support a customer request for a dedicated environment. 
+  You may not always test, but when you do, you test in your production environment. 
+  You believe that the simplicity of a single environment overrides the scope of impact of changes within the environment. 

 **Benefits of establishing this best practice:** You can support multiple simultaneous development, testing, and production environments without creating conflicts between developers or user communities. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use multiple environments and provide developers sandbox environments with minimized controls to aid in experimentation. Provide individual development environments to help work in parallel, increasing development agility. Implement more rigorous controls in the environments approaching production to allow developers to innovate. Use infrastructure as code and configuration management systems to deploy environments that are configured consistent with the controls present in production to ensure systems operate as expected when deployed. When environments are not in use, turn them off to avoid costs associated with idle resources (for example, development systems on evenings and weekends). Deploy production equivalent environments when load testing to improve valid results. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Instance Scheduler on AWS](https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/)
+  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 

# OPS05-BP09 Make frequent, small, reversible changes
<a name="ops_dev_integ_freq_sm_rev_chg"></a>

 Frequent, small, and reversible changes reduce the scope and impact of a change. When used in conjunction with change management systems, configuration management systems, and build and delivery systems frequent, small, and reversible changes reduce the scope and impact of a change. This results in more effective troubleshooting and faster remediation with the option to roll back changes. 

 **Common anti-patterns:** 
+  You deploy a new version of your application quarterly with a change window that means a core service is turned off. 
+  You frequently make changes to your database schema without tracking changes in your management systems. 
+  You perform manual in-place updates, overwriting existing installations and configurations, and have no clear roll-back plan. 

 **Benefits of establishing this best practice:** Development efforts are faster by deploying small changes frequently. When the changes are small, it is much easier to identify if they have unintended consequences, and they are easier to reverse. When the changes are reversible, there is less risk to implementing the change, as recovery is simplified. The change process has a reduced risk and the impact of a failed change is reduced. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use frequent, small, and reversible changes to reduce the scope and impact of a change. This eases troubleshooting, helps with faster remediation, and provides the option to roll back a change. It also increases the rate at which you can deliver value to the business. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md) 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [ Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/microservices-on-aws.html)
+ [ Microservices - Observability ](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/observability.html)

# OPS05-BP10 Fully automate integration and deployment
<a name="ops_dev_integ_auto_integ_deploy"></a>

 Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes. 

 Apply metadata using [Resource Tags](https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html) and [AWS Resource Groups](https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html) following a consistent [tagging strategy](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/) to aid in identification of your resources. Tag your resources for organization, cost accounting, access controls, and targeting the run of automated operations activities. 

 **Desired outcome:** Developers use tools to deliver code and promote through to production. Developers do not have to log into the AWS Management Console to deliver updates. There is a full audit trail of change and configuration, meeting the needs of governance and compliance. Processes are repeatable and are standardized across teams. Developers are free to focus on development and code pushes, increasing productivity. 

 **Common anti-patterns:** 
+  On Friday, you finish authoring the new code for your feature branch. On Monday, after running your code quality test scripts and each of your unit tests scripts, you check in your code for the next scheduled release. 
+  You are assigned to code a fix for a critical issue impacting a large number of customers in production. After testing the fix, you commit your code and email change management to request approval to deploy it to production. 
+  As a developer, you log into the AWS Management Console to create a new development environment using non-standard methods and systems. 

 **Benefits of establishing this best practice:** By implementing automated build and deployment management systems, you reduce errors caused by manual processes and reduce the effort to deploy changes helping your team members to focus on delivering business value. You increase the speed of delivery as you promote through to production. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 You use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, reduces the level of effort, increases the speed to market, results in increased productivity, and increases the security of your code as you promote through to production. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md) 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 

 **Related documents:** 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+ [AWS re\$1:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS](https://youtu.be/hfXokRAyorA)

# OPS 6. How do you mitigate deployment risks?
<a name="ops-06"></a>

 Adopt approaches that provide fast feedback on quality and achieve rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes. 

**Topics**
+ [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md)
+ [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md)
+ [OPS06-BP03 Employ safe deployment strategies](ops_mit_deploy_risks_deploy_mgmt_sys.md)
+ [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md)

# OPS06-BP01 Plan for unsuccessful changes
<a name="ops_mit_deploy_risks_plan_for_unsucessful_changes"></a>

Plan to revert to a known good state, or remediate in the production environment if the deployment causes an undesired outcome. Having a policy to establish such a plan helps all teams develop strategies to recover from failed changes. Some example strategies are deployment and rollback steps, change policies, feature flags, traffic isolation, and traffic shifting. A single release may include multiple related component changes. The strategy should provide the ability to withstand or recover from a failure of any component change.

 **Desired outcome:** You have prepared a detailed recovery plan for your change in the event it is unsuccessful. In addition, you have reduced the size of your release to minimize the potential impact on other workload components. As a result, you have reduced your business impact by shortening the potential downtime caused by a failed change and increased the flexibility and efficiency of recovery times. 

 **Common anti-patterns:** 
+  You performed a deployment and your application has become unstable but there appear to be active users on the system. You have to decide whether to rollback the change and impact the active users or wait to rollback the change knowing the users may be impacted regardless. 
+  After making a routine change, your new environments are accessible, but one of your subnets has become unreachable. You have to decide whether to rollback everything or try to fix the inaccessible subnet. While you are making that determination, the subnet remains unreachable. 
+  Your systems are not architected in a way that allows them to be updated with smaller releases. As a result, you have difficulty in reversing those bulk changes during a failed deployment. 
+  You do not use infrastructure as code (IaC) and you made manual updates to your infrastructure that resulted in an undesired configuration. You are unable to effectively track and revert the manual changes. 
+  Because you have not measured increased frequency of your deployments, your team is not incentivized to reduce the size of their changes and improve their rollback plans for each change, leading to more risk and increased failure rates. 
+  You do not measure the total duration of an outage caused by unsuccessful changes. Your team is unable to prioritize and improve its deployment process and recovery plan effectiveness. 

 **Benefits of establishing this best practice:** Having a plan to recover from unsuccessful changes minimizes the mean time to recover (MTTR) and reduces your business impact. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 A consistent, documented policy and practice adopted by release teams allows an organization to plan what should happen if unsuccessful changes occur. The policy should allow for fixing forward in specific circumstances. In either situation, a fix forward or rollback plan should be well documented and tested before deployment to live production so that the time it takes to revert a change is minimized. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Document the policies that require teams to have effective plans to reverse changes within a specified period. 

   1.  Policies should specify when a fix-forward situation is allowed. 

   1.  Require a documented rollback plan to be accessible by all involved. 

   1.  Specify the requirements to rollback (for example, when it is found that unauthorized changes have been deployed). 

1.  Analyze the level of impact of all changes related to each component of a workload. 

   1.  Allow repeatable changes to be standardized, templated, and preauthorized if they follow a consistent workflow that enforces change policies. 

   1.  Reduce the potential impact of any change by making the size of the change smaller so recovery takes less time and causes less business impact. 

   1.  Ensure rollback procedures revert code to the known good state to avoid incidents where possible. 

1.  Integrate tools and workflows to enforce your policies programatically. 

1.  Make data about changes visible to other workload owners to improve the speed of diagnosis of any failed change that cannot be rolled back. 

   1.  Measure success of this practice using visible change data and identify iterative improvements. 

1.  Use monitoring tools to verify the success or failure of a deployment to speed up decision-making on rolling back. 

1.  Measure your duration of outage during an unsuccessful change to continually improve your recovery plans. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Ensuring Rollback Safety During Deployments ](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/)
+ [AWS Whitepaper \$1 Change Management in the Cloud ](https://docs.aws.amazon.com/whitepapers/latest/change-management-in-the-cloud/change-management-in-the-cloud.html)

 **Related videos:** 
+ [ re:Invent 2019 \$1 Amazon’s approach to high-availability deployment ](https://aws.amazon.com/builders-library/amazon-approach-to-high-availability-deployment/)

# OPS06-BP02 Test deployments
<a name="ops_mit_deploy_risks_test_val_chg"></a>

 Test release procedures in pre-production by using the same deployment configuration, security controls, steps, and procedures as in production. Validate that all deployed steps are completed as expected, such as inspecting files, configurations, and services. Further test all changes with functional, integration, and load tests, along with any monitoring such as health checks. By doing these tests, you can identify deployment issues early with an opportunity to plan and mitigate them prior to production. 

 You can create temporary parallel environments for testing every change. Automate the deployment of the test environments using infrastructure as code (IaC) to help reduce amount of work involved and ensure stability, consistency, and faster feature delivery. 

 **Desired outcome:** Your organization adopts a test-driven development culture that includes testing deployments. This ensures teams are focused on delivering business value rather than managing releases. Teams are engaged early upon identification of deployment risks to determine the appropriate course of mitigation. 

 **Common anti-patterns:** 
+  During production releases, untested deployments cause frequent issues that require troubleshooting and escalation. 
+  Your release contains infrastructure as code (IaC) that updates existing resources. You are unsure if the IaC runs successfully or causes impact to the resources. 
+  You deploy a new feature to your application. It doesn't work as intended and there is no visibility until it gets reported by impacted users. 
+  You update your certificates. You accidentally install the certificates to the wrong components, which goes undetected and impacts website visitors because a secure connection to the website can't be established. 

 **Benefits of establishing this best practice:** Extensive testing in pre-production of deployment procedures, and the changes introduced by them, minimizes the potential impact to production caused by the deployments steps. This increases confidence during production release and minimizes operational support without slowing down velocity of the changes being delivered. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Testing your deployment process is as important as testing the changes that result from your deployment. This can be achieved by testing your deployment steps in a pre-production environment that mirrors production as closely as possible. Common issues, such as incomplete or incorrect deployment steps, or misconfigurations, can be caught as a result before going to production. In addition, you can test your recovery steps. 

 **Customer example** 

 As part of their continuous integration and continuous delivery (CI/CD) pipeline, AnyCompany Retail performs the defined steps needed to release infrastructure and software updates for its customers in a production-like environment. The pipeline is comprised of pre-checks to detect drift (detecting changes to resources performed outside of your IaC) in resources prior to deployment, as well as validate actions that the IaC takes upon its initiation. It validates deployment steps, like verifying that certain files and configurations are in place and services are in running states and are responding correctly to health checks on local host before re-registering with the load balancer. Additionally, all changes flag a number of automated tests, such as functional, security, regression, integration, and load tests. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Perform pre-install checks to mirror the pre-production environment to production. 

   1.  Use [drift detection](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html) to detect when resources have been changed outside of CloudFormation. 

   1.  Use [change sets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html) to validate that the intent of a stack update matches the actions that CloudFormation takes when the change set is initiated. 

1.  This triggers a manual approval step in [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/approvals.html) to authorize the deployment to the pre-production environment. 

1.  Use deployment configurations such as [AWS CodeDeploy AppSpec](https://docs.aws.amazon.com/codedeploy/latest/userguide/application-specification-files.html) files to define deployment and validation steps. 

1.  Where applicable, [integrate AWS CodeDeploy with other AWS services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-aws.html) or [integrate AWS CodeDeploy with partner product and services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-partners.html). 

1.  [Monitor deployments](https://docs.aws.amazon.com/codedeploy/latest/userguide/monitoring.html) using Amazon CloudWatch, AWS CloudTrail, and Amazon SNS event notifications. 

1.  Perform post-deployment automated testing, including functional, security, regression, integration, and load testing. 

1.  [Troubleshoot](https://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html) deployment issues. 

1.  Successful validation of preceding steps should initiate a manual approval workflow to authorize deployment to production. 

 **Level of effort for the implementation plan:** High 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md) 

 **Related documents:** 
+ [AWS Builders' Library \$1 Automating safe, hands-off deployments \$1 Test Deployments ](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/#Test_deployments_in_pre-production_environments)
+ [AWS Whitepaper \$1 Practicing Continuous Integration and Continuous Delivery on AWS](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html)
+ [ The Story of Apollo - Amazon's Deployment Engine ](https://www.allthingsdistributed.com/2014/11/apollo-amazon-deployment-engine.html)
+  [How to test and debug AWS CodeDeploy locally before you ship your code](https://aws.amazon.com/blogs/devops/how-to-test-and-debug-aws-codedeploy-locally-before-you-ship-your-code/) 
+ [ Integrating Network Connectivity Testing with Infrastructure Deployment ](https://aws.amazon.com/blogs/networking-and-content-delivery/integrating-network-connectivity-testing-with-infrastructure-deployment/)

 **Related videos:** 
+ [ re:Invent 2020 \$1 Testing software and systems at Amazon ](https://www.youtube.com/watch?v=o1sc3cK9bMU)

 **Related examples:** 
+ [ Tutorial \$1 Deploy and Amazon ECS service with a validation test ](https://docs.aws.amazon.com/codedeploy/latest/userguide/tutorial-ecs-deployment-with-hooks.html)

# OPS06-BP03 Employ safe deployment strategies
<a name="ops_mit_deploy_risks_deploy_mgmt_sys"></a>

 Safe production roll-outs control the flow of beneficial changes with an aim to minimize any perceived impact for customers from those changes. The safety controls provide inspection mechanisms to validate desired outcomes and limit the scope of impact from any defects introduced by the changes or from deployment failures. Safe roll-outs may include strategies such as feature-flags, one-box, rolling (canary releases), immutable, traffic splitting, and blue/green deployments. 

 **Desired outcome:** Your organization uses a continuous integration continuous delivery (CI/CD) system that provides capabilities for automating safe rollouts. Teams are required to use appropriate safe roll-out strategies. 

 **Common anti-patterns:** 
+  You deploy an unsuccessful change to all of production all at once. As a result, all customers are impacted simultaneously. 
+  A defect introduced in a simultaneous deployment to all systems requires an emergency release. Correcting it for all customers takes several days. 
+  Managing production release requires planning and participation of several teams. This puts constraints on your ability to frequently update features for your customers. 
+  You perform a mutable deployment by modifying your existing systems. After discovering that the change was unsuccessful, you are forced to modify the systems again to restore the old version, extending your time to recovery. 

 **Benefits of establishing this best practice:** Automated deployments balance speed of roll-outs against delivering beneficial changes consistently to customers. Limiting impact prevents costly deployment failures and maximizes teams ability to efficiently respond to failures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Continuous-delivery failures can lead to reduced service availability and bad customer experiences. To maximize the rate of successful deployments, implement safety controls in the end-to-end release process to minimize deployment errors, with a goal of achieving zero deployment failures. 

 **Customer example** 

 AnyCompany Retail is on a mission to achieve minimal to zero downtime deployments, meaning that there's no perceivable impact to its users during deployments. To accomplish this, the company has established deployment patterns (see the following workflow diagram), such as rolling and blue/green deployments. All teams adopt one or more of these patterns in their CI/CD pipeline. 


| CodeDeploy workflow for Amazon EC2 | CodeDeploy workflow for Amazon ECS | CodeDeploy workflow for Lambda | 
| --- | --- | --- | 
|  ![\[Deployment process flow for Amazon EC2\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/deployment-process-ec2.png)  |  ![\[Deployment process flow for Amazon ECS\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/deployment-process-ecs.png)  |  ![\[Deployment process flow for Lambda\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/deployment-process-lambda.png)  | 

### Implementation steps
<a name="implementation-steps"></a>

1.  Use an approval workflow to initiate the sequence of production roll-out steps upon promotion to production . 

1.  Use an automated deployment system such as [AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html). AWS CodeDeploy [deployment options](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps.html) include in-place deployments for EC2/On-Premises and blue/green deployments for EC2/On-Premises, AWS Lambda, and Amazon ECS (see the preceding workflow diagram). 

   1.  Where applicable, [integrate AWS CodeDeploy with other AWS services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-aws.html) or [integrate AWS CodeDeploy with partner product and services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-partners.html). 

1.  Use blue/green deployments for databases such as [Amazon Aurora](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/blue-green-deployments.html) and [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-green-deployments.html). 

1.  [Monitor deployments](https://docs.aws.amazon.com/codedeploy/latest/userguide/monitoring.html) using Amazon CloudWatch, AWS CloudTrail, and Amazon SNS event notifications. 

1.  Perform post-deployment automated testing including functional, security, regression, integration, and any load tests. 

1.  [Troubleshoot](https://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html) deployment issues. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md) 
+  [OPS05-BP09 Make frequent, small, reversible changes](ops_dev_integ_freq_sm_rev_chg.md) 
+  [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Automating safe, hands-off deployments \$1 Production deployments ](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/?did=ba_card&trk=ba_card#Production_deployments)
+ [AWS Builders Library \$1 My CI/CD pipeline is my release captain \$1 Safe, automatic production releases](https://aws.amazon.com//builders-library/cicd-pipeline/#Safe.2C_automatic_production_releases)
+ [AWS Whitepaper \$1 Practicing Continuous Integration and Continuous Delivery on AWS \$1 Deployment methods](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/deployment-methods.html)
+ [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html)
+ [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html)
+ [Set up an API Gateway canary release deployment ](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html)
+ [Amazon ECS Deployment Types](https://docs.aws.amazon.com/)
+ [Fully Managed Blue/Green Deployments in Amazon Aurora and Amazon RDS](https://aws.amazon.com/blogs/aws/new-fully-managed-blue-green-deployments-in-amazon-aurora-and-amazon-rds/)
+ [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html)

 **Related videos:** 
+ [re:Invent 2020 \$1 Hands-off: Automating continuous delivery pipelines at Amazon](https://www.youtube.com/watch?v=ngnMj1zbMPY)
+ [re:Invent 2019 \$1 Amazon's Approach to high-availability deployment](https://www.youtube.com/watch?v=bCgD2bX1LI4)

 **Related examples:** 
+ [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html)
+ [Worlshop \$1 Buiding CI/CD pipelines for Lambda canary deployments using AWS CDK](https://catalog.us-east-1.prod.workshops.aws/workshops/5195ab7c-5ded-4ee2-a1c5-775300717f42/en-US)
+ [Workshop \$1 Blue/Green and Canary Deployment for EKS and ECS](https://catalog.us-east-1.prod.workshops.aws/workshops/2175d94a-cd79-4ed2-8e7e-1f0dd1956a3a/en-US)
+ [Workshop \$1 Building a Cross-account CI/CD Pipeline](https://catalog.us-east-1.prod.workshops.aws/workshops/00bc829e-fd7c-4204-9da1-faea3cf8bd88/en-US)

# OPS06-BP04 Automate testing and rollback
<a name="ops_mit_deploy_risks_auto_testing_and_rollback"></a>

 To increase the speed, reliability, and confidence of your deployment process, have a strategy for automated testing and rollback capabilities in pre-production and production environments. Automate testing when deploying to production to simulate human and system interactions that verify the changes being deployed. Automate rollback to revert back to a previous known good state quickly. The rollback should be initiated automatically on pre-defined conditions such as when the desired outcome of your change is not achieved or when the automated test fails. Automating these two activities improves your success rate for your deployments, minimizes recovery time, and reduces the potential impact to the business. 

 **Desired outcome:** Your automated tests and rollback strategies are integrated into your continuous integration, continuous delivery (CI/CD) pipeline. Your monitoring is able to validate against your success criteria and initiate automatic rollback upon failure. This minimizes any impact to end users and customers. For example, when all testing outcomes have been satisfied, you promote your code into the production environment where automated regression testing is initiated, leveraging the same test cases. If regression test results do not match expectations, then automated rollback is initiated in the pipeline workflow. 

 **Common anti-patterns:** 
+  Your systems are not architected in a way that allows them to be updated with smaller releases. As a result, you have difficulty in reversing those bulk changes during a failed deployment. 
+  Your deployment process consists of a series of manual steps. After you deploy changes to your workload, you start post-deployment testing. After testing, you realize that your workload is inoperable and customers are disconnected. You then begin rolling back to the previous version. All of these manual steps delay overall system recovery and cause a prolonged impact to your customers. 
+  You spent time developing automated test cases for functionality that is not frequently used in your application, minimizing the return on investment in your automated testing capability. 
+  Your release is comprised of application, infrastructure, patches and configuration updates that are independent from one another. However, you have a single CI/CD pipeline that delivers all changes at once. A failure in one component forces you to revert all changes, making your rollback complex and inefficient. 
+  Your team completes the coding work in sprint one and begins sprint two work, but your plan did not include testing until sprint three. As a result, automated tests revealed defects from sprint one that had to be resolved before testing of sprint two deliverables could be started and the entire release is delayed, devaluing your automated testing. 
+  Your automated regression test cases for the production release are complete, but you are not monitoring workload health. Since you have no visibility into whether or not the service has restarted, you are not sure if rollback is needed or if it has already occurred. 

 **Benefits of establishing this best practice:** Automated testing increases the transparency of your testing process and your ability to cover more features in a shorter time period. By testing and validating changes in production, you are able to identify issues immediately. Improvement in consistency with automated testing tools allows for better detection of defects. By automatically rolling back to the previous version, the impact on your customers is minimized. Automated rollback ultimately inspires more confidence in your deployment capabilities by reducing business impact. Overall, these capabilities reduce time-to-delivery while ensuring quality. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Automate testing of deployed environments to confirm desired outcomes more quickly. Automate rollback to a previous known good state when pre-defined outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes. Integrate testing tools with your pipeline workflow to consistently test and minimize manual inputs. Prioritize automating test cases, such as those that mitigate the greatest risks and need to be tested frequently with every change. Additionally, automate rollback based on specific conditions that are pre-defined in your test plan. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Establish a testing lifecycle for your development lifecycle that defines each stage of the testing process from requirements planning to test case development, tool configuration, automated testing, and test case closure. 

   1.  Create a workload-specific testing approach from your overall test strategy. 

   1.  Consider a continuous testing strategy where appropriate throughout the development lifecycle. 

1.  Select automated tools for testing and rollback based on your business requirements and pipeline investments. 

1.  Decide which test cases you wish to automate and which should be performed manually. These can be defined based on business value priority of the feature being tested. Align all team members to this plan and verify accountability for performing manual tests. 

   1.  Apply automated testing capabilities to specific test cases that make sense for automation, such as repeatable or frequently run cases, those that require repetitive tasks, or those that are required across multiple configurations. 

   1.  Define test automation scripts as well as the success criteria in the automation tool so continued workflow automation can be initiated when specific cases fail. 

   1.  Define specific failure criteria for automated rollback. 

1.  Prioritize test automation to drive consistent results with thorough test case development where complexity and human interaction have a higher risk of failure. 

1.  Integrate your automated testing and rollback tools into your CI/CD pipeline. 

   1.  Develop clear success criteria for your changes. 

   1.  Monitor and observe to detect these criteria and automatically reverse changes when specific rollback criteria are met. 

1.  Perform different types of automated production testing, such as: 

   1.  A/B testing to show results in comparison to the current version between two user testing groups. 

   1.  Canary testing that allows you to roll out your change to a subset of users before releasing it to all. 

   1.  Feature-flag testing which allows a single feature of the new version at a time to be flagged on and off from outside the application so that each new feature can be validated one at a time. 

   1.  Regression testing to verify new functionality with existing interrelated components. 

1.  Monitor the operational aspects of the application, transactions, and interactions with other applications and components. Develop reports to show success of changes by workload so that you can identify what parts of the automation and workflow can be further optimized. 

   1.  Develop test result reports that help you make quick decisions on whether or not rollback procedures should be invoked. 

   1.  Implement a strategy that allows for automated rollback based upon pre-defined failure conditions that result from one or more of your test methods. 

1.  Develop your automated test cases to allow for reusability across future repeatable changes. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Ensuring rollback safety during deployments ](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/)
+  [Redeploy and rollback a deployment with AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployments-rollback-and-redeploy.html) 
+ [ 8 best practices when automating your deployments with AWS CloudFormation](https://aws.amazon.com/blogs/infrastructure-and-automation/best-practices-automating-deployments-with-aws-cloudformation/)

 **Related examples:** 
+ [ Serverless UI testing using Selenium, AWS Lambda, AWS Fargate, and AWS Developer Tools ](https://aws.amazon.com/blogs/devops/using-aws-codepipeline-aws-codebuild-and-aws-lambda-for-serverless-automated-ui-testing/)

 **Related videos:** 
+ [ re:Invent 2020 \$1 Hands-off: Automating continuous delivery pipelines at Amazon ](https://www.youtube.com/watch?v=ngnMj1zbMPY)
+ [ re:Invent 2019 \$1 Amazon's Approach to high-availability deployment ](https://www.youtube.com/watch?v=bCgD2bX1LI4)

# OPS 7. How do you know that you are ready to support a workload?
<a name="ops-07"></a>

 Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand the operational risks related to your workload. 

**Topics**
+ [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md)
+ [OPS07-BP02 Ensure a consistent review of operational readiness](ops_ready_to_support_const_orr.md)
+ [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md)
+ [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md)
+ [OPS07-BP05 Make informed decisions to deploy systems and changes](ops_ready_to_support_informed_deploy_decisions.md)
+ [OPS07-BP06 Create support plans for production workloads](ops_ready_to_support_enable_support_plans.md)

# OPS07-BP01 Ensure personnel capability
<a name="ops_ready_to_support_personnel_capability"></a>

Have a mechanism to validate that you have the appropriate number of trained personnel to support the workload. They must be trained on the platform and services that make up your workload. Provide them with the knowledge necessary to operate the workload. You must have enough trained personnel to support the normal operation of the workload and troubleshoot any incidents that occur. Have enough personnel so that you can rotate during on-call and vacations to avoid burnout. 

 **Desired outcome:** 
+  There are enough trained personnel to support the workload at times when the workload is available. 
+  You provide training for your personnel on the software and services that make up your workload. 

 **Common anti-patterns:** 
+ Deploying a workload without team members trained to operate the platform and services in use. 
+  Not having enough personnel to support on-call rotations or personnel taking time off. 

 **Benefits of establishing this best practice:** 
+  Having skilled team members helps effective support of your workload. 
+  With enough team members, you can support the workload and on-call rotations while decreasing the risk of burnout. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Validate that there are sufficient trained personnel to support the workload. Verify that you have enough team members to cover normal operational activities, including on-call rotations. 

 **Customer example** 

 AnyCompany Retail makes sure that teams supporting the workload are properly staffed and trained. They have enough engineers to support an on-call rotation. Personnel get training on the software and platform that the workload is built on and are encouraged to earn certifications. There are enough personnel so that people can take time off while still supporting the workload and the on-call rotation. 

 **Implementation steps** 

1.  Assign an adequate number of personnel to operate and support your workload, including on-call duties. 

1.  Train your personnel on the software and platforms that compose your workload. 

   1.  [AWS Training and Certification](https://aws.amazon.com/training/) has a library of courses about AWS. They provide free and paid courses, online and in-person. 

   1.  [AWS hosts events and webinars](https://aws.amazon.com/events/) where you learn from AWS experts. 

1.  Regularly evaluate team size and skills as operating conditions and the workload change. Adjust team size and skills to match operational requirements. 

 **Level of effort for the implementation plan:** High. Hiring and training a team to support a workload can take significant effort but has substantial long-term benefits. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - Team members must have the information necessary to operate and support the workload. Knowledge management is the key to providing that. 

 **Related documents:** 
+  [AWS Events and Webinars](https://aws.amazon.com/events/) 
+  [AWS Training and Certification](https://aws.amazon.com/training/) 

# OPS07-BP02 Ensure a consistent review of operational readiness
<a name="ops_ready_to_support_const_orr"></a>

Use Operational Readiness Reviews (ORRs) to validate that you can operate your workload. ORR is a mechanism developed at Amazon to validate that teams can safely operate their workloads. An ORR is a review and inspection process using a checklist of requirements. An ORR is a self-service experience that teams use to certify their workloads. ORRs include best practices from lessons learned from our years of building software. 

 An ORR checklist is composed of architectural recommendations, operational process, event management, and release quality. Our Correction of Error (CoE) process is a major driver of these items. Your own post-incident analysis should drive the evolution of your own ORR. An ORR is not only about following best practices but preventing the recurrence of events that you’ve seen before. Lastly, security, governance, and compliance requirements can also be included in an ORR. 

 Run ORRs before a workload launches to general availability and then throughout the software development lifecycle. Running the ORR before launch increases your ability to operate the workload safely. Periodically re-run your ORR on the workload to catch any drift from best practices. You can have ORR checklists for new services launches and ORRs for periodic reviews. This helps keep you up to date on new best practices that arise and incorporate lessons learned from post-incident analysis. As your use of the cloud matures, you can build ORR requirements into your architecture as defaults. 

 **Desired outcome:**  You have an ORR checklist with best practices for your organization. ORRs are conducted before workloads launch. ORRs are run periodically over the course of the workload lifecycle. 

 **Common anti-patterns:** 
+ You launch a workload without knowing if you can operate it. 
+ Governance and security requirements are not included in certifying a workload for launch. 
+ Workloads are not re-evaluated periodically. 
+ Workloads launch without required procedures in place. 
+ You see repetition of the same root cause failures in multiple workloads. 

 **Benefits of establishing this best practice:** 
+  Your workloads include architecture, process, and management best practices. 
+  Lessons learned are incorporated into your ORR process. 
+  Required procedures are in place when workloads launch. 
+  ORRs are run throughout the software lifecycle of your workloads. 

 **Level of risk if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 An ORR is two things: a process and a checklist. Your ORR process should be adopted by your organization and supported by an executive sponsor. At a minimum, ORRs must be conducted before a workload launches to general availability. Run the ORR throughout the software development lifecycle to keep it up to date with best practices or new requirements. The ORR checklist should include configuration items, security and governance requirements, and best practices from your organization. Over time, you can use services, such as [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html), [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html), and [AWS Control Tower Guardrails](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html), to build best practices from the ORR into guardrails for automatic detection of best practices. 

 **Customer example** 

 After several production incidents, AnyCompany Retail decided to implement an ORR process. They built a checklist composed of best practices, governance and compliance requirements, and lessons learned from outages. New workloads conduct ORRs before they launch. Every workload conducts a yearly ORR with a subset of best practices to incorporate new best practices and requirements that are added to the ORR checklist. Over time, AnyCompany Retail used [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to detect some best practices, speeding up the ORR process. 

 **Implementation steps** 

 To learn more about ORRs, read the [Operational Readiness Reviews (ORR) whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html). It provides detailed information on the history of the ORR process, how to build your own ORR practice, and how to develop your ORR checklist. The following steps are an abbreviated version of that document. For an in-depth understanding of what ORRs are and how to build your own, we recommend reading that whitepaper. 

1. Gather the key stakeholders together, including representatives from security, operations, and development. 

1. Have each stakeholder provide at least one requirement. For the first iteration, try to limit the number of items to thirty or less. 
   +  [Appendix B: Example ORR questions](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/appendix-b-example-orr-questions.html) from the Operational Readiness Reviews (ORR) whitepaper contains sample questions that you can use to get started. 

1. Collect your requirements into a spreadsheet. 
   + You can use [custom lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) in the [AWS Well-Architected Tool](https://console.aws.amazon.com/wellarchiected/) to develop your ORR and share them across your accounts and AWS Organization. 

1. Identify one workload to conduct the ORR on. A pre-launch workload or an internal workload is ideal. 

1. Run through the ORR checklist and take note of any discoveries made. Discoveries might not be ok if a mitigation is in place. For any discovery that lacks a mitigation, add those to your backlog of items and implement them before launch. 

1. Continue to add best practices and requirements to your ORR checklist over time. 

 Support customers with Enterprise Support can request the [Operational Readiness Review Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. The workshop is an interactive *working backwards* session to develop your own ORR checklist. 

 **Level of effort for the implementation plan:** High. Adopting an ORR practice in your organization requires executive sponsorship and stakeholder buy-in. Build and update the checklist with inputs from across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) – Governance requirements are a natural fit for an ORR checklist. 
+ [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) – Compliance requirements are sometimes included in an ORR checklist. Other times they are a separate process. 
+ [OPS03-BP07 Resource teams appropriately](ops_org_culture_team_res_appro.md) – Team capability is a good candidate for an ORR requirement. 
+ [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) – A rollback or rollforward plan must be established before you launch your workload. 
+ [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) – To support a workload you must have the required personnel. 
+ [SEC01-BP03 Identify and validate control objectives](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_securely_operate_control_objectives.html) – Security control objectives make excellent ORR requirements. 
+ [REL13-BP01 Define recovery objectives for downtime and data loss](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_planning_for_recovery_objective_defined_recovery.html) – Disaster recovery plans are a good ORR requirement. 
+ [COST02-BP01 Develop policies based on your organization requirements](https://docs.aws.amazon.com/wellarchitected/latest/framework/cost_govern_usage_policies.html) – Cost management policies are good to include in your ORR checklist. 

 **Related documents:** 
+  [AWS Control Tower - Guardrails in AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html) 
+  [AWS Well-Architected Tool - Custom Lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) 
+  [Operational Readiness Review Template by Adrian Hornsby](https://medium.com/the-cloud-architect/operational-readiness-review-template-e23a4bfd8d79) 
+  [Operational Readiness Reviews (ORR) Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html) 

 **Related videos:** 
+  [AWS Supports You \$1 Building an Effective Operational Readiness Review (ORR)](https://www.youtube.com/watch?v=Keo6zWMQqS8) 

 **Related examples:** 
+  [Sample Operational Readiness Review (ORR) Lens](https://github.com/aws-samples/custom-lens-wa-sample/tree/main/ORR-Lens) 

 **Related services:** 
+  [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) 
+  [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html) 
+  [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html) 
+  [AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html) 

# OPS07-BP03 Use runbooks to perform procedures
<a name="ops_ready_to_support_use_runbooks"></a>

 A *runbook* is a documented process to achieve a specific outcome. Runbooks consist of a series of steps that someone follows to get something done. Runbooks have been used in operations going back to the early days of aviation. In cloud operations, we use runbooks to reduce risk and achieve desired outcomes. At its simplest, a runbook is a checklist to complete a task. 

 Runbooks are an essential part of operating your workload. From onboarding a new team member to deploying a major release, runbooks are the codified processes that provide consistent outcomes no matter who uses them. Runbooks should be published in a central location and updated as the process evolves, as updating runbooks is a key component of a change management process. They should also include guidance on error handling, tools, permissions, exceptions, and escalations in case a problem occurs. 

 As your organization matures, begin automating runbooks. Start with runbooks that are short and frequently used. Use scripting languages to automate steps or make steps easier to perform. As you automate the first few runbooks, you’ll dedicate time to automating more complex runbooks. Over time, most of your runbooks should be automated in some way. 

 **Desired outcome:** Your team has a collection of step-by-step guides for performing workload tasks. The runbooks contain the desired outcome, necessary tools and permissions, and instructions for error handling. They are stored in a central location and updated frequently. 

 **Common anti-patterns:** 
+  Relying on memory to complete each step of a process. 
+  Manually deploying changes without a checklist. 
+  Different team members performing the same process but with different steps or outcomes. 
+  Letting runbooks drift out of sync with system changes and automation. 

 **Benefits of establishing this best practice:** 
+  Reducing error rates for manual tasks. 
+  Operations are performed in a consistent manner. 
+  New team members can start performing tasks sooner. 
+  Runbooks can be automated to reduce toil. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Runbooks can take several forms depending on the maturity level of your organization. At a minimum, they should consist of a step-by-step text document. The desired outcome should be clearly indicated. Clearly document necessary special permissions or tools. Provide detailed guidance on error handling and escalations in case something goes wrong. List the runbook owner and publish it in a central location. Once your runbook is documented, validate it by having someone else on your team run it. As procedures evolve, update your runbooks in accordance with your change management process. 

 Your text runbooks should be automated as your organization matures. Using services like [AWS Systems Manager automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), you can transform flat text into automations that can be run against your workload. These automations can be run in response to events, reducing the operational burden to maintain your workload. 

 **Customer example** 

 AnyCompany Retail must perform database schema updates during software deployments. The Cloud Operations Team worked with the Database Administration Team to build a runbook for manually deploying these changes. The runbook listed each step in the process in checklist form. It included a section on error handling in case something went wrong. They published the runbook on their internal wiki along with their other runbooks. The Cloud Operations Team plans to automate the runbook in a future sprint. 

## Implementation steps
<a name="implementation-steps"></a>

 If you don’t have an existing document repository, a version control repository is a great place to start building your runbook library. You can build your runbooks using Markdown. We have provided an example runbook template that you can use to start building runbooks. 

```
# Runbook Title
## Runbook Info
| Runbook ID | Description | Tools Used | Special Permissions | Runbook Author | Last Updated | Escalation POC | 
|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this runbook for? What is the desired outcome? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name |
## Steps
1. Step one
2. Step two
```

1.  If you don’t have an existing documentation repository or wiki, create a new version control repository in your version control system. 

1.  Identify a process that does not have a runbook. An ideal process is one that is conducted semiregularly, short in number of steps, and has low impact failures. 

1.  In your document repository, create a new draft Markdown document using the template. Fill in `Runbook Title` and the required fields under `Runbook Info`. 

1.  Starting with the first step, fill in the `Steps` portion of the runbook. 

1.  Give the runbook to a team member. Have them use the runbook to validate the steps. If something is missing or needs clarity, update the runbook. 

1.  Publish the runbook to your internal documentation store. Once published, tell your team and other stakeholders. 

1.  Over time, you’ll build a library of runbooks. As that library grows, start working to automate runbooks. 

 **Level of effort for the implementation plan:** Low. The minimum standard for a runbook is a step-by-step text guide. Automating runbooks can increase the implementation effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md): Runbooks should have an owner in charge of maintaining them. 
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Runbooks and playbooks are like each other with one key difference: a runbook has a desired outcome. In many cases runbooks are initiated once a playbook has identified a root cause. 
+  [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md): Runbooks are a part of a good event, incident, and problem management practice. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Runbooks and playbooks should be used to respond to alerts. Over time these reactions should be automated. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md): Maintaining runbooks is a key part of knowledge management. 

 **Related documents:** 
+ [Achieving Operational Excellence using automated playbook and runbook](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/) 
+ [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+ [Migration playbook for AWS large migrations - Task 4: Improving your migration runbooks](https://docs.aws.amazon.com/prescriptive-guidance/latest/large-migration-migration-playbook/task-four-migration-runbooks.html) 
+ [Use AWS Systems Manager Automation runbooks to resolve operational tasks](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/) 

 **Related videos:** 
+  [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response (SEC318-R1)](https://www.youtube.com/watch?v=E1NaYN_fJUo) 
+  [How to automate IT Operations on AWS \$1 Amazon Web Services](https://www.youtube.com/watch?v=GuWj_mlyTug) 
+  [Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE) 

 **Related examples:** 
+  [AWS Systems Manager: Automation walkthroughs](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html) 
+  [AWS Systems Manager: Restore a root volume from the latest snapshot runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-document-sample-restore.html)
+  [Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake](https://catalog.us-east-1.prod.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US) 
+  [Gitlab - Runbooks](https://gitlab.com/gitlab-com/runbooks) 
+  [Rubix - A Python library for building runbooks in Jupyter Notebooks](https://github.com/Nurtch/rubix) 
+  [Using Document Builder to create a custom runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html) 
+  [Well-Architected Labs: Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/) 

 **Related services:** 
+  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 

# OPS07-BP04 Use playbooks to investigate issues
<a name="ops_ready_to_support_use_playbooks"></a>

 Playbooks are step-by-step guides used to investigate an incident. When incidents happen, playbooks are used to investigate, scope impact, and identify a root cause. Playbooks are used for a variety of scenarios, from failed deployments to security incidents. In many cases, playbooks identify the root cause that a runbook is used to mitigate. Playbooks are an essential component of your organization's incident response plans. 

 A good playbook has several key features. It guides the user, step by step, through the process of discovery. Thinking outside-in, what steps should someone follow to diagnose an incident? Clearly define in the playbook if special tools or elevated permissions are needed in the playbook. Having a communication plan to update stakeholders on the status of the investigation is a key component. In situations where a root cause can’t be identified, the playbook should have an escalation plan. If the root cause is identified, the playbook should point to a runbook that describes how to resolve it. Playbooks should be stored centrally and regularly maintained. If playbooks are used for specific alerts, provide your team with pointers to the playbook within the alert. 

 As your organization matures, automate your playbooks. Start with playbooks that cover low-risk incidents. Use scripting to automate the discovery steps. Make sure that you have companion runbooks to mitigate common root causes. 

 **Desired outcome:** Your organization has playbooks for common incidents. The playbooks are stored in a central location and available to your team members. Playbooks are updated frequently. For any known root causes, companion runbooks are built. 

 **Common anti-patterns:** 
+  There is no standard way to investigate an incident. 
+  Team members rely on muscle memory or institutional knowledge to troubleshoot a failed deployment. 
+  New team members learn how to investigate issues through trial and error. 
+  Best practices for investigating issues are not shared across teams. 

 **Benefits of establishing this best practice:** 
+  Playbooks boost your efforts to mitigate incidents. 
+  Different team members can use the same playbook to identify a root cause in a consistent manner. 
+  Known root causes can have runbooks developed for them, speeding up recovery time. 
+  Playbooks help team members to start contributing sooner. 
+  Teams can scale their processes with repeatable playbooks. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 How you build and use playbooks depends on the maturity of your organization. If you are new to the cloud, build playbooks in text form in a central document repository. As your organization matures, playbooks can become semi-automated with scripting languages like Python. These scripts can be run inside a Jupyter notebook to speed up discovery. Advanced organizations have fully automated playbooks for common issues that are auto-remediated with runbooks. 

 Start building your playbooks by listing common incidents that happen to your workload. Choose playbooks for incidents that are low risk and where the root cause has been narrowed down to a few issues to start. After you have playbooks for simpler scenarios, move on to the higher risk scenarios or scenarios where the root cause is not well known. 

 Your text playbooks should be automated as your organization matures. Using services like [AWS Systems Manager Automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), flat text can be transformed into automations. These automations can be run against your workload to speed up investigations. These automations can be activated in response to events, reducing the mean time to discover and resolve incidents. 

 Customers can use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to respond to incidents. This service provides a single interface to triage incidents, inform stakeholders during discovery and mitigation, and collaborate throughout the incident. It uses AWS Systems Manager Automations to speed up detection and recovery. 

 **Customer example** 

 A production incident impacted AnyCompany Retail. The on-call engineer used a playbook to investigate the issue. As they progressed through the steps, they kept the key stakeholders, identified in the playbook, up to date. The engineer identified the root cause as a race condition in a backend service. Using a runbook, the engineer relaunched the service, bringing AnyCompany Retail back online. 

## Implementation steps
<a name="implementation-steps"></a>

 If you don’t have an existing document repository, we suggest creating a version control repository for your playbook library. You can build your playbooks using Markdown, which is compatible with most playbook automation systems. If you are starting from scratch, use the following example playbook template. 

```
# Playbook Title
## Playbook Info
| Playbook ID | Description | Tools Used | Special Permissions | Playbook Author | Last Updated | Escalation POC | Stakeholders | Communication Plan |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this playbook for? What incident is it used for? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name | Stakeholder Name | How will updates be communicated during the investigation? |
## Steps
1. Step one
2. Step two
```

1.  If you don’t have an existing document repository or wiki, create a new version control repository for your playbooks in your version control system. 

1.  Identify a common issue that requires investigation. This should be a scenario where the root cause is limited to a few issues and resolution is low risk. 

1.  Using the Markdown template, fill in the `Playbook Name` section and the fields under `Playbook Info`. 

1.  Fill in the troubleshooting steps. Be as clear as possible on what actions to perform or what areas you should investigate. 

1.  Give a team member the playbook and have them go through it to validate it. If there’s anything missing or something isn’t clear, update the playbook. 

1.  Publish your playbook in your document repository and inform your team and any stakeholders. 

1.  This playbook library will grow as you add more playbooks. Once you have several playbooks, start automating them using tools like AWS Systems Manager Automations to keep automation and playbooks in sync. 

 **Level of effort for the implementation plan:** Low. Your playbooks should be text documents stored in a central location. More mature organizations will move towards automating playbooks. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md): Playbooks should have an owner in charge of maintaining them. 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Runbooks and playbooks are similar, but with one key difference: a runbook has a desired outcome. In many cases, runbooks are used once a playbook has identified a root cause. 
+  [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md): Playbooks are a part of good event, incident, and problem management practice. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Runbooks and playbooks should be used to respond to alerts. Over time, these reactions should be automated. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md): Maintaining playbooks is a key part of knowledge management. 

 **Related documents:** 
+ [ Achieving Operational Excellence using automated playbook and runbook ](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/)
+  [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+ [ Use AWS Systems Manager Automation runbooks to resolve operational tasks ](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/)

 **Related videos:** 
+ [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response (SEC318-R1) ](https://www.youtube.com/watch?v=E1NaYN_fJUo)
+ [AWS Systems Manager Incident Manager - AWS Virtual Workshops ](https://www.youtube.com/watch?v=KNOc0DxuBSY)
+ [ Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE)

 **Related examples:** 
+ [AWS Customer Playbook Framework ](https://github.com/aws-samples/aws-customer-playbook-framework)
+ [AWS Systems Manager: Automation walkthroughs ](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html)
+ [ Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake ](https://catalog.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US)
+ [ Rubix – A Python library for building runbooks in Jupyter Notebooks ](https://github.com/Nurtch/rubix)
+ [ Using Document Builder to create a custom runbook ](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html)
+ [ Well-Architected Labs: Automating operations with Playbooks and Runbooks ](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/)
+ [ Well-Architected Labs: Incident response playbook with Jupyter ](https://www.wellarchitectedlabs.com/security/300_labs/300_incident_response_playbook_with_jupyter-aws_iam/)

 **Related services:** 
+ [AWS Systems Manager Automation ](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html)
+ [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html)

# OPS07-BP05 Make informed decisions to deploy systems and changes
<a name="ops_ready_to_support_informed_deploy_decisions"></a>

Have processes in place for successful and unsuccessful changes to your workload. A pre-mortem is an exercise where a team simulates a failure to develop mitigation strategies. Use pre-mortems to anticipate failure and create procedures where appropriate. Evaluate the benefits and risks of deploying changes to your workload. Verify that all changes comply with governance. 

 **Desired outcome:** 
+  You make informed decisions when deploying changes to your workload. 
+  Changes comply with governance. 

 **Common anti-patterns:** 
+ Deploying a change to our workload without a process to handle a failed deployment.
+ Making changes to your production environment that are out of compliance with governance requirements.
+ Deploying a new version of your workload without establishing a baseline for resource utilization.

 **Benefits of establishing this best practice:** 
+  You are prepared for unsuccessful changes to your workload. 
+  Changes to your workload are compliant with governance policies. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use pre-mortems to develop processes for unsuccessful changes. Document your processes for unsuccessful changes. Ensure that all changes comply with governance. Evaluate the benefits and risks to deploying changes to your workload. 

 **Customer example** 

 AnyCompany Retail regularly conducts pre-mortems to validate their processes for unsuccessful changes. They document their processes in a shared Wiki and update it frequently. All changes comply with governance requirements. 

 **Implementation steps** 

1.  Make informed decisions when deploying changes to your workload. Establish and review criteria for a successful deployment. Develop scenarios or criteria that would initiate a rollback of a change. Weigh the benefits of deploying changes against the risks of an unsuccessful change. 

1.  Verify that all changes comply with governance policies. 

1.  Use pre-mortems to plan for unsuccessful changes and document mitigation strategies. Run a table-top exercise to model an unsuccessful change and validate roll-back procedures. 

 **Level of effort for the implementation plan:** Moderate. Implementing a practice of pre-mortems requires coordination and effort from stakeholders across your organization 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements are a key factor in determining whether to deploy a change. 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) - Establish plans to mitigate a failed deployment and use pre-mortems to validate them. 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) - Every software change should be properly tested before deployment in order to reduce defects in production. 
+  [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) - Having enough trained personnel to support the workload is essential to making an informed decision to deploy a system change. 

 **Related documents:** 
+ [ Amazon Web Services: Risk and Compliance ](https://docs.aws.amazon.com/whitepapers/latest/aws-risk-and-compliance/welcome.html)
+ [AWS Shared Responsibility Model ](https://aws.amazon.com/compliance/shared-responsibility-model/)
+ [ Governance in the AWS Cloud: The Right Balance Between Agility and Safety ](https://aws.amazon.com/blogs/apn/governance-in-the-aws-cloud-the-right-balance-between-agility-and-safety/)

# OPS07-BP06 Create support plans for production workloads
<a name="ops_ready_to_support_enable_support_plans"></a>

 Enable support for any software and services that your production workload relies on. Select an appropriate support level to meet your production service-level needs. Support plans for these dependencies are necessary in case there is a service disruption or software issue. Document support plans and how to request support for all service and software vendors. Implement mechanisms that verify that support points of contacts are kept up to date. 

 **Desired outcome:** 
+  Implement support plans for software and services that production workloads rely on. 
+  Choose an appropriate support plan based on service-level needs. 
+  Document the support plans, support levels, and how to request support. 

 **Common anti-patterns:** 
+  You have no support plan for a critical software vendor. Your workload is impacted by them and you can do nothing to expedite a fix or get timely updates from the vendor. 
+  A developer that was the primary point of contact for a software vendor left the company. You are not able to reach the vendor support directly. You must spend time researching and navigating generic contact systems, increasing the time required to respond when needed. 
+  A production outage occurs with a software vendor. There is no documentation on how to file a support case. 

 **Benefits of establishing this best practice:** 
+  With the appropriate support level, you are able to get a response in the time frame necessary to meet service-level needs. 
+  As a supported customer you can escalate if there are production issues. 
+  Software and services vendors can assist in troubleshooting during an incident. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Enable support plans for any software and services vendors that your production workload relies on. Set up appropriate support plans to meet service-level needs. For AWS customers, this means activating AWS Business Support or greater on any accounts where you have production workloads. Meet with support vendors on a regular cadence to get updates about support offerings, processes, and contacts. Document how to request support from software and services vendors, including how to escalate if there is an outage. Implement mechanisms to keep support contacts up to date. 

 **Customer example** 

 At AnyCompany Retail, all commercial software and services dependencies have support plans. For example, they have AWS Enterprise Support activated on all accounts with production workloads. Any developer can raise a support case when there is an issue. There is a wiki page with information on how to request support, whom to notify, and best practices for expediting a case. 

 **Implementation steps** 

1.  Work with stakeholders in your organization to identify software and services vendors that your workload relies on. Document these dependencies. 

1.  Determine service-level needs for your workload. Select a support plan that aligns with them. 

1.  For commercial software and services, establish a support plan with the vendors. 

   1.  Subscribing to AWS Business Support or greater for all production accounts provides faster response time from AWS Support and strongly recommended. If you don’t have premium support, you must have an action plan to handle issues, which require help from AWS Support. AWS Support provides a mix of tools and technology, people, and programs designed to proactively help you optimize performance, lower costs, and innovate faster. AWS Business Support provides additional benefits, including access to AWS Trusted Advisor and AWS Personal Health Dashboard and faster response times. 

1.  Document the support plan in your knowledge management tool. Include how to request support, who to notify if a support case is filed, and how to escalate during an incident. A wiki is a good mechanism to allow anyone to make necessary updates to documentation when they become aware of changes to support processes or contacts. 

 **Level of effort for the implementation plan:** Low. Most software and services vendors offer opt-in support plans. Documenting and sharing support best practices on your knowledge management system verifies that your team knows what to do when there is a production issue. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) 

 **Related documents:** 
+ [AWS Support Plans ](https://docs.aws.amazon.com/awssupport/latest/user/aws-support-plans.html)

 **Related services:** 
+ [AWS Business Support ](https://aws.amazon.com/premiumsupport/plans/business/)
+ [AWS Enterprise Support ](https://aws.amazon.com/premiumsupport/plans/enterprise/)

# Operate
<a name="a-operate"></a>

**Topics**
+ [OPS 8. How do you utilize workload observability in your organization?](ops-08.md)
+ [OPS 9. How do you understand the health of your operations?](ops-09.md)
+ [OPS 10. How do you manage workload and operations events?](ops-10.md)

# OPS 8. How do you utilize workload observability in your organization?
<a name="ops-08"></a>

Ensure optimal workload health by leveraging observability. Utilize relevant metrics, logs, and traces to gain a comprehensive view of your workload's performance and address issues efficiently.

**Topics**
+ [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md)
+ [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md)
+ [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md)
+ [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md)
+ [OPS08-BP05 Create dashboards](ops_workload_observability_create_dashboards.md)

# OPS08-BP01 Analyze workload metrics
<a name="ops_workload_observability_analyze_workload_metrics"></a>

 After implementing application telemetry, regularly analyze the collected metrics. While latency, requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to prioritize the review of business outcome metrics. This ensures you're making data-driven decisions aligned with your business objectives. 

 **Desired outcome:** Accurate insights into workload performance that drive data-informed decisions, ensuring alignment with business objectives. 

 **Common anti-patterns:** 
+  Analyzing metrics in isolation without considering their impact on business outcomes. 
+  Over-reliance on technical metrics while sidelining business metrics. 
+  Infrequent review of metrics, missing out on real-time decision-making opportunities. 

 **Benefits of establishing this best practice:** 
+  Enhanced understanding of the correlation between technical performance and business outcomes. 
+  Improved decision-making process informed by real-time data. 
+  Proactive identification and mitigation of issues before they affect business outcomes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Leverage tools like Amazon CloudWatch to perform metric analysis. AWS services such as CloudWatch anomaly detection and Amazon DevOps Guru can be used to detect anomalies, especially when static thresholds are unknown or when patterns of behavior are more suited for anomaly detection. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Analyze and review:** Regularly review and interpret your workload metrics. 

   1.  Prioritize business outcome metrics over purely technical metrics. 

   1.  Understand the significance of spikes, drops, or patterns in your data. 

1.  **Utilize Amazon CloudWatch:** Use Amazon CloudWatch for a centralized view and deep-dive analysis. 

   1.  Configure CloudWatch dashboards to visualize your metrics and compare them over time. 

   1.  Use [percentiles in CloudWatch](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/) to get a clear view of metric distribution, which can help in defining SLAs and understanding outliers. 

   1.  Set up [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to identify unusual patterns without relying on static thresholds. 

   1.  Implement [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html) to monitor and troubleshoot applications that span multiple accounts within a Region. 

   1.  Use [CloudWatch Metric Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html) to query and analyze metric data across accounts and Regions, identifying trends and anomalies. 

   1.  Apply [CloudWatch Metric Math](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) to transform, aggregate, or perform calculations on your metrics for deeper insights. 

1.  **Employ Amazon DevOps Guru:** Incorporate [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning-enhanced anomaly detection to identify early signs of operational issues for your serverless applications and remediate them before they impact your customers. 

1.  **Optimize based on insights:** Make informed decisions based on your metric analysis to adjust and improve your workloads. 

 **Level of effort for the Implementation Plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 

 **Related documents:** 
+ [ The Wheel Blog - Emphasizing the importance of continually reviewing metrics ](https://aws.amazon.com/blogs/opensource/the-wheel/)
+ [ Percentile are important ](https://aws-observability.github.io/observability-best-practices/guides/operational/business/sla-percentile/)
+ [ Using AWS Cost Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html)
+ [ CloudWatch cross-account observability ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html)
+ [ Query your metrics with CloudWatch Metrics Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html)

 **Related videos:** 
+ [ Enable Cross-Account Observability in Amazon CloudWatch ](https://www.youtube.com/watch?v=lUaDO9dqISc)
+ [ Introduction to Amazon DevOps Guru ](https://www.youtube.com/watch?v=2uA8q-8mTZY)
+ [ Continuously Analyze Metrics using AWS Cost Anomaly Detection](https://www.youtube.com/watch?v=IpQYBuay5OE)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Gaining operation insights with AIOps using Amazon DevOps Guru ](https://catalog.us-east-1.prod.workshops.aws/workshops/f92df379-6add-4101-8b4b-38b788e1222b/en-US)

# OPS08-BP02 Analyze workload logs
<a name="ops_workload_observability_analyze_workload_logs"></a>

 Regularly analyzing workload logs is essential for gaining a deeper understanding of the operational aspects of your application. By efficiently sifting through, visualizing, and interpreting log data, you can continually optimize application performance and security. 

 **Desired outcome:** Rich insights into application behavior and operations derived from thorough log analysis, ensuring proactive issue detection and mitigation. 

 **Common anti-patterns:** 
+ Neglecting the analysis of logs until a critical issue arises.
+ Not using the full suite of tools available for log analysis, missing out on critical insights.
+  Solely relying on manual review of logs without leveraging automation and querying capabilities. 

 **Benefits of establishing this best practice:** 
+ Proactive identification of operational bottlenecks, security threats, and other potential issues.
+ Efficient utilization of log data for continuous application optimization.
+  Enhanced understanding of application behavior, aiding in debugging and troubleshooting. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) is a powerful tool for log analysis. Integrated features like CloudWatch Logs Insights and Contributor Insights make the process of deriving meaningful information from logs intuitive and efficient. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Set up CloudWatch Logs:** Configure applications and services to send logs to CloudWatch Logs. 

1.  **Set up CloudWatch Logs Insights:** Use [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) to interactively search and analyze your log data. 

   1.  Craft queries to extract patterns, visualize log data, and derive actionable insights. 

1.  **Leverage Contributor Insights:** Use [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html) to identify top talkers in high cardinality dimensions like IP addresses or user-agents. 

1.  **Implement CloudWatch Logs metric filters:** Configure [CloudWatch log metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) to convert log data into actionable metrics. This allows you to set alarms or further analyze patterns. 

1.  **Regular review and refinement:** Periodically review your log analysis strategies to capture all relevant information and continually optimize application performance. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 

 **Related documents:** 
+ [ Analyzing Log Data with CloudWatch Logs Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html)
+ [ Using CloudWatch Contributor Insights ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html)
+ [ Creating and Managing CloudWatch Logs Log Metric Filters ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html)

 **Related videos:** 
+ [ Analyze Log Data with CloudWatch Logs Insights ](https://www.youtube.com/watch?v=2s2xcwm8QrM)
+ [ Use CloudWatch Contributor Insights to Analyze High-Cardinality Data ](https://www.youtube.com/watch?v=ErWRBLFkjGI)

 **Related examples:** 
+ [ CloudWatch Logs Sample Queries ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax-examples.html)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

# OPS08-BP03 Analyze workload traces
<a name="ops_workload_observability_analyze_workload_traces"></a>

 Analyzing trace data is crucial for achieving a comprehensive view of an application's operational journey. By visualizing and understanding the interactions between various components, performance can be fine-tuned, bottlenecks identified, and user experiences enhanced. 

 **Desired outcome:** Achieve clear visibility into your application's distributed operations, enabling quicker issue resolution and an enhanced user experience. 

 **Common anti-patterns:** 
+  Overlooking trace data, relying solely on logs and metrics. 
+  Not correlating trace data with associated logs. 
+  Ignoring the metrics derived from traces, such as latency and fault rates. 

 **Benefits of establishing this best practice:** 
+  Improve troubleshooting and reduce mean time to resolution (MTTR). 
+  Gain insights into dependencies and their impact. 
+  Swift identification and rectification of performance issues. 
+  Leveraging trace-derived metrics for informed decision-making. 
+  Improved user experiences through optimized component interactions. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) offers a comprehensive suite for trace data analysis, providing a holistic view of service interactions, monitoring user activities, and detecting performance issues. Features like ServiceLens, X-Ray Insights, X-Ray Analytics, and Amazon DevOps Guru enhance the depth of actionable insights derived from trace data. 

### Implementation steps
<a name="implementation-steps"></a>

 The following steps offer a structured approach to effectively implementing trace data analysis using AWS services: 

1.  **Integrate AWS X-Ray:** Ensure X-Ray is integrated with your applications to capture trace data. 

1.  **Analyze X-Ray metrics:** Delve into metrics derived from X-Ray traces such as latency, request rates, fault rates, and response time distributions using the [service map](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-servicemap.html#xray-console-servicemap-view) to monitor application health. 

1.  **Use ServiceLens:** Leverage the [ServiceLens map](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_service_map.html) for enhanced observability of your services and applications. This allows for integrated viewing of traces, metrics, logs, alarms, and other health information. 

1.  **Enable X-Ray Insights:** 

   1.  Turn on [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) for automated anomaly detection in traces. 

   1.  Examine insights to pinpoint patterns and ascertain root causes, such as increased fault rates or latencies. 

   1.  Consult the insights timeline for a chronological analysis of detected issues. 

1.  **Use X-Ray Analytics:** [X-Ray Analytics](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html) allows you to thoroughly explore trace data, pinpoint patterns, and extract insights. 

1.  **Use groups in X-Ray:** Create groups in X-Ray to filter traces based on criteria such as high latency, allowing for more targeted analysis. 

1.  **Incorporate Amazon DevOps Guru:** Engage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) to benefit from machine learning models pinpointing operational anomalies in traces. 

1.  **Use CloudWatch Synthetics:** Use [CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_tracing.html) to create canaries for continually monitoring your endpoints and workflows. These canaries can integrate with X-Ray to provide trace data for in-depth analysis of the applications being tested. 

1.  **Use Real User Monitoring (RUM):** With [AWS X-Ray and CloudWatch RUM](https://docs.aws.amazon.com/xray/latest/devguide/xray-services-RUM.html), you can analyze and debug the request path starting from end users of your application through downstream AWS managed services. This helps you identify latency trends and errors that impact your users. 

1.  **Correlate with logs:** Correlate [trace data with related logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/servicelens_troubleshooting.html#servicelens_troubleshooting_Nologs) within the X-Ray trace view for a granular perspective on application behavior. This allows you to view log events directly associated with traced transactions. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 

 **Related documents:** 
+ [ Using ServiceLens to Monitor Application Health ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html)
+ [ Exploring Trace Data with X-Ray Analytics ](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-analytics.html)
+ [ Detecting Anomalies in Traces with X-Ray Insights ](https://docs.aws.amazon.com/xray/latest/devguide/xray-insights.html)
+ [ Continuous Monitoring with CloudWatch Synthetics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)

 **Related videos:** 
+ [ Analyze and Debug Applications Using Amazon CloudWatch Synthetics and AWS X-Ray](https://www.youtube.com/watch?v=s2WvaV2eDO4)
+ [ Use AWS X-Ray Insights ](https://www.youtube.com/watch?v=tl8OWHl6jxw)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Implementing X-Ray with AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html)
+ [ CloudWatch Synthetics Canary Templates ](https://github.com/aws-samples/cloudwatch-synthetics-canary-terraform)

# OPS08-BP04 Create actionable alerts
<a name="ops_workload_observability_create_alerts"></a>

 Promptly detecting and responding to deviations in your application's behavior is crucial. Especially vital is recognizing when outcomes based on key performance indicators (KPIs) are at risk or when unexpected anomalies arise. Basing alerts on KPIs ensures that the signals you receive are directly tied to business or operational impact. This approach to actionable alerts promotes proactive responses and helps maintain system performance and reliability. 

 **Desired outcome:** Receive timely, relevant, and actionable alerts for rapid identification and mitigation of potential issues, especially when KPI outcomes are at risk. 

 **Common anti-patterns:** 
+  Setting up too many non-critical alerts, leading to alert fatigue. 
+  Not prioritizing alerts based on KPIs, making it hard to understand the business impact of issues. 
+  Neglecting to address root causes, leading to repetitive alerts for the same issue. 

 **Benefits of establishing this best practice:** 
+  Reduced alert fatigue by focusing on actionable and relevant alerts. 
+  Improved system uptime and reliability through proactive issue detection and mitigation. 
+  Enhanced team collaboration and quicker issue resolution by integrating with popular alerting and communication tools. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To create an effective alerting mechanism, it's vital to use metrics, logs, and trace data that flag when outcomes based on KPIs are at risk or anomalies are detected. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Determine key performance indicators (KPIs):** Identify your application's KPIs. Alerts should be tied to these KPIs to reflect the business impact accurately. 

1.  **Implement anomaly detection:** 
   +  **Use CloudWatch anomaly detection:** Set up [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to automatically detect unusual patterns, ensuring alerts are only generated for genuine anomalies. 
   +  **Use X-Ray Insights:** 

     1.  Set up [X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html) to detect anomalies in trace data. 

     1.  Configure [notifications for X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications) to be alerted on detected issues. 
   +  **Integrate with DevOps Guru:** 

     1.  Leverage [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) for its machine learning capabilities in detecting operational anomalies with existing data. 

     1.  Navigate to the [notification settings](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html#navigate-to-notification-settings) in DevOps Guru to set up anomaly alerts. 

1.  **Implement actionable alerts:** Design alerts that provide adequate information for immediate action. 

1.  **Reduce alarm fatigue:** Minimize non-critical alerts. Overwhelming teams with numerous insignificant alerts can lead to oversight of critical issues and diminish the overall effectiveness of the alerting mechanism. 

1.  **Set up composite alarms:** Use [Amazon CloudWatch composite alarms](https://aws.amazon.com/blogs/mt/improve-monitoring-efficiency-using-amazon-cloudwatch-composite-alarms-2/) to consolidate multiple alarms. 

1.  **Integrate with alerting tools:** Incorporate tools like [Ops Genie](https://www.atlassian.com/software/opsgenie) and [PagerDuty](https://www.pagerduty.com/). 

1.  **Engage Amazon Q Developer in chat applications** Integrate [Amazon Q Developer in chat applications](https://aws.amazon.com/chatbot/)to relay alerts to Chime, Microsoft Teams, and Slack. 

1.  **Alert based on logs:** Use [log metric filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) in CloudWatch to create alarms based on specific log events. 

1.  **Review and iterate:** Regularly revisit and refine alert configurations. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 

 **Related documents:** 
+ [ Using Amazon CloudWatch Alarms ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
+ [ Create a composite alarm ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Composite_Alarm.html)
+ [ Create a CloudWatch alarm based on anomaly detection ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create_Anomaly_Detection_Alarm.html)
+ [ DevOps Guru Notifications ](https://docs.aws.amazon.com/devops-guru/latest/userguide/update-notifications.html)
+ [ X-Ray Insights notifications ](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html#xray-console-insight-notifications)
+ [ OMonitor, operate, and troubleshoot your AWS resources with interactive ChatOps ](https://aws.amazon.com/chatbot/)
+ [ Amazon CloudWatch Integration Guide \$1 PagerDuty ](https://support.pagerduty.com/docs/amazon-cloudwatch-integration-guide)
+ [ Integrate OpsGenie with Amazon CloudWatch ](https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-amazon-cloudwatch/)

 **Related videos:** 
+ [ Create Composite Alarms in Amazon CloudWatch ](https://www.youtube.com/watch?v=0LMQ-Mu-ZCY)
+ [ Amazon Q Developer in chat applications Overview ](https://www.youtube.com/watch?v=0jUSEfHbTYk)
+ [AWS on Air ft. Mutative Commands in Amazon Q Developer in chat applications ](https://www.youtube.com/watch?v=u2pkw2vxrtk)

 **Related examples:** 
+ [ Alarms, incident management, and remediation in the cloud with Amazon CloudWatch ](https://aws.amazon.com/blogs/mt/alarms-incident-management-and-remediation-in-the-cloud-with-amazon-cloudwatch/)
+ [ Tutorial: Creating an Amazon EventBridge rule that sends notifications to Amazon Q Developer in chat applications ](https://docs.aws.amazon.com/chatbot/latest/adminguide/create-eventbridge-rule.html)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

# OPS08-BP05 Create dashboards
<a name="ops_workload_observability_create_dashboards"></a>

 Dashboards are the human-centric view into the telemetry data of your workloads. While they provide a vital visual interface, they should not replace alerting mechanisms, but complement them. When crafted with care, not only can they offer rapid insights into system health and performance, but they can also present stakeholders with real-time information on business outcomes and the impact of issues. 

 **Desired outcome:** Clear, actionable insights into system and business health using visual representations. 

 **Common anti-patterns:** 
+  Overcomplicating dashboards with too many metrics. 
+  Relying on dashboards without alerts for anomaly detection. 
+  Not updating dashboards as workloads evolve. 

 **Benefits of establishing this best practice:** 
+  Immediate visibility into critical system metrics and KPIs. 
+  Enhanced stakeholder communication and understanding. 
+  Rapid insight into the impact of operational issues. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 **Business-centric dashboards** 

 Dashboards tailored to business KPIs engage a wider array of stakeholders. While these individuals might not be interested in system metrics, they are keen on understanding the business implications of these numbers. A business-centric dashboard ensures that all technical and operational metrics being monitored and analyzed are in sync with overarching business goals. This alignment provides clarity, ensuring everyone is on the same page regarding what's essential and what's not. Additionally, dashboards that highlight business KPIs tend to be more actionable. Stakeholders can quickly understand the health of operations, areas that need attention, and the potential impact on business outcomes. 

 With this in mind, when creating your dashboards, ensure that there's a balance between technical metrics and business KPIs. Both are vital, but they cater to different audiences. Ideally, you should have dashboards that provide a holistic view of the system's health and performance while also emphasizing key business outcomes and their implications. 

 Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different AWS Regions and accounts. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Create a basic dashboard:** [Create a new dashboard in CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create_dashboard.html), giving it a descriptive name. 

1.  **Use Markdown widgets:** Before diving into metrics, use [Markdown widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_text_dashboard.html) to add textual context at the top of your dashboard. This should explain what the dashboard covers, the significance of the represented metrics, and can also contain links to other dashboards and troubleshooting tools. 

1.  **Create dashboard variables:** [Incorporate dashboard variables](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_dashboard_variables.html) where appropriate to allow for dynamic and flexible dashboard views. 

1.  **Create metrics widgets:** [Add metric widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-and-work-with-widgets.html) to visualize various metrics your application emits, tailoring these widgets to effectively represent system health and business outcomes. 

1.  **Log Insights queries:** Utilize [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_ExportQueryResults.html) to derive actionable metrics from your logs and display these insights on your dashboard. 

1.  **Set up alarms:** Integrate [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_remove_alarm_dashboard.html) into your dashboard for a quick view of any metrics breaching their thresholds. 

1.  **Use Contributor Insights:** Incorporate [CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights-ViewReports.html) to analyze high-cardinality fields and get a clearer understanding of your resource's top contributors. 

1.  **Design custom widgets:** For specific needs not met by standard widgets, consider creating [custom widgets](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/add_custom_widget_dashboard.html). These can pull from various data sources or represent data in unique ways. 

1.  **Iterate and refine:** As your application evolves, regularly revisit your dashboard to ensure its relevance. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS08-BP01 Analyze workload metrics](ops_workload_observability_analyze_workload_metrics.md) 
+  [OPS08-BP02 Analyze workload logs](ops_workload_observability_analyze_workload_logs.md) 
+  [OPS08-BP03 Analyze workload traces](ops_workload_observability_analyze_workload_traces.md) 
+  [OPS08-BP04 Create actionable alerts](ops_workload_observability_create_alerts.md) 

 **Related documents:** 
+ [ Building Dashboards for Operational Visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ Using Amazon CloudWatch Dashboards ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html)

 **Related videos:** 
+ [ Create Cross Account & Cross Region CloudWatch Dashboards ](https://www.youtube.com/watch?v=eIUZdaqColg)
+ [AWS re:Invent 2021 - Gain enterprise visibility with AWS Cloud operation dashboards ](https://www.youtube.com/watch?v=NfMpYiGwPGo)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Application Monitoring with Amazon CloudWatch ](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch/)

# OPS 9. How do you understand the health of your operations?
<a name="ops-09"></a>

 Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action. 

**Topics**
+ [OPS09-BP01 Measure operations goals and KPIs with metrics](ops_operations_health_measure_ops_goals_kpis.md)
+ [OPS09-BP02 Communicate status and trends to ensure visibility into operation](ops_operations_health_communicate_status_trends.md)
+ [OPS09-BP03 Review operations metrics and prioritize improvement](ops_operations_health_review_ops_metrics_prioritize_improvement.md)

# OPS09-BP01 Measure operations goals and KPIs with metrics
<a name="ops_operations_health_measure_ops_goals_kpis"></a>

 Obtain goals and KPIs that define operations success from your organization and determine that metrics reflect these. Set baselines as a point of reference and reevaluate regularly. Develop mechanisms to collect these metrics from teams for evaluation. 

 **Desired outcome:** 
+  The goals and KPIs for the organization's operations teams have been published and shared. 
+  Metrics that reflect these KPIs are established. Examples may include: 
  +  Ticket queue depth or average age of ticket 
  +  Ticket count grouped by type of issue 
  +  Time spent working issues with or without a standardized operating procedure (SOP) 
  +  Amount of time spent recovering from a failed code push 
  +  Call volume 

 **Common anti-patterns:** 
+  Deployment deadlines are missed because developers are pulled away to perform troubleshooting tasks. Development teams argue for more personnel, but cannot quantify how many they need because the time taken away cannot be measured. 
+  A Tier 1 desk was set up to handle user calls. Over time, more workloads were added, but no headcount was allocated to the Tier 1 desk. Customer satisfaction suffers as call times increase and issues go longer without resolution, but management sees no indicators of such, preventing any action. 
+  A problematic workload has been handed off to a separate operations team for upkeep. Unlike other workloads, this new one was not supplied with proper documentation and runbooks. As such, teams spend longer troubleshooting and addressing failures. However, there are no metrics documenting this, which makes accountability difficult. 

 **Benefits of establishing this best practice:** Where workload monitoring shows the state of our applications and services, monitoring operations teams provide owners gain insight into changes among the consumers of those workloads, such as shifting business needs. Measure the effectiveness of these teams and evaluate them against business goals by creating metrics that can reflect the state of operations. Metrics can highlight support issues or identify when drifts occur away from a service level target. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Schedule time with business leaders and stakeholders to determine the overall goals of the service. Determine what the tasks of various operations teams should be and what challenges they could be approached with. Using these, brainstorm key performance indicators (KPIs) that might reflect these operations goals. These might be customer satisfaction, time from feature conception to deployment, average issue resolution time, and others. 

 Working from KPIs, identify the metrics and sources of data that might reflect these goals best. Customer satisfaction may be an combination of various metrics such as call wait or response times, satisfaction scores, and types of issues raised. Deployment times may be the sum of time needed for testing and deployment, plus any post-deployment fixes that needed to be added. Statistics showing the time spent on different types of issues (or the counts of those issues) can provide a window into where targeted effort is needed. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Quick - Using KPIs ](https://docs.aws.amazon.com/quicksight/latest/user/kpi.html)
+ [ Amazon CloudWatch - Using Metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)
+ [ Building Dashboards ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)
+ [ How to track your cost optimization KPIs with KPI Dashboard ](https://aws.amazon.com/blogs/aws-cloud-financial-management/how-to-track-your-cost-optimization-kpis-with-the-kpi-dashboard/)

# OPS09-BP02 Communicate status and trends to ensure visibility into operation
<a name="ops_operations_health_communicate_status_trends"></a>

 Knowing the state of your operations and its trending direction is necessary to identify when outcomes may be at risk, whether or not added work can be supported, or the effects that changes have had to your teams. During operations events, having status pages that users and operations teams can refer to for information can reduce pressure on communication channels and disseminate information proactively. 

 **Desired outcome:** 
+  Operations leaders have insight at a glance to see what sort of call volumes their teams are operating under and what efforts may be under way, such as deployments. 
+  Alerts are disseminated to stakeholders and user communities when impacts to normal operations occur. 
+  Organization leadership and stakeholders can check a status page in response to an alert or impact, and obtain information surrounding an operational event, such as points of contact, ticket information, and estimated recovery times. 
+  Reports are made available to leadership and other stakeholders to show operations statistics such as call volumes over a period of time, user satisfaction scores, numbers of outstanding tickets and their ages. 

 **Common anti-patterns:** 
+  A workload goes down, leaving a service unavailable. Call volumes spike as users request to know what's going on. Managers add to the volume requesting to know who's working an issue. Various operations teams duplicate efforts in trying to investigate. 
+  A desire for a new capability leads to several personnel being reassigned to an engineering effort. No backfill is provided, and issue resolution times spike. This information is not captured, and only after several weeks and dissatisfied user feedback does leadership become aware of the issue. 

 **Benefits of establishing this best practice:** During operational events where the business is impacted, much time and energy can be wasted querying information from various teams attempting to understand the situation. By establishing widely-disseminated status pages and dashboards, stakeholders can quickly obtain information such as whether or not an issue was detected, who has lead on the issue, or when a return to normal operations may be expected. This frees team members from spending too much time communicating status to others and more time addressing issues. 

 In addition, dashboards and reports can provide insights to decision-makers and stakeholders to see how operations teams are able to respond to business needs and how their resources are being allocated. This is crucial for determining if adequate resources are in place to support the business. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Build dashboards that show the current key metrics for your ops teams, and make them readily accessible both to operations leaders and management. 

 Build status pages that can be updated quickly to show when an incident or event is unfolding, who has ownership and who is coordinating the response. Share any steps or workarounds that users should consider on this page, and disseminate the location widely. Encourage users to check this location first when confronted with an unknown issue. 

 Collect and provide reports that show the health of operations over time, and distribute this to leaders and decision makers to illustrate the work of operations along with challenges and needs. 

 Share between teams these metrics and reports that best reflect goals and KPIs and where they have been influential in driving change. Dedicate time to these activities to elevate the importance of operations inside of and between teams. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Measure Progress ](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cloud-operating-model/measure-progress.html)
+ [ Building dashboards for operation visibility ](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/)

 **Related solutions:** 
+ [ Data Operations ](https://aws.amazon.com/solutions/app-development/data-operations)

# OPS09-BP03 Review operations metrics and prioritize improvement
<a name="ops_operations_health_review_ops_metrics_prioritize_improvement"></a>

 Setting aside dedicated time and resources for reviewing the state of operations ensures that serving the day-to-day line of business remains a priority. Pull together operations leaders and stakeholders to regularly review metrics, reaffirm or modify goals and objectives, and prioritize improvements. 

 **Desired outcome:** 
+  Operations leaders and staff regularly meet to review metrics over a given reporting period. Challenges are communicated, wins are celebrated, and lessons learned are shared. 
+  Stakeholders and business leaders are regularly briefed on the state of operations and solicited for input regarding goals, KPIs, and future initiatives. Tradeoffs between service delivery, operations, and maintenance are discussed and placed into context. 

 **Common anti-patterns:** 
+  A new product is launched, but the Tier 1 and Tier 2 operations teams are not adequately trained to support or given additional staff. Metrics that show the decrease in ticket resolution times and increase in incident volumes are not seen by leaders. Action is taken weeks later when subscription numbers start to fall as discontent users move off the platform. 
+  A manual process for performing maintenance on a workload has been in place for a long time. While a desire to automate has been present, this was a low priority given the low importance of the system. Over time however, the system has grown in importance and now these manual processes consume a majority of operations' time. No resources are scheduled for providing increased tooling to operations, leading to staff burnout as workloads increase. Leadership becomes aware once it's reported that staff are leaving for other competitors. 

 **Benefits of establishing this best practice:** In some organizations, it can become a challenge to allocate the same time and attention that is afforded to service delivery and new products or offerings. When this occurs, the line of business can suffer as the level of service expected slowly deteriorates. This is because operations does not change and evolve with the growing business, and can soon be left behind. Without regular review into the insights operations collects, the risk to the business may become visible only when it's too late. By allocating time to review metrics and procedures both among the operations staff and with leadership, the crucial role operations plays remains visible, and risks can be identified long before they reach critical levels. Operations teams get better insight into impending business changes and initiatives, allowing for proactive efforts to be undertaken. Leadership visibility into operations metrics showcases the role that these teams play in customer satisfaction, both internal and external, and let them better weigh choices for priorities, or ensure that operations has the time and resources to change and evolve with new business and workload initiatives. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Dedicate time to review operations metrics between stakeholders and operations teams and review report data. Place these reports in the contexts of the organizations goals and objectives to determine if they're being met. Identify sources of ambiguity where goals are not clear, or where there may be conflicts between what is asked for and what is given. 

 Identify where time, people, and tools can aid in operations outcomes. Determine which KPIs this would impact and what targets for success should be. Revisit regularly to ensure operations is resourced sufficiently to support the line of business. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Amazon Athena ](https://aws.amazon.com/athena/)
+ [ Amazon CloudWatch metrics and dimensions reference ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html)
+ [ Amazon Quick ](https://aws.amazon.com/quicksight/)
+ [AWS Glue](https://aws.amazon.com/glue/)
+ [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)
+ [ Collect metrics and logs from Amazon EC2 instances and on-premises servers with the Amazon CloudWatch Agent ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html)
+ [ Using Amazon CloudWatch metrics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html)

# OPS 10. How do you manage workload and operations events?
<a name="ops-10"></a>

 Prepare and validate procedures for responding to events to minimize their disruption to your workload. 

**Topics**
+ [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md)
+ [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md)
+ [OPS10-BP03 Prioritize operational events based on business impact](ops_event_response_prioritize_events.md)
+ [OPS10-BP04 Define escalation paths](ops_event_response_define_escalation_paths.md)
+ [OPS10-BP05 Define a customer communication plan for outages](ops_event_response_push_notify.md)
+ [OPS10-BP06 Communicate status through dashboards](ops_event_response_dashboards.md)
+ [OPS10-BP07 Automate responses to events](ops_event_response_auto_event_response.md)

# OPS10-BP01 Use a process for event, incident, and problem management
<a name="ops_event_response_event_incident_problem_process"></a>

Your organization has processes to handle events, incidents, and problems. *Events* are things that occur in your workload but may not need intervention. *Incidents* are events that require intervention. *Problems* are recurring events that require intervention or cannot be resolved. You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately.

When incidents and problems happen to your workload, you need processes to handle them. How will you communicate the status of the event with stakeholders? Who oversees leading the response? What are the tools that you use to mitigate the event? These are examples of some of the questions you need answer to have a solid response process. 

Processes must be documented in a central location and available to anyone involved in your workload. If you don’t have a central wiki or document store, a version control repository can be used. You’ll keep these plans up to date as your processes evolve. 

Problems are candidates for automation. These events take time away from your ability to innovate. Start with building a repeatable process to mitigate the problem. Over time, focus on automating the mitigation or fixing the underlying issue. This frees up time to devote to making improvements in your workload. 

**Desired outcome:** Your organization has a process to handle events, incidents, and problems. These processes are documented and stored in a central location. They are updated as processes change. 

**Common anti-patterns:** 
+  An incident happens on the weekend and the on-call engineer doesn’t know what to do. 
+  A customer sends you an email that the application is down. You reboot the server to fix it. This happens frequently. 
+  There is an incident with multiple teams working independently to try to solve it. 
+  Deployments happen in your workload without being recorded. 

 **Benefits of establishing this best practice:** 
+  You have an audit trail of events in your workload. 
+  Your time to recover from an incident is decreased. 
+  Team members can resolve incidents and problems in a consistent manner. 
+  There is a more consolidated effort when investigating an incident. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed. 

 **Customer example** 

AnyCompany Retail has a portion of their internal wiki devoted to processes for event, incident, and problem management. All events are sent to [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html). Problems are identified as OpsItems in [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) and prioritized to fix, reducing undifferentiated labor. As processes change, they’re updated in their internal wiki. They use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to manage incidents and coordinate mitigation efforts. 

## Implementation steps
<a name="implementation-steps"></a>

1.  Events 
   +  Track events that happen in your workload, even if no human intervention is required. 
   +  Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching. 
   +  You can use services like [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) or [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) to generate custom events for tracking. 

1.  Incidents 
   +  Start by defining the communication plan for incidents. What stakeholders must be informed? How will you keep them in the loop? Who oversees coordinating efforts? We recommend standing up an internal chat channel for communication and coordination. 
   +  Define escalation paths for the teams that support your workload, especially if the team doesn’t have an on-call rotation. Based on your support level, you can also file a case with Support. 
   +  Create a playbook to investigate the incident. This should include the communication plan and detailed investigation steps. Include checking the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) in your investigation. 
   +  Document your incident response plan. Communicate the incident management plan so internal and external customers understand the rules of engagement and what is expected of them. Train your team members on how to use it. 
   +  Customers can use [Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to set up and manage their incident response plan. 
   +  Enterprise Support customers can request the [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement. 

1.  Problems 
   +  Problems must be identified and tracked in your ITSM system. 
   +  Identify all known problems and prioritize them by effort to fix and impact to workload.   
![\[Action priority matrix for prioritizing problems.\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/impact-effort-chart.png)
   +  Solve problems that are high impact and low effort first. Once those are solved, move on to problems to that fall into the low impact low effort quadrant. 
   +  You can use [Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) to identify these problems, attach runbooks to them, and track them. 

**Level of effort for the implementation plan:** Medium. You need both a process and tools to implement this best practice. Document your processes and make them available to anyone associated with the workload. Update them frequently. You have a process for managing problems and mitigating them or fixing them. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Known problems need an associated runbook so that mitigation efforts are consistent.
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Incidents must be investigated using playbooks. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Always conduct a postmortem after you recover from an incident. 

 **Related documents:** 
+  [Atlassian - Incident management in the age of DevOps](https://www.atlassian.com/incident-management/devops) 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+  [Incident Management in the Age of DevOps and SRE](https://www.infoq.com/presentations/incident-management-devops-sre/) 
+  [PagerDuty - What is Incident Management?](https://www.pagerduty.com/resources/learn/what-is-incident-management/) 

 **Related videos:** 
+  [AWS re:Invent 2020: Incident management in a distributed organization](https://www.youtube.com/watch?v=tyS1YDhMVos) 
+  [AWS re:Invent 2021 - Building next-gen applications with event-driven architectures](https://www.youtube.com/watch?v=U5GZNt0iMZY) 
+  [AWS Supports You \$1 Exploring the Incident Management Tabletop Exercise](https://www.youtube.com/watch?v=0m8sGDx-pRM) 
+  [AWS Systems Manager Incident Manager - AWS Virtual Workshops](https://www.youtube.com/watch?v=KNOc0DxuBSY) 
+  [AWS What's Next ft. Incident Manager \$1 AWS Events](https://www.youtube.com/watch?v=uZL-z7cII3k) 

 **Related examples:** 
+  [AWS Management and Governance Tools Workshop - OpsCenter](https://mng.workshop.aws/ssm/capability_hands-on_labs/opscenter.html) 
+  [AWS Proactive Services – Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [Building an event-driven application with Amazon EventBridge](https://aws.amazon.com/blogs/compute/building-an-event-driven-application-with-amazon-eventbridge/) 
+  [Building event-driven architectures on AWS](https://catalog.us-east-1.prod.workshops.aws/workshops/63320e83-6abc-493d-83d8-f822584fb3cb/en-US/) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 
+  [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 

# OPS10-BP02 Have a process per alert
<a name="ops_event_response_process_per_alert"></a>

 Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert. This ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications. 

 **Common anti-patterns:** 
+  Your monitoring system presents you a stream of approved connections along with other messages. The volume of messages is so large that you miss periodic error messages that require your intervention. 
+  You receive an alert that the website is down. There is no defined process for when this happens. You are forced to take an ad hoc approach to diagnose and resolve the issue. Developing this process as you go extends the time to recovery. 

 **Benefits of establishing this best practice:** By alerting only when action is required, you prevent low value alerts from concealing high value alerts. By having a process for every actionable alert, you create a consistent and prompt response to events in your environment. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion. Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes. By having these processes, you ensure effective and prompt responses to operations events and you can prevent actionable events from being obscured by less valuable notifications. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable to ensure that the automatic scaling rules and limits are appropriate for workload needs. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

# OPS10-BP03 Prioritize operational events based on business impact
<a name="ops_event_response_prioritize_events"></a>

 Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, or damage to reputation or trust. 

 **Common anti-patterns:** 
+  You receive a support request to add a printer configuration for a user. While working on the issue, you receive a support request stating that your retail site is down. After completing the printer configuration for your user, you start work on the website issue. 
+  You get notified that both your retail website and your payroll system are down. You don't know which one should get priority. 

 **Benefits of establishing this best practice:** Prioritizing responses to the incidents with the greatest impact on the business notifies your management of that impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust. 

# OPS10-BP04 Define escalation paths
<a name="ops_event_response_define_escalation_paths"></a>

 Define escalation paths in your runbooks and playbooks, including what initiates escalation, and procedures for escalation. Specifically identify owners for each action to ensure effective and prompt responses to operations events. 

 Identify when a human decision is required before an action is taken. Work with decision makers to have that decision made in advance, and the action preapproved, so that MTTR is not extended waiting for a response. 

 **Common anti-patterns:** 
+  Your retail site is down. You don't understand the runbook for recovering the site. You start calling colleagues hoping that someone will be able to help you. 
+  You receive a support case for an unreachable application. You don't have permissions to administer the system. You don't know who does. You attempt to contact the system owner that opened the case and there is no response. You have no contacts for the system and your colleagues are not familiar with it. 

 **Benefits of establishing this best practice:** By defining escalations, what initiates the escalation, and procedures for escalation you provide the systematic addition of resources to an incident at an appropriate rate for the impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define escalation paths: Define escalation paths in your runbooks and playbooks, including what starts escalation, and procedures for escalation. For example, escalation of an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed. Another example of an appropriate escalation path is from senior support engineers to the development team for a workload when the playbooks are unable to identify a path to remediation, or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties. For example, a network connectivity provider or a software vendor. Escalations can include identified authorized decision makers for impacted systems. 

# OPS10-BP05 Define a customer communication plan for outages
<a name="ops_event_response_push_notify"></a>

 Define and test a communication plan for system outages that you can rely on to keep your customers and stakeholders informed during outages. Communicate directly with your users both when the services they use are impacted and when services return to normal. 

 **Desired outcome:** 
+  You have a communication plan for situations ranging from scheduled maintenance to large unexpected failures, including invocation of disaster recovery plans. 
+  In your communications, you provide clear and transparent information about systems issues to help customers avoid second guessing the performance of their systems. 
+  You use custom error messages and status pages to reduce the spike in help desk requests and keep users informed. 
+  The communication plan is regularly tested to verify that it will perform as intended when a real outage occurs. 

 **Common anti-patterns:** 
+ A workload outage occurs but you have no communication plan. Users overwhelm your trouble ticket system with requests because they have no information on the outage.
+ You send an email notification to your users during an outage. It doesn’t contain a timeline for restoration of service so users cannot plan around the outage.
+ There is a communication plan for outages but it has never been tested. An outage occurs and the communication plan fails because a critical step was missed that could have been caught in testing.
+  During an outage, you send a notification to users with too many technical details and information under your AWS NDA. 

 **Benefits of establishing this best practice:** 
+  Maintaining communication during outages ensures that customers are provided with visibility of progress on issues and estimated time to resolution. 
+  Developing a well-defined communications plan verifies that your customers and end users are well informed so they can take required additional steps to mitigate the impact of outages. 
+  With proper communications and increased awareness of planned and unplanned outages, you can improve customer satisfaction, limit unintended reactions, and drive customer retention. 
+  Timely and transparent system outage communication builds confidence and establishes trust needed to maintain relationships between you and your customers. 
+  A proven communication strategy during an outage or crisis reduces speculation and gossip that could hinder your ability to recover. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Communication plans that keep your customers informed during outages are holistic and cover multiple interfaces including customer facing error pages, custom API error messages, system status banners, and health status pages. If your system includes registered users, you can communicate over messaging channels such as email, SMS or push notifications to send personalized message content to your customers. 

 **Customer communication tools** 

 As a first line of defense, web and mobile applications should provide friendly and informative error messages during an outage as well as have the ability to redirect traffic to a status page. [Amazon CloudFront](https://aws.amazon.com/cloudfront/) is a fully managed content delivery network (CDN) that includes capabilities to define and serve custom error content. Custom error pages in CloudFront are a good first layer of customer messaging for component level outages. CloudFront can also simplify managing and activating a status page to intercept all requests during planned or unplanned outages. 

 Custom API error messages can help detect and reduce impact when outages are isolated to discrete services. [Amazon API Gateway](https://aws.amazon.com/api-gateway/) allows you to configure custom responses for your REST APIs. This allows you to provide clear and meaningful messaging to API consumers when API Gateway is not able to reach backend services. Custom messages can also be used to support outage banner content and notifications when a particular system feature is degraded due to service tier outages. 

 Direct messaging is the most personalized type of customer messaging. [Amazon Pinpoint](https://aws.amazon.com/pinpoint/) is a managed service for scalable multichannel communications. Amazon Pinpoint allows you to build campaigns that can broadcast messages widely across your impacted customer base over SMS, email, voice, push notifications, or custom channels you define. When you manage messaging with Amazon Pinpoint, message campaigns are well defined, testable, and can be intelligently applied to targeted customer segments. Once established, campaigns can be scheduled or started by events and they can easily be tested. 

 **Customer example** 

 When the workload is impaired, AnyCompany Retail sends out an email notification to their users. The email describes what business functionality is impaired and provides a realistic estimate of when service will be restored. In addition, they have a status page that shows real-time information about the health of their workload. The communication plan is tested in a development environment twice per year to validate that it is effective. 

 **Implementation steps** 

1.  Determine the communication channels for your messaging strategy. Consider the architectural aspects of your application and determine the best strategy for delivering feedback to your customers. This could include one or more of the guidance strategies outlined including error and status pages, custom API error responses, or direct messaging. 

1.  Design status pages for your application. If you’ve determined that status or custom error pages are suitable for your customers, you’ll need to design your content and messaging for those pages. Error pages explain to users why an application is not available, when it may become available again, and what they can do in the meantime. If your application uses Amazon CloudFront you can serve [custom error responses](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/GeneratingCustomErrorResponses.html) or use Lambda at Edge to [translate errors](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-examples.html#lambda-examples-update-error-status-examples) and rewrite page content. CloudFront also makes it possible to swap destinations from your application content to a static [Amazon S3](https://aws.amazon.com/s3/) content origin containing your maintenance or outage status page . 

1.  Design the correct set of API error statuses for your service. Error messages produced by API Gateway when it can’t reach backend services, as well as service tier exceptions, may not contain friendly messages suitable for display to end users. Without having to make code changes to your backend services, you can configure API Gateway [custom error responses](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html) to map HTTP response codes to curated API error messages. 

1.  Design messaging from a business perspective so that it is relevant to end users for your system and does not contain technical details. Consider your audience and align your messaging. For example, you may steer internal users towards a workaround or manual process that leverages alternate systems. External users may be asked to wait until the system is restored, or subscribe to updates to receive a notification once the system is restored. Define approved messaging for multiple scenarios including unexpected outages, planned maintenance, and partial system failures where a particular feature may be degraded or unavailable. 

1.  Templatize and automate your customer messaging. Once you have established your message content, you can use [Amazon Pinpoint](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html) or other tools to automate your messaging campaign. With Amazon Pinpoint you can create customer target segments for specific affected users and transform messages into templates. Review the [Amazon Pinpoint tutorial](https://docs.aws.amazon.com/pinpoint/latest/developerguide/tutorials.html) to get an understanding of how-to setup a messaging campaign. 

1.  Avoiding tightly coupling messaging capabilities to your customer facing system. Your messaging strategy should not have hard dependencies on system data stores or services to verify that you can successfully send messages when you experience outages. Consider building the ability to send messages from more than [one Availability Zone or Region](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_fault_isolation_multiaz_region_system.html) for messaging availability. If you are using AWS services to send messages, leverage data plane operations over [control plane operation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_avoid_control_plane.html) to invoke your messaging. 

 **Level of effort for the implementation plan:** High. Developing a communication plan, and the mechanisms to send it, can require a significant effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md) - Your communication plan should have a runbook associated with it so that your personnel know how to respond. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) - After an outage, conduct post-incident analysis to identify mechanisms to prevent another outage. 

 **Related documents:** 
+ [ Error Handling Patterns in Amazon API Gateway and AWS Lambda](https://aws.amazon.com/blogs/compute/error-handling-patterns-in-amazon-api-gateway-and-aws-lambda/)
+ [ Amazon API Gateway responses ](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html#supported-gateway-response-types)

 **Related examples:** 
+ [AWS Health Dashboard ](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/)
+ [ Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region ](https://aws.amazon.com/message/12721/)

 **Related services:** 
+ [AWS Support](https://aws.amazon.com/premiumsupport/)
+ [AWS Customer Agreement ](https://aws.amazon.com/agreement/)
+ [ Amazon CloudFront ](https://aws.amazon.com/cloudfront/)
+ [ Amazon API Gateway ](https://aws.amazon.com/api-gateway/)
+ [ Amazon Pinpoint ](https://aws.amazon.com/pinpoint/)
+ [ Amazon S3 ](https://aws.amazon.com/s3/)

# OPS10-BP06 Communicate status through dashboards
<a name="ops_event_response_dashboards"></a>

 Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. 

 You can create dashboards using [Amazon CloudWatch Dashboards](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) on customizable home pages in the CloudWatch console. Using business intelligence services such as [Quick](https://aws.amazon.com/quicksight/) you can create and publish interactive dashboards of your workload and operational health (for example, order rates, connected users, and transaction times). Create Dashboards that present system and business-level views of your metrics. 

 **Common anti-patterns:** 
+  Upon request, you run a report on the current utilization of your application for management. 
+  During an incident, you are contacted every twenty minutes by a concerned system owner wanting to know if it is fixed yet. 

 **Benefits of establishing this best practice:** By creating dashboards, you create self-service access to information helping your customers to informed themselves and determine if they need to take action. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team. Examples include Amazon CloudWatch dashboards, and AWS Health Dashboard. 
  +  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

# OPS10-BP07 Automate responses to events
<a name="ops_event_response_auto_event_response"></a>

 Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 

 There are multiple ways to automate runbook and playbook actions on AWS. To respond to an event from a state change in your AWS resources, or from your own custom events, you should create [CloudWatch Events rules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) to initiate responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation). 

 To respond to a metric that crosses a threshold for a resource (for example, wait time), you should create [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) to perform one or more actions using Amazon EC2 actions, Auto Scaling actions, or to send a notification to an Amazon SNS topic. If you need to perform custom actions in response to an alarm, invoke Lambda through an Amazon SNS notification. Use Amazon SNS to publish event notifications and escalation messages to keep people informed. 

 AWS also supports third-party systems through the AWS service APIs and SDKs. There are a number of monitoring tools provided by AWS Partners and third parties that allow for monitoring, notifications, and responses. Some of these tools include New Relic, Splunk, Loggly, SumoLogic, and Datadog. 

 You should keep critical manual procedures available for use when automated procedures fail 

 **Common anti-patterns:** 
+  A developer checks in their code. This event could have been used to start a build and then perform testing but instead nothing happens. 
+  Your application logs a specific error before it stops working. The procedure to restart the application is well understood and could be scripted. You could use the log event to invoke a script and restart the application. Instead, when the error happens at 3am Sunday morning, you are woken up as the on-call resource responsible to fix the system. 

 **Benefits of establishing this best practice:** By using automated responses to events, you reduce the time to respond and limit the introduction of errors from manual activities. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
  +  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
  +  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 
+  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
+  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

 **Related examples:** 

# Evolve
<a name="a-evolve"></a>

**Topics**
+ [OPS 11. How do you evolve operations?](ops-11.md)

# OPS 11. How do you evolve operations?
<a name="ops-11"></a>

 Dedicate time and resources for nearly continuous incremental improvement to evolve the effectiveness and efficiency of your operations. 

**Topics**
+ [OPS11-BP01 Have a process for continuous improvement](ops_evolve_ops_process_cont_imp.md)
+ [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md)
+ [OPS11-BP03 Implement feedback loops](ops_evolve_ops_feedback_loops.md)
+ [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md)
+ [OPS11-BP05 Define drivers for improvement](ops_evolve_ops_drivers_for_imp.md)
+ [OPS11-BP06 Validate insights](ops_evolve_ops_validate_insights.md)
+ [OPS11-BP07 Perform operations metrics reviews](ops_evolve_ops_metrics_review.md)
+ [OPS11-BP08 Document and share lessons learned](ops_evolve_ops_share_lessons_learned.md)
+ [OPS11-BP09 Allocate time to make improvements](ops_evolve_ops_allocate_time_for_imp.md)

# OPS11-BP01 Have a process for continuous improvement
<a name="ops_evolve_ops_process_cont_imp"></a>

Evaluate your workload against internal and external architecture best practices. Conduct workload reviews at least once per year. Prioritize improvement opportunities into your software development cadence. 

 **Desired outcome:** 
+  You analyze your workload against architecture best practices at least yearly. 
+  Improvement opportunities are given equal priority in your software development process. 

 **Common anti-patterns:** 
+ You have not conducted an architecture review on your workload since it was deployed several years ago.
+ Improvement opportunities are given a lower priority and stay in the backlog.
+  There is no standard for implementing modifications to best practices for the organization. 

 **Benefits of establishing this best practice:** 
+  Your workload is kept up to date on architecture best practices. 
+  Evolving your workload is done in a deliberate manner. 
+  You can leverage organization best practices to improve all workloads. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 On at least a yearly basis, you conduct an architectural review of your workload. Using internal and external best practices, evaluate your workload and identify improvement opportunities. Prioritize improvement opportunities into your software development cadence. 

 **Customer example** 

 All workloads at AnyCompany Retail go through a yearly architecture review process. They developed their own checklist of best practices that apply to all workloads. Using the AWS Well-Architected Tool’s Custom Lens feature, they conduct reviews using the tool and their custom lens of best practices. Improvement opportunities generated from the reviews are given priority in their software sprints. 

 **Implementation steps** 

1.  Conduct periodic architecture reviews of your production workload at least yearly. Use a documented architectural standard that includes AWS-specific best practices. 

   1.  We recommend you use your own internally defined standards it for these reviews. If you do not have an internal standard, we recommend you use the AWS Well-Architected Framework. 

   1.  You can use the AWS Well-Architected Tool to create a Custom Lens of your internal best practices and conduct your architecture review. 

   1.  Customers can contact their AWS Solutions Architect to conduct a guided Well-Architected Framework Review of their workload. 

1.  Prioritize improvement opportunities identified during the review into your software development process. 

 **Level of effort for the implementation plan:** Low. You can use the AWS Well-Architected Framework to conduct your yearly architecture review. 

### Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) - Post-incident analysis is another generator for improvement items. Feed lessons learned into your internal list of architecture best practices. 
+  [OPS11-BP08 Document and share lessons learned](ops_evolve_ops_share_lessons_learned.md) - As you develop your own architecture best practices, share those across your organization. 

 **Related documents:** 
+ [AWS Well-Architected Tool - Custom lenses ](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html)
+ [AWS Well-Architected Whitepaper - The review process ](https://docs.aws.amazon.com/wellarchitected/latest/framework/the-review-process.html)
+ [ Customize Well-Architected Reviews using Custom Lenses and the AWS Well-Architected Tool](https://aws.amazon.com/blogs/mt/customize-well-architected-reviews-using-custom-lenses-and-the-aws-well-architected-tool/)
+ [ Implementing the AWS Well-Architected Custom Lens lifecycle in your organization ](https://aws.amazon.com/blogs/architecture/implementing-the-aws-well-architected-custom-lens-lifecycle-in-your-organization/)

 **Related videos:** 
+ [ Well-Architected Labs - Level 100: Custom Lenses on AWS Well-Architected Tool](https://www.wellarchitectedlabs.com/well-architectedtool/100_labs/100_custom_lenses_on_watool/)

 **Related examples:** 
+ [ The AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html)

# OPS11-BP02 Perform post-incident analysis
<a name="ops_evolve_ops_perform_rca_process"></a>

 Review customer-impacting events, and identify the contributing factors and preventative actions. Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses. Communicate contributing factors and corrective actions as appropriate, tailored to target audiences. 

 **Common anti-patterns:** 
+  You administer an application server. Approximately every 23 hours and 55 minutes all your active sessions are terminated. You have tried to identify what is going wrong on your application server. You suspect it could instead be a network issue but are unable to get cooperation from the network team as they are too busy to support you. You lack a predefined process to follow to get support and collect the information necessary to determine what is going on. 
+  You have had data loss within your workload. This is the first time it has happened and the cause is not obvious. You decide it is not important because you can recreate the data. Data loss starts occurring with greater frequency impacting your customers. This also places addition operational burden on you as you restore the missing data. 

 **Benefits of establishing this best practice:** Having a predefined processes to determine the components, conditions, actions, and events that contributed to an incident helps you to identify opportunities for improvement. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use a process to determine contributing factors: Review all customer impacting incidents. Have a process to identify and document the contributing factors of an incident so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses. Communicate root cause as appropriate, tailored to target audiences. 

# OPS11-BP03 Implement feedback loops
<a name="ops_evolve_ops_feedback_loops"></a>

Feedback loops provide actionable insights that drive decision making. Build feedback loops into your procedures and workloads. This helps you identify issues and areas that need improvement. They also validate investments made in improvements. These feedback loops are the foundation for continuously improving your workload.

 Feedback loops fall into two categories: *immediate feedback* and *retrospective analysis*. Immediate feedback is gathered through review of the performance and outcomes from operations activities. This feedback comes from team members, customers, or the automated output of the activity. Immediate feedback is received from things like A/B testing and shipping new features, and it is essential to failing fast. 

 Retrospective analysis is performed regularly to capture feedback from the review of operational outcomes and metrics over time. These retrospectives happen at the end of a sprint, on a cadence, or after major releases or events. This type of feedback loop validates investments in operations or your workload. It helps you measure success and validates your strategy. 

 **Desired outcome:** You use immediate feedback and retrospective analysis to drive improvements. There is a mechanism to capture user and team member feedback. Retrospective analysis is used to identify trends that drive improvements. 

 **Common anti-patterns:** 
+ You launch a new feature but have no way of receiving customer feedback on it.
+ After investing in operations improvements, you don’t conduct a retrospective to validate them.
+ You collect customer feedback but don’t regularly review it.
+ Feedback loops lead to proposed action items but they aren’t included in the software development process.
+  Customers don’t receive feedback on improvements they’ve proposed. 

 **Benefits of establishing this best practice:** 
+  You can work backwards from the customer to drive new features. 
+  Your organization culture can react to changes faster. 
+  Trends are used to identify improvement opportunities. 
+  Retrospectives validate investments made to your workload and operations. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing this best practice means that you use both immediate feedback and retrospective analysis. These feedback loops drive improvements. There are many mechanisms for immediate feedback, including surveys, customer polls, or feedback forms. Your organization also uses retrospectives to identify improvement opportunities and validate initiatives. 

 **Customer example** 

 AnyCompany Retail created a web form where customers can give feedback or report issues. During the weekly scrum, user feedback is evaluated by the software development team. Feedback is regularly used to steer the evolution of their platform. They conduct a retrospective at the end of each sprint to identify items they want to improve. 

## Implementation steps
<a name="implementation-steps"></a>

1. Immediate feedback
   +  You need a mechanism to receive feedback from customers and team members. Your operations activities can also be configured to deliver automated feedback. 
   +  Your organization needs a process to review this feedback, determine what to improve, and schedule the improvement. 
   +  Feedback must be added into your software development process. 
   +  As you make improvements, follow up with the feedback submitter. 
     +  You can use [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) to create and track these improvements as [OpsItems](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-working-with-OpsItems.html).

1.  Retrospective analysis 
   +  Conduct retrospectives at the end of a development cycle, on a set cadence, or after a major release. 
   +  Gather stakeholders involved in the workload for a retrospective meeting. 
   +  Create three columns on a whiteboard or spreadsheet: Stop, Start, and Keep. 
     +  *Stop* is for anything that you want your team to stop doing. 
     +  *Start* is for ideas that you want to start doing. 
     +  *Keep* is for items that you want to keep doing. 
   +  Go around the room and gather feedback from the stakeholders. 
   +  Prioritize the feedback. Assign actions and stakeholders to any Start or Keep items. 
   +  Add the actions to your software development process and communicate status updates to stakeholders as you make the improvements. 

 **Level of effort for the implementation plan:** Medium. To implement this best practice, you need a way to take in immediate feedback and analyze it. Also, you need to establish a retrospective analysis process. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP01 Evaluate external customer needs](ops_priorities_ext_cust_needs.md): Feedback loops are a mechanism to gather external customer needs. 
+  [OPS01-BP02 Evaluate internal customer needs](ops_priorities_int_cust_needs.md): Internal stakeholders can use feedback loops to communicate needs and requirements. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Post-incident analyses are an important form of retrospective analysis conducted after incidents. 
+  [OPS11-BP07 Perform operations metrics reviews](ops_evolve_ops_metrics_review.md): Operations metrics reviews identify trends and areas for improvement. 

 **Related documents:** 
+  [7 Pitfalls to Avoid When Building a CCOE](https://aws.amazon.com/blogs/enterprise-strategy/7-pitfalls-to-avoid-when-building-a-ccoe/) 
+  [Atlassian Team Playbook - Retrospectives](https://www.atlassian.com/team-playbook/plays/retrospective) 
+  [Email Definitions: Feedback Loops](https://aws.amazon.com/blogs/messaging-and-targeting/email-definitions-feedback-loops/) 
+  [Establishing Feedback Loops Based on the AWS Well-Architected Framework Review](https://aws.amazon.com/blogs/architecture/establishing-feedback-loops-based-on-the-aws-well-architected-framework-review/) 
+  [IBM Garage Methodology - Hold a retrospective](https://www.ibm.com/garage/method/practices/learn/practice_retrospective_analysis/) 
+  [Investopedia – The PDCS Cycle](https://www.investopedia.com/terms/p/pdca-cycle.asp) 
+  [Maximizing Developer Effectiveness by Tim Cochran](https://martinfowler.com/articles/developer-effectiveness.html) 
+  [Operations Readiness Reviews (ORR) Whitepaper - Iteration](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/iteration.html) 
+  [ITIL CSI - Continual Service Improvement](https://wiki.en.it-processmaps.com/index.php/ITIL_CSI_-_Continual_Service_Improvement)
+  [When Toyota met e-commerce: Lean at Amazon](https://www.mckinsey.com/capabilities/operations/our-insights/when-toyota-met-e-commerce-lean-at-amazon) 

 **Related videos:** 
+  [Building Effective Customer Feedback Loops](https://www.youtube.com/watch?v=zz_VImJRZ3U) 

 **Related examples: ** 
+  [Astuto - Open source customer feedback tool](https://github.com/riggraz/astuto) 
+  [AWS Solutions - QnABot on AWS](https://aws.amazon.com/solutions/implementations/qnabot-on-aws/) 
+  [Fider - A platform to organize customer feedback](https://github.com/getfider/fider) 

 **Related services:** 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 

# OPS11-BP04 Perform knowledge management
<a name="ops_evolve_ops_knowledge_management"></a>

Knowledge management helps team members find the information to perform their job. In learning organizations, information is freely shared which empowers individuals. The information can be discovered or searched. Information is accurate and up to date. Mechanisms exist to create new information, update existing information, and archive outdated information. The most common example of a knowledge management platform is a content management system like a wiki. 

 **Desired outcome:** 
+  Team members have access to timely, accurate information. 
+  Information is searchable. 
+  Mechanisms exist to add, update, and archive information. 

 **Common anti-patterns:** 
+ There is no centralized knowledge storage. Team members manage their own notes on their local machines.
+  You have a self-hosted wiki but no mechanisms to manage information, resulting in outdated information. 
+  Someone identifies missing information but there’s no process to request adding it the team wiki. They add it themselves but they miss a key step, leading to an outage. 

 **Benefits of establishing this best practice:** 
+  Team members are empowered because information is shared freely. 
+  New team members are onboarded faster because documentation is up to date and searchable. 
+  Information is timely, accurate, and actionable. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Knowledge management is an important facet of learning organizations. To begin, you need a central repository to store your knowledge (as a common example, a self-hosted wiki). You must develop processes for adding, updating, and archiving knowledge. Develop standards for what should be documented and let everyone contribute. 

 **Customer example** 

 AnyCompany Retail hosts an internal Wiki where all knowledge is stored. Team members are encouraged to add to the knowledge base as they go about their daily duties. On a quarterly basis, a cross-functional team evaluates which pages are least updated and determines if they should be archived or updated. 

 **Implementation steps** 

1.  Start with identifying the content management system where knowledge will be stored. Get agreement from stakeholders across your organization. 

   1.  If you don’t have an existing content management system, consider running a self-hosted wiki or using a version control repository as a starting point. 

1.  Develop runbooks for adding, updating, and archiving information. Educate your team on these processes. 

1.  Identify what knowledge should be stored in the content management system. Start with daily activities (runbooks and playbooks) that team members perform. Work with stakeholders to prioritize what knowledge is added. 

1.  On a periodic basis, work with stakeholders to identify out-of-date information and archive it or bring it up to date. 

 **Level of effort for the implementation plan:** Medium. If you don’t have an existing content management system, you can set up a self-hosted wiki or a version-controlled document repository. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP08 Document and share lessons learned](ops_evolve_ops_share_lessons_learned.md) - Knowledge management facilitates information sharing about lessons learned. 

 **Related documents:** 
+ [ Atlassian - Knowledge Management ](https://www.atlassian.com/itsm/knowledge-management)

 **Related examples:** 
+ [ DokuWiki ](https://www.dokuwiki.org/dokuwiki)
+ [ Gollum ](https://github.com/gollum/gollum)
+ [ MediaWiki ](https://www.mediawiki.org/wiki/MediaWiki)
+ [ Wiki.js ](https://github.com/Requarks/wiki)

# OPS11-BP05 Define drivers for improvement
<a name="ops_evolve_ops_drivers_for_imp"></a>

 Identify drivers for improvement to help you evaluate and prioritize opportunities. 

 On AWS, you can aggregate the logs of all your operations activities, workloads, and infrastructure to create a detailed activity history. You can then use AWS tools to analyze your operations and workload health over time (for example, identify trends, correlate events and activities to outcomes, and compare and contrast between environments and across systems) to reveal opportunities for improvement based on your drivers. 

 You should use CloudTrail to track API activity (through the AWS Management Console, CLI, SDKs, and APIs) to know what is happening across your accounts. Track your AWS developer Tools deployment activities with CloudTrail and CloudWatch. This will add a detailed activity history of your deployments and their outcomes to your CloudWatch Logs log data. 

 [Export your log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) for long-term storage. Using [AWS Glue](https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc), you discover and prepare your log data in Amazon S3 for analytics. Use [Amazon Athena](https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc), through its native integration with AWS Glue, to analyze your log data. Use a business intelligence tool like [Quick](https://aws.amazon.com/quicksight/) to visualize, explore, and analyze your data 

 **Common anti-patterns:** 
+  You have a script that works but is not elegant. You invest time in rewriting it. It is now a work of art. 
+  Your start-up is trying to get another set of funding from a venture capitalist. They want you to demonstrate compliance with PCI DSS. You want to make them happy so you document your compliance and miss a delivery date for a customer, losing that customer. It wasn't a wrong thing to do but now you wonder if it was the right thing to do. 

 **Benefits of establishing this best practice:** By determining the criteria you want to use for improvement, you can minimize the impact of event based motivations or emotional investment. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Understand drivers for improvement: You should only make changes to a system when a desired outcome is supported. 
  +  Desired capabilities: Evaluate desired features and capabilities when evaluating opportunities for improvement. 
    +  [What's New with AWS](https://aws.amazon.com/new/) 
  +  Unacceptable issues: Evaluate unacceptable issues, bugs, and vulnerabilities when evaluating opportunities for improvement. 
    +  [AWS Latest Security Bulletins](https://aws.amazon.com/security/security-bulletins/) 
    +  [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/trustedadvisor/) 
  +  Compliance requirements: Evaluate updates and changes required to maintain compliance with regulation, policy, or to remain under support from a third party, when reviewing opportunities for improvement. 
    +  [AWS Compliance](https://aws.amazon.com/compliance/) 
    +  [AWS Compliance Programs](https://aws.amazon.com/compliance/programs/) 
    +  [AWS Compliance Latest News](https://aws.amazon.com/compliance/compliance-latest-news/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon Athena](https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [AWS Compliance](https://aws.amazon.com/compliance/) 
+  [AWS Compliance Latest News](https://aws.amazon.com/compliance/compliance-latest-news/) 
+  [AWS Compliance Programs](https://aws.amazon.com/compliance/programs/) 
+  [AWS Glue](https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) 
+  [AWS Latest Security Bulletins](https://aws.amazon.com/security/security-bulletins/) 
+  [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/trustedadvisor/) 
+  [Export your log data to Amazon S3](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html) 
+  [What's New with AWS](https://aws.amazon.com/new/) 

# OPS11-BP06 Validate insights
<a name="ops_evolve_ops_validate_insights"></a>

 Review your analysis results and responses with cross-functional teams and business owners. Use these reviews to establish common understanding, identify additional impacts, and determine courses of action. Adjust responses as appropriate. 

 **Common anti-patterns:** 
+  You see that CPU utilization is at 95% on a system and make it a priority to find a way to reduce load on the system. You determine the best course of action is to scale up. The system is a transcoder and the system is scaled to run at 95% CPU utilization all the time. The system owner could have explained the situation to you had you contacted them. Your time has been wasted. 
+  A system owner maintains that their system is mission critical. The system was not placed in a high security environment. To improve security, you implement the additional detective and preventative controls that are required for mission critical systems. You notify the system owner that the work is complete and that he will be charged for the additional resources. In the discussion following this notification, the system owner learns there is a formal definition for mission critical systems that this system does not meet. 

 **Benefits of establishing this best practice:** By validating insights with business owners and subject matter experts, you can establish common understanding and more effectively guide improvement. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Validate insights: Engage with business owners and subject matter experts to ensure there is common understanding and agreement of the meaning of the data you have collected. Identify additional concerns, potential impacts, and determine a courses of action. 

# OPS11-BP07 Perform operations metrics reviews
<a name="ops_evolve_ops_metrics_review"></a>

 Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business. Use these reviews to identify opportunities for improvement, potential courses of action, and to share lessons learned. 

 Look for opportunities to improve in all of your environments (for example, development, test, and production). 

 **Common anti-patterns:** 
+  There was a significant retail promotion that was interrupted by your maintenance window. The business remains unaware that there is a standard maintenance window that could be delayed if there are other business impacting events. 
+  You suffered an extended outage because of your use of a buggy library commonly used in your organization. You have since migrated to a reliable library. The other teams in your organization do not know that they are at risk. If you met regularly and reviewed this incident, they would be aware of the risk. 
+  Performance of your transcoder has been falling off steadily and impacting the media team. It isn't terrible yet. You will not have an opportunity to find out until it is bad enough to cause an incident. Were you to review your operations metrics with the media team, there would be an opportunity for the change in metrics and their experience to be recognized and the issue addressed. 
+  You are not reviewing your satisfaction of customer SLAs. You are trending to not meet your customer SLAs. There are financial penalties related to not meeting your customer SLAs. If you meet regularly to review the metrics for these SLAs, you would have the opportunity to recognize and address the issue. 

 **Benefits of establishing this best practice:** By meeting regularly to review operations metrics, events, and incidents, you maintain common understanding across teams, share lessons learned, and can prioritize and target improvements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Operations metrics reviews: Regularly perform retrospective analysis of operations metrics with cross-team participants from different areas of the business. Engage stakeholders, including the business, development, and operations teams, to validate your findings from immediate feedback and retrospective analysis, and to share lessons learned. Use their insights to identify opportunities for improvement and potential courses of action. 
  +  [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) 
  +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
  +  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
  +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [Publish custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 

# OPS11-BP08 Document and share lessons learned
<a name="ops_evolve_ops_share_lessons_learned"></a>

 Document and share lessons learned from the operations activities so that you can use them internally and across teams. 

 You should share what your teams learn to increase the benefit across your organization. You will want to share information and resources to prevent avoidable errors and ease development efforts. This will allow you to focus on delivering desired features. 

 Use AWS Identity and Access Management (IAM) to define permissions permitting controlled access to the resources you wish to share within and across accounts. You should then use version-controlled AWS CodeCommit repositories to share application libraries, scripted procedures, procedure documentation, and other system documentation. Share your compute standards by sharing access to your AMIs and by authorizing the use of your Lambda functions across accounts. You should also share your infrastructure standards as AWS CloudFormation templates. 

 Through the AWS APIs and SDKs, you can integrate external and third-party tools and repositories (for example, GitHub, BitBucket, and SourceForge). When sharing what you have learned and developed, be careful to structure permissions to ensure the integrity of shared repositories. 

 **Common anti-patterns:** 
+  You suffered an extended outage because of your use of a buggy library commonly used in your organization. You have since migrated to a reliable library. The other teams in your organization do not know they are at risk. Were you to document and share your experience with this library, they would be aware of the risk. 
+  You have identified an edge case in an internally shared microservice that causes sessions to drop. You have updated your calls to the service to avoid this edge case. The other teams in your organization do not know that they are at risk. Were you to document and share your experience with this library, they would be aware of the risk. 
+  You have found a way to significantly reduce the CPU utilization requirements for one of your microservices. You do not know if any other teams could take advantage of this technique. Were you to document and share your experience with this library, they would have the opportunity to do so. 

 **Benefits of establishing this best practice:** Share lessons learned to support improvement and to maximize the benefits of experience. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Document and share lessons learned: Have procedures to document the lessons learned from the running of operations activities and retrospective analysis so that they can be used by other teams. 
  +  Share learnings: Have procedures to share lessons learned and associated artifacts across teams. For example, share updated procedures, guidance, governance, and best practices through an accessible wiki. Share scripts, code, and libraries through a common repository. 
    +  [Delegating access to your AWS environment](https://www.youtube.com/watch?v=0zJuULHFS6A&t=849s) 
    +  [Share an AWS CodeCommit repository](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-share-repository.html) 
    +  [Easy authorization of AWS Lambda functions](https://aws.amazon.com/blogs/compute/easy-authorization-of-aws-lambda-functions/) 
    +  [Sharing an AMI with specific AWS Accounts](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sharingamis-explicit.html) 
    +  [Speed template sharing with an AWS CloudFormation designer URL](https://aws.amazon.com/blogs/devops/speed-template-sharing-with-an-aws-cloudformation-designer-url/) 
    +  [Using AWS Lambda with Amazon SNS](https://docs.aws.amazon.com/lambda/latest/dg/with-sns-example.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Easy authorization of AWS Lambda functions](https://aws.amazon.com/blogs/compute/easy-authorization-of-aws-lambda-functions/) 
+  [Share an AWS CodeCommit repository](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-share-repository.html) 
+  [Sharing an AMI with specific AWS Accounts](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sharingamis-explicit.html) 
+  [Speed template sharing with an AWS CloudFormation designer URL](https://aws.amazon.com/blogs/devops/speed-template-sharing-with-an-aws-cloudformation-designer-url/) 
+  [Using AWS Lambda with Amazon SNS](https://docs.aws.amazon.com/lambda/latest/dg/with-sns-example.html) 

 **Related videos:** 
+  [Delegating access to your AWS environment](https://www.youtube.com/watch?v=0zJuULHFS6A&t=849s) 

# OPS11-BP09 Allocate time to make improvements
<a name="ops_evolve_ops_allocate_time_for_imp"></a>

 Dedicate time and resources within your processes to make continuous incremental improvements possible. 

 On AWS, you can create temporary duplicates of environments, lowering the risk, effort, and cost of experimentation and testing. These duplicated environments can be used to test the conclusions from your analysis, experiment, and develop and test planned improvements. 

 **Common anti-patterns:** 
+  There is a known performance issue in your application server. It is added to the backlog behind every planned feature implementation. If the rate of planned features being added remains constant, the performance issue will never be addressed. 
+  To support continual improvement you approve administrators and developers using all their extra time to select and implement improvements. No improvements are ever completed. 

 **Benefits of establishing this best practice:** By dedicating time and resources within your processes you make continuous incremental improvements possible. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Allocate time to make improvements: Dedicate time and resources within your processes to make continuous incremental improvements possible. Implement changes to improve and evaluate the results to determine success. If the results do not satisfy the goals, and the improvement is still a priority, pursue alternative courses of action.