

# Prepare
<a name="a-prepare"></a>

**Topics**
+ [OPS 4. How do you design your workload so that you can understand its state?](ops-04.md)
+ [OPS 5. How do you reduce defects, ease remediation, and improve flow into production?](ops-05.md)
+ [OPS 6. How do you mitigate deployment risks?](ops-06.md)
+ [OPS 7. How do you know that you are ready to support a workload?](ops-07.md)

# OPS 4. How do you design your workload so that you can understand its state?
<a name="ops-04"></a>

 Design your workload so that it provides the information necessary across all components (for example, metrics, logs, and traces) for you to understand its internal state. This allows you to provide effective responses when appropriate. 

**Topics**
+ [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md)
+ [OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md)
+ [OPS04-BP03 Implement user activity telemetry](ops_telemetry_customer_telemetry.md)
+ [OPS04-BP04 Implement dependency telemetry](ops_telemetry_dependency_telemetry.md)
+ [OPS04-BP05 Implement transaction traceability](ops_telemetry_dist_trace.md)

# OPS04-BP01 Implement application telemetry
<a name="ops_telemetry_application_telemetry"></a>

 Application telemetry is the foundation for observability of your workload. Your application should emit telemetry that provides insight into the state of the application and the achievement of business outcomes. From troubleshooting to measuring the impact of a new feature, application telemetry informs the way you build, operate, and evolve your workload. 

 Application telemetry consists of metrics and logs. Metrics are diagnostic information, such as your pulse or temperature. Metrics are used collectively to describe the state of your application. Collecting metrics over time can be used to develop baselines and detect anomalies. Logs are messages that the application sends about its internal state or events that occur. Error codes, transaction identifiers, and user actions are examples of events that are logged. 

 **Desired Outcome:** 
+  Your application emits metrics and logs that provide insight into its health and the achievement of business outcomes. 
+  Metrics and logs are stored centrally for all applications in the workload. 

 **Common anti-patterns:** 
+  Your application doesn't emit telemetry. You are forced to rely upon your customers to tell you when something is wrong. 
+  A customer has reported that your application is unresponsive. You have no telemetry and are unable to confirm that the issue exists or characterize the issue without using the application yourself to understand the current user experience. 

 **Benefits of establishing this best practice:** 
+  You can understand the health of your application, the user experience, and the achievement of business outcomes. 
+  You can react quickly to changes in your application health. 
+  You can develop application health trends. 
+  You can make informed decisions about improving your application. 
+  You can detect and resolve application issues faster. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing application telemetry consists of three steps: identifying a location to store telemetry, identifying telemetry that describes the state of the application, and instrumenting the application to emit telemetry. 

 **Customer example** 

AnyCompany Retail has a microservices based architecture. As part of their architectural design process, they identified application telemetry that would help them understand the state of each microservice. For example, the user cart service emits telemetry about events like add to cart, abandon cart, and length of time it took to add an item to the cart. All microservices log errors, warnings, and transaction information. Telemetry is sent to Amazon CloudWatch for storage and analysis. 

 **Implementation steps** 

1.  Identify a central location for telemetry storage for the applications in your workload. The location should support both collection of telemetry and analysis capabilities. Anomaly detection and automated insights are recommended features. 

   1.  [Amazon CloudWatch](https://aws.amazon.com/cloudwatch) provides telemetry collection, dashboards, analysis, and event generation capabilities. 

1.  To identify what telemetry you need, start by answering this question: what is the state of my application? Your application should emit logs and metrics that collectively answer this question. If you can’t answer the questions with the existing application telemetry, work with business and engineering stakeholders to create a list of telemetry requirements. 

   1.  You can request expert technical advice from your AWS account team as you identify and develop new application telemetry. 

1.  Once the additional application telemetry has been identified, work with your engineering stakeholders to instrument your application. 

   1.  The [AWS Distro for Open Telemetry](https://aws-otel.github.io/) provides APIs, libraries, and agents that collect application telemetry. [This example demonstrates how to instrument a JavaScript application with custom metrics](https://aws-otel.github.io/docs/getting-started/js-sdk/metric-manual-instr). 

   1.  If you want to understand the observability services that AWS offers, work through the [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) or request support from your AWS account team. 

   1.  For a deeper dive into application telemetry, read the [Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) article in the Amazon Builder’s Library, which explains how Amazon instruments applications and can serve as a guide for developing your own instrumentation guidelines. 

 **Level of effort for the implementation plan:** High. Instrumenting your application and centralizing telemetry storage can take significant investment. 

## Resources
<a name="resources"></a>

 **Related best practices:** 

[OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md) – Application telemetry is a component of workload telemetry. In order to understand the health of the overall workload you need to understand the health of individual applications that make up the workload. 

[OPS04-BP03 Implement user activity telemetry](ops_telemetry_customer_telemetry.md) – User activity telemetry is often a subset of application telemetry. User activity like add to cart events, click streams, or completed transactions provide insight into the user experience. 

[OPS04-BP04 Implement dependency telemetry](ops_telemetry_dependency_telemetry.md) – Dependency checks are related to application telemetry and may be instrumented into your application. If your application relies on external dependencies like DNS or a database your application can emit metrics and logs on reachability, timeouts, and other events. 

[OPS04-BP05 Implement transaction traceability](ops_telemetry_dist_trace.md) – Tracing transactions across a workload requires each application to emit information about how they process shared events. The way individual applications handle these events is emitted through their application telemetry. 

[OPS08-BP02 Define workload metrics](ops_workload_health_design_workload_metrics.md) – Workload metrics are the key health indicators for your workload. Key application metrics are a part of workload metrics. 

 **Related documents:** 
+  [AWS Builders Library – Instrumenting Distributed Systems for Operational Visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/) 
+  [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) 
+  [AWS Well-Architected Operational Excellence Whitepaper – Design Telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/design-telemetry.html) 
+  [Creating metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) 
+  [Implementing Logging and Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/welcome.html) 
+  [Monitoring application health and performance with AWS Distro for OpenTelemetry](https://aws.amazon.com/blogs/opensource/monitoring-application-health-and-performance-with-aws-distro-for-opentelemetry/) 
+  [New – How to better monitor your custom application metrics using Amazon CloudWatch Agent](https://aws.amazon.com/blogs/devops/new-how-to-better-monitor-your-custom-application-metrics-using-amazon-cloudwatch-agent/) 
+  [Observability at AWS](https://aws.amazon.com/products/management-and-governance/use-cases/monitoring-and-observability/) 
+  [Scenario – Publish metrics to CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/PublishMetrics.html) 
+  [Start Building – How to Monitor your Applications Effectively](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/) 
+  [Using CloudWatch with an AWS SDK](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/sdk-general-information-section.html) 

 **Related videos:** 
+  [AWS re:Invent 2021 - Observability the open-source way](https://www.youtube.com/watch?v=vAnIhIwE5hY) 
+  [Collect Metrics and Logs from Amazon EC2 instances with the CloudWatch Agent](https://www.youtube.com/watch?v=vAnIhIwE5hY) 
+  [How to Easily Setup Application Monitoring for Your AWS Workloads - AWS Online Tech Talks](https://www.youtube.com/watch?v=LKCth30RqnA) 
+  [Mastering Observability of Your Serverless Applications - AWS Online Tech Talks](https://www.youtube.com/watch?v=CtsiXhiAUq8) 
+  [Open Source Observability with AWS - AWS Virtual Workshop](https://www.youtube.com/watch?v=vAnIhIwE5hY) 

 **Related examples:** 
+  [AWS Logging & Monitoring Example Resources](https://github.com/aws-samples/logging-monitoring-apg-guide-examples) 
+  [AWS Solution: Amazon CloudWatch Monitoring Framework](https://aws.amazon.com/solutions/implementations/amazon-cloudwatch-monitoring-framework/?did=sl_card&trk=sl_card) 
+  [AWS Solution: Centralized Logging](https://aws.amazon.com/solutions/implementations/centralized-logging/) 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) 

 **Related services:** 
+ [ Amazon CloudWatch ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)

# OPS04-BP02 Implement and configure workload telemetry
<a name="ops_telemetry_workload_telemetry"></a>

 Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events. Use this information to help determine when a response is required. 

 Use a service such as [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to aggregate logs and metrics from workload components (for example, API logs from [AWS CloudTrail](https://aws.amazon.com/cloudtrail/), [AWS Lambda metrics](https://docs.aws.amazon.com/lambda/latest/dg/lambda-monitoring.html), [Amazon VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html), and [other services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/aws-services-sending-logs.html)). 

 **Common anti-patterns:** 
+  Your customers are complaining about poor performance. There are no recent changes to your application and so you suspect an issue with a workload component. You have no telemetry to analyze to determine what component or components are contributing to the poor performance. 
+  Your application is unreachable. You lack the telemetry to determine if it's a networking issue. 

 **Benefits of establishing this best practice:** Understanding what is going on inside your workload helps you to respond if necessary. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Implement log and metric telemetry: Instrument your workload to emit information about its internal state, status, and the achievement of business outcomes. Use this information to determine when a response is required. 
  +  [Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks](https://youtu.be/1Ck_me4azMw) 
  +  [How Amazon CloudWatch works](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html) 
  +  [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 
  +  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
  +  [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 
    +  Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events). 
      +  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
      +  [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) 
      +  [What Is AWS CloudTrail?](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) 
      +  [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) 
+  [Amazon CloudWatch Documentation](https://docs.aws.amazon.com/cloudwatch/index.html) 
+  [Amazon CloudWatch metrics and dimensions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) 
+  [How Amazon CloudWatch works](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_architecture.html) 
+  [Using Amazon CloudWatch metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) 
+  [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) 
+  [What Is AWS CloudTrail?](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) 
+  [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 
+  [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 

 **Related videos:** 
+  [Application Performance Management on AWS](https://www.youtube.com/watch?v=5T4stR-HFas) 
+  [Gaining Better Observability of Your VMs with Amazon CloudWatch](https://youtu.be/1Ck_me4azMw) 
+  [Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks](https://youtu.be/1Ck_me4azMw) 

# OPS04-BP03 Implement user activity telemetry
<a name="ops_telemetry_customer_telemetry"></a>

Instrument your application code to emit information about user activity. Examples of user activity include click streams or started, abandoned, and completed transactions. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required. Capturing real user activity allows you to build synthetic activity that can be used to monitor and test your workload in production.

 **Desired outcome:** 
+  Your workload emits telemetry about user activity across all applications. 
+  You leverage synthetic user activity to monitor your application during off-peak hours. 

 **Common anti-patterns:** 
+ Your developers have deployed a new feature without user telemetry. You cannot tell if your customers are using the feature without asking them. 
+ After a deployment to your front-end application, you see increased utilization. Because you lack user activity telemetry, it is difficult to identify the exact issue.
+  An issue occurs in your application during off-peak hours. You do not notice the issue until the morning when your users come online because you have not configured synthetic user activity. 

 **Benefits of establishing this best practice:** 
+  Understand common user patterns or unexpected behaviors to optimize functionality of the application to fit your business goals. 
+  Monitor the application from the perspective of your users to detect problems with user experience, such as broken links or slow click responses 
+  Identify the root cause of issues by tracing the steps your impacted user has taken. 
+  Synthetic user activity can provide early warning signs of performance degradation during off-peak hours, allowing you to take corrective action before actual users are affected. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Design your application code to emit information about user activity. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required. Utilize synthetic user activity to provide insight into application performance during off-peak hours. 

 **Customer example** 

 AnyCompany Retail implements user activity telemetry at several layers in their application. The front-end telemetry tracks pointer and movement events while the backend microservices emit telemetry tracking events like adding an item to the user's cart and checking out. Together they provide observability into the user experience. AnyCompany Retail also uses synthetic user telemetry to catch problems when there are fewer users on the workload. 

 **Implementation steps** 

1.  Instrument your application to emit telemetry (metrics, events, logs, and traces) about user activity. Once instrumented, front-end components emit telemetry automatically as the user interacts with the user interface. Backend applications emit telemetry on user events and transactions. 

   1.  [Amazon CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) can provide insight into end user experience for front-end applications. 

   1.  You can use the [AWS Distro for Open Telemetry](https://aws-otel.github.io/) to instrument and capture telemetry from your applications. 

   1.  [Amazon Pinpoint](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html) can analyze user behavior through campaigns, providing insight on user engagement. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop helps you build an observability strategy for your workload. 

1.  Establish synthetic user activity to monitor your application. Synthetic user activity simulates user actions to validate that your application is working properly. 

   1.  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) can simulate user activity using canaries. 

 **Level of effort for the implementation plan:** High. It may take significant development effort to fully instrument your application to collect user activity telemetry. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - Application telemetry is required in order to build in user activity telemetry. 
+  [OPS04-BP02 Implement and configure workload telemetry](ops_telemetry_workload_telemetry.md) - Some user activity telemetry may also be considered workload telemetry. 

 **Related documents:** 
+ [ How to Monitor your Applications Effectively ](https://aws.amazon.com/startups/start-building/how-to-monitor-applications/)

 **Related videos:** 
+ [AWS re:Invent 2020: Monitoring production services at Amazon ](https://www.youtube.com/watch?v=hnPcf_Czbvw)
+ [AWS re:Invent 2021 - Optimize applications through end user insights with Amazon CloudWatch RUM ](https://www.youtube.com/watch?v=NMaeujY9A9Y)
+ [ Testing and Monitoring APIs on AWS - AWS Online Tech Talks ](https://www.youtube.com/watch?v=VQM38CZyjFY)

 **Related examples:** 
+ [ Amazon CloudWatch RUM Web Client ](https://github.com/aws-observability/aws-rum-web)
+ [AWS Distro for Open Telemetry ](https://aws-otel.github.io/)
+ [ Implementing Real User Monitoring of Amplify Application using Amazon CloudWatch RUM ](https://aws.amazon.com/blogs/mobile/implementing-real-user-monitoring-of-amplify-application-using-amazon-cloudwatch-rum/)
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)

 **Related services:** 
+ [ Amazon CloudWatch RUM ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html)
+ [ Amazon CloudWatch Synthetics ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)
+ [ Amazon Pinpoint ](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html)

# OPS04-BP04 Implement dependency telemetry
<a name="ops_telemetry_dependency_telemetry"></a>

Design and configure your workload to emit information about the status of resources it depends on. These are resources that are external to your workload. Examples of external dependencies can include external databases, DNS, and network connectivity. Use this information to determine when a response is required and provide additional context on workload state.

 **Desired outcome:** 
+  Your workload emits telemetry about the status of external dependencies. 
+  You are notified when dependencies are unhealthy. 

 **Common anti-patterns:** 
+ Your users cannot reach your site. You are unable to determine if the reason is a DNS issue without manually performing a check to see if your DNS provider is working. 
+ Your shopping cart application is unable to complete transactions. You are unable to determine if it's a problem with your credit card processing provider without contacting them to verify. 

 **Benefits of establishing this best practice:** 
+  Monitoring external dependencies provides advance notice of issues. 
+  Awareness of the health of your dependencies assists in troubleshooting. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Work with stakeholders to identify external dependencies that your workload depends on. External dependencies can include external databases, APIs, or network connectivity between your workload and resources in other environments. Develop a monitoring strategy to provide awareness of the health of dependencies and proactively alarm if the status changes. 

 **Customer example** 

 AnyCompany Retail’s ecommerce workload relies on a database located in another environment. Every night, data is populated in the database for use in the ecommerce platform. The network connectivity and database support are owned by other teams. The ecommerce team configured several canary alarms to alert them when the network connectivity drops, the database is unreachable, and when the job fails to complete. 

 **Implementation steps** 

1.  Identify external dependencies that your workload relies on. Implement telemetry to track the health or reachability of dependencies. 

   1.  AWS customers can use the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) to monitor the health of AWS services and receive notifications of health events. 

   1.  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) can be used to monitor APIs, URLs, and website contents. 

1.  Set up alerts to notify your organization when a dependency is unhealthy or unreachable. 

   1.  Customers with Enterprise Support can request the [Building a Monitoring Strategy Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. This workshop will help you build an observability strategy for your workload. 

1.  Identify contacts for dependencies in cases where the dependency is unhealthy. Document how to contact the dependency owner, service agreements, and escalation process. 

 **Level of effort for the implementation plan:** Medium. Implementing dependency telemetry may require building custom monitoring solutions. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - You may build dependency monitoring into your application telemetry. 

 **Related documents:** 
+ [ Monitor your private internal endpoints 24x7 using CloudWatch Synthetics ](https://aws.amazon.com/blogs/mt/monitor-your-private-endpoints-using-cloudwatch-synthetics/)

 **Related videos:** 
+ [AWS re:Invent 2018: Monitor All Your Things: Amazon CloudWatch in Action with BBC ](https://www.youtube.com/watch?v=uuBuc6OAcVY)
+ [AWS re:Invent 2022 - Developing an observability strategy ](https://www.youtube.com/watch?v=Ub3ATriFapQ)
+ [AWS re:Invent 2022 - Observability best practices at Amazon ](https://www.youtube.com/watch?v=zZPzXEBW4P8)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Well-Architected Labs - Dependency Monitoring ](https://www.wellarchitectedlabs.com/operational-excellence/100_labs/100_dependency_monitoring/)

 **Related services:** 
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+ [AWS Health](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html)

# OPS04-BP05 Implement transaction traceability
<a name="ops_telemetry_dist_trace"></a>

Implement your application code and configure your workload components to emit events, which are started as a result of single logical operations and consolidated across various boundaries of your workload. Generate maps to see how traces flow across your workload and services. Gain insight into the relationships between components, and identify and analyze issues. Use the collected information to determine when a response is required and to assist you in identifying the factors contributing to an issue. 

 **Desired outcome:** 
+  Collect transaction traces across your workload to gain insight into the relationship between components. 
+  Generate maps to gain a better understanding of how transactions and events flow across your workload. 

 **Common anti-patterns:** 
+  You have implemented a serverless microservices architecture spanning multiple accounts. Your customers are experiencing intermittent performance issues. You are unable to discover which function or component is responsible because you lack transaction traceability. 
+ There is a performance bottleneck in your workload. Because you lack transaction traceability, you are unable to see the relationship between your application components and identify the bottleneck.
+  The identifier used for traces is not globally unique, resulting in a tracing collision when analyzing workload behavior. 

 **Benefits of establishing this best practice:** 
+  Understanding the flow of transactions across your workload provides insight into the expected behavior of your workload transactions. 
+  You can see variations from expected behavior across your workload and you can respond if necessary. 
+  You can pinpoint transactions by their unique generated identifier independent from where they were generated. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Design your application and workload to emit information about the flow of transactions across system components. Data to include in transactions are a globally unique transaction identifier, transaction stage, active component, and time to complete activity. Use this information to determine what is in progress, what is complete, and what the results of completed activities are. 

 **Customer example** 

 At AnyCompany Retail, all transactions have a globally unique UUID generated. This UUID is passed between microservices during transactions. The UUID is used to create transaction traces as users interact with the workload. A map of the workload topology is generated with the traces and is used to troubleshoot workload issues and improve performance. 

 **Implementation steps** 

1.  Instrument the applications in your workload to emit transaction traces. This can be done by generating a unique identifier for each transaction and passing the identifier between applications. 

   1.  You can use auto-instrumentation in the [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) to implement traces in your existing applications without modifying your application code. 

1.  Generate maps of your application topology. Use these maps to improve performance, gain insights, and aid in troubleshooting. 

   1.  [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) can generate maps of the applications in your workload. 

 **Level of effort for the implementation plan:** Medium. Implementing transaction traces may require moderate development effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Implement application telemetry](ops_telemetry_application_telemetry.md) - Application telemetry covers transaction traceability and handling and needs to be implementing first. 

 **Related documents:** 
+ [ Discover application issues and get notifications with AWS X-Ray Insights ](https://aws.amazon.com/blogs/mt/discover-application-issues-get-notifications-aws-x-ray-insights/)
+ [ How Wealthfront utilizes AWS X-Ray to analyze and debug distributed applications ](https://aws.amazon.com/blogs/mt/wealthfront-utilizes-aws-x-ray-analyze-debug-distributed-applications/)
+ [ New for AWS Distro for OpenTelemetry – Tracing Support is Now Generally Available ](https://aws.amazon.com/blogs/aws/new-for-aws-distro-for-opentelemetry-tracing-support-is-now-generally-available/)

 **Related videos:** 
+ [AWS re:Invent 2018: Deep Dive into AWS X-Ray: Monitor Modern Applications (DEV324) ](https://www.youtube.com/watch?v=5MQkX57eTh8)
+ [AWS re:Invent 2022 - Building observable applications with OpenTelemetry (BOA310) ](https://www.youtube.com/watch?v=efk8XFJrW2c)
+ [AWS re:Invent 2022 - Observability the open-source way (COP301-R) ](https://www.youtube.com/watch?v=2IJPpdp9xU0)
+ [ Capturing Trace Data with the AWS Distro for OpenTelemetry ](https://www.youtube.com/watch?v=837NtV0McOA)
+ [ Optimize Application Performance with AWS X-Ray](https://www.youtube.com/watch?v=5lIdNrrO_o8)

 **Related examples:** 
+ [AWS X-Ray Multi API Gateway Tracing Example ](https://github.com/aws-samples/aws-xray-multi-api-gateway-tracing-example)

 **Related services:** 
+  [AWS Distro for OpenTelemetry](https://aws-otel.github.io/) 
+  [AWS X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 

# OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
<a name="ops-05"></a>

 Adopt approaches that improve flow of changes into production, that activate refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and achieve rapid identification and remediation of issues introduced through deployment activities. 

**Topics**
+ [OPS05-BP01 Use version control](ops_dev_integ_version_control.md)
+ [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md)
+ [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md)
+ [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md)
+ [OPS05-BP05 Perform patch management](ops_dev_integ_patch_mgmt.md)
+ [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md)
+ [OPS05-BP07 Implement practices to improve code quality](ops_dev_integ_code_quality.md)
+ [OPS05-BP08 Use multiple environments](ops_dev_integ_multi_env.md)
+ [OPS05-BP09 Make frequent, small, reversible changes](ops_dev_integ_freq_sm_rev_chg.md)
+ [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md)

# OPS05-BP01 Use version control
<a name="ops_dev_integ_version_control"></a>

 Use version control to activate tracking of changes and releases. 

 Many AWS services offer version control capabilities. Use a revision or source control system such as [AWS CodeCommit](https://aws.amazon.com/codecommit/) to manage code and other artifacts, such as version-controlled [AWS CloudFormation](https://aws.amazon.com/cloudformation/) templates of your infrastructure. 

 **Common anti-patterns:** 
+  You have been developing and storing your code on your workstation. You have had an unrecoverable storage failure on the workstation your code is lost. 
+  After overwriting the existing code with your changes, you restart your application and it is no longer operable. You are unable to revert to the change. 
+  You have a write lock on a report file that someone else needs to edit. They contact you asking that you stop work on it so that they can complete their tasks. 
+  Your research team has been working on a detailed analysis that will shape your future work. Someone has accidentally saved their shopping list over the final report. You are unable to revert the change and will have to recreate the report. 

 **Benefits of establishing this best practice:** By using version control capabilities you can easily revert to known good states, previous versions, and limit the risk of assets being lost. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use version control: Maintain assets in version controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure). Integrate the version control capabilities of your configuration management systems into your procedures. 
  +  [Introduction to AWS CodeCommit](https://youtu.be/46PRLMW8otg) 
  +  [What is AWS CodeCommit?](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [What is AWS CodeCommit?](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Introduction to AWS CodeCommit](https://youtu.be/46PRLMW8otg) 

# OPS05-BP02 Test and validate changes
<a name="ops_dev_integ_test_val_chg"></a>

 Every change deployed must be tested to avoid errors in production. This best practice is focused on testing changes from version control to artifact build. Besides application code changes, testing should include infrastructure, configuration, security controls, and operations procedures. Testing takes many forms, from unit tests to software component analysis (SCA). Move tests further to the left in the software integration and delivery process results in higher certainty of artifact quality. 

 Your organization must develop testing standards for all software artifacts. Automated tests reduce toil and avoid manual test errors. Manual tests may be necessary in some cases. Developers must have access to automated test results to create feedback loops that improve software quality. 

 **Desired outcome:** 
+  All software changes are tested before they are delivered. 
+  Developers have access to test results. 
+  Your organization has a testing standard that applies to all software changes. 

 **Common anti-patterns:** 
+ You deploy a new software change without any tests. It fails to run in production, which leads to an outage.
+ New security groups are deployed with CloudFormation without being tested in a pre-production environment. The security groups make your app unreachable for your customers.
+ A method is modified but there are no unit tests. The software fails when it is deployed to production.

 **Benefits of establishing this best practice:** 
+  The change fail rate of software deployments is reduced. 
+  Software quality is improved. 
+  Developers have increased awareness on the viability of their code. 
+  Security policies can be rolled out with confidence to support organization's compliance 
+  Infrastructure changes such as automatic scaling policy updates are tested in advance to meet traffic needs. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Testing is done on all changes, from application code to infrastructure, as part of your continuous integration practice. Test results are published so that developers have fast feedback. Your organization has a testing standard that all changes must pass. 

 **Customer example** 

 As part of their continuous integration pipeline, AnyCompany Retail conducts several types of tests on all software artifacts. They practice test driven development so all software has unit tests. Once the artifact is built, they run end-to-end tests. After this first round of tests is complete, they run a static application security scan, which looks for known vulnerabilities. Developers receive messages as each testing gate is passed. Once all tests are complete, the software artifact is stored in an artifact repository. 

 **Implementation steps** 

1.  Work with stakeholders in your organization to develop a testing standard for software artifacts. What standard tests should all artifacts pass? Are there compliance or governance requirements that must be included in the test coverage? Do you need to conduct code quality tests? When tests complete, who needs to know? 

   1.  The [AWS Deployment Pipeline Reference Architecture](https://pipelines.devops.aws.dev/) contains an authoritative list of types of tests that can be conducted on software artifacts as part of an integration pipeline. 

1.  Instrument your application with the necessary tests based on your software testing standard. Each set of tests should complete in under ten minutes. Tests should run as part of an integration pipeline. 

   1.  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) can test your application code for defects. 

   1.  You can use [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) to conduct tests on software artifacts. 

   1.  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) can orchestrate your software tests into a pipeline. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP01 Use version control](ops_dev_integ_version_control.md) - All software artifacts must be backed by a version-controlled repository. 
+  [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md) - Your organizations software testing standards inform your design standards. 
+  [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md) - Software tests should be automatically run as part of your larger integration and deployment pipeline. 

 **Related documents:** 
+ [ Adopt a test-driven development approach ](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html)
+ [ Automated CloudFormation Testing Pipeline with TaskCat and CodePipeline ](https://aws.amazon.com/blogs/devops/automated-cloudformation-testing-pipeline-with-taskcat-and-codepipeline/)
+ [ Building end-to-end AWS DevSecOps CI/CD pipeline with open source SCA, SAST, and DAST tools ](https://aws.amazon.com/blogs/devops/building-end-to-end-aws-devsecops-ci-cd-pipeline-with-open-source-sca-sast-and-dast-tools/)
+ [ Getting started with testing serverless applications ](https://aws.amazon.com/blogs/compute/getting-started-with-testing-serverless-applications/)
+ [ My CI/CD pipeline is my release captain ](https://aws.amazon.com/builders-library/cicd-pipeline/)
+ [ Practicing Continuous Integration and Continuous Delivery on AWS Whitepaper ](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/welcome.html)

 **Related videos:** 
+ [AWS re:Invent 2020: Testable infrastructure: Integration testing on AWS](https://www.youtube.com/watch?v=KJC380Juo2w)
+ [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development ](https://www.youtube.com/watch?v=1R7G_wcyd3s)
+ [ Testing Your Infrastructure as Code with AWS CDK ](https://www.youtube.com/watch?v=fWtuwGSoSOU)

 **Related resources:** 
+ [AWS Deployment Pipeline Reference Architecture - Application ](https://pipelines.devops.aws.dev/application-pipeline/index.html)
+ [AWS Kubernetes DevSecOps Pipeline ](https://github.com/aws-samples/devsecops-cicd-containers)
+ [ Policy as Code Workshop – Test Driven Development ](https://catalog.us-east-1.prod.workshops.aws/workshops/9da471a0-266a-4d36-8596-e5934aeedd1f/en-US/pac-tools/cfn-guard/tdd)
+ [ Run unit tests for a Node.js application from GitHub by using AWS CodeBuild](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-unit-tests-for-a-node-js-application-from-github-by-using-aws-codebuild.html)
+ [ Use Serverspec for test-driven development of infrastructure code ](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/use-serverspec-for-test-driven-development-of-infrastructure-code.html)

 **Related services:** 
+  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) 
+  [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 

# OPS05-BP03 Use configuration management systems
<a name="ops_dev_integ_conf_mgmt_sys"></a>

 Use configuration management systems to make and track configuration changes. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

 Static configuration management sets values when initializing a resource that are expected to remain consistent throughout the resource’s lifetime. Some examples include setting the configuration for a web or application server on an instance, or defining the configuration of an AWS service within the [AWS Management Console](https://docs.aws.amazon.com/awsconsolehelpdocs/index.html) or through the [AWS CLI](https://aws.amazon.com/cli/). 

 Dynamic configuration management sets values at initialization that can or are expected to change during the lifetime of a resource. For example, you could set a feature toggle to activate functionality in your code through a configuration change, or change the level of log detail during an incident to capture more data and then change back following the incident eliminating the now unnecessary logs and their associated expense. 

 If you have dynamic configurations in your applications running on instances, containers, serverless functions, or devices, you can use [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) to manage and deploy them across your environments. 

 On AWS, you can use [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to continuously monitor your AWS resource configurations [across accounts and Regions](https://docs.aws.amazon.com/config/latest/developerguide/aggregate-data.html). It helps you to track their configuration history, understand how a configuration change would affect other resources, and audit them against expected or desired configurations using [AWS Config Rules](https://docs.aws.amazon.com/config/latest/developerguide/evaluate-config.html) and [AWS Config Conformance Packs](https://docs.aws.amazon.com/config/latest/developerguide/conformance-packs.html). 

 On AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using services such as [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) (for example, AWS CodeCommit, [AWS CodeBuild](https://aws.amazon.com/codebuild/), [AWS CodePipeline](https://aws.amazon.com/codepipeline/), [AWS CodeDeploy](https://aws.amazon.com/codedeploy/), and [AWS CodeStar](https://aws.amazon.com/codestar/)). 

 Have a change calendar and track when significant business or operational activities or events are planned that may be impacted by implementation of change. Adjust activities to manage risk around those plans. [AWS Systems Manager Change Calendar](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-change-calendar.html) provides a mechanism to document blocks of time as open or closed to changes and why, and [share that information](https://docs.aws.amazon.com/systems-manager/latest/userguide/change-calendar-share.html) with other AWS accounts. AWS Systems Manager Automation scripts can be configured to adhere to the change calendar state. 

 [AWS Systems Manager Maintenance Windows](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html) can be used to schedule the performance of AWS SSM Run Command or Automation scripts, AWS Lambda invocations, or AWS Step Functions activities at specified times. Mark these activities in your change calendar so that they can be included in your evaluation. 

 **Common anti-patterns:** 
+  You manually update the web server configuration across your fleet and a number of servers become unresponsive due to update errors. 
+  You manually update your application server fleet over the course of many hours. The inconsistency in configuration during the change causes unexpected behaviors. 
+  Someone has updated your security groups and your web servers are no longer accessible. Without knowledge of what was changed you spend significant time investigating the issue extending your time to recovery. 

 **Benefits of establishing this best practice:** Adopting configuration management systems reduces the level of effort to make and track changes, and the frequency of errors caused by manual procedures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use configuration management systems: Use configuration management systems to track and implement changes, to reduce errors caused by manual processes, and reduce the level of effort. 
  +  [Infrastructure configuration management](https://aws.amazon.com/answers/configuration-management/aws-infrastructure-configuration-management/) 
  +  [AWS Config](https://aws.amazon.com/config/) 
  +  [What is AWS Config?](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) 
  +  [Introduction to AWS CloudFormation](https://youtu.be/Omppm_YUG2g) 
  +  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
  +  [AWS OpsWorks](https://aws.amazon.com/opsworks/) 
  +  [What is AWS OpsWorks?](https://docs.aws.amazon.com/opsworks/latest/userguide/welcome.html) 
  +  [Introduction to AWS Elastic Beanstalk](https://youtu.be/SrwxAScdyT0) 
  +  [What is AWS Elastic Beanstalk?](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [AWS OpsWorks](https://aws.amazon.com/opsworks/) 
+  [AWS Systems Manager Change Calendar](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-change-calendar.html) 
+  [AWS Systems Manager Maintenance Windows](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html) 
+  [Infrastructure configuration management](https://aws.amazon.com/answers/configuration-management/aws-infrastructure-configuration-management/) 
+  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
+  [What is AWS Config?](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) 
+  [What is AWS Elastic Beanstalk?](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) 
+  [What is AWS OpsWorks?](https://docs.aws.amazon.com/opsworks/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Introduction to AWS CloudFormation](https://youtu.be/Omppm_YUG2g) 
+  [Introduction to AWS Elastic Beanstalk](https://youtu.be/SrwxAScdyT0) 

# OPS05-BP04 Use build and deployment management systems
<a name="ops_dev_integ_build_mgmt_sys"></a>

 Use build and deployment management systems. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

 In AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using services such as [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) (for example, AWS CodeCommit, [AWS CodeBuild](https://aws.amazon.com/codebuild/), [AWS CodePipeline](https://aws.amazon.com/codepipeline/), [AWS CodeDeploy](https://aws.amazon.com/codedeploy/), and [AWS CodeStar](https://aws.amazon.com/codestar/)). 

 **Common anti-patterns:** 
+  After compiling your code on your development system you, copy the executable onto your production systems and it fails to start. The local log files indicates that it has failed due to missing dependencies. 
+  You successfully build your application with new features in your development environment and provide the code to Quality Assurance (QA). It fails QA because it is missing static assets. 
+  On Friday, after much effort, you successfully built your application manually in your development environment including your newly coded features. On Monday, you are unable to repeat the steps that allowed you to successfully build your application. 
+  You perform the tests you have created for your new release. Then you spend the next week setting up a test environment and performing all the existing integration tests followed by the performance tests. The new code has an unacceptable performance impact and must be redeveloped and then retested. 

 **Benefits of establishing this best practice:** By providing mechanisms to manage build and deployment activities you reduce the level of effort to perform repetitive tasks, free your team members to focus on their high value creative tasks, and limit the introduction of error from manual procedures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use build and deployment management systems: Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, and reduces the level of effort. 
  +  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
  +  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
  +  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
  +  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
  +  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
+  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
+  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 

# OPS05-BP05 Perform patch management
<a name="ops_dev_integ_patch_mgmt"></a>

 Perform patch management to gain features, address issues, and remain compliant with governance. Automate patch management to reduce errors caused by manual processes, and reduce the level of effort to patch. 

 Patch and vulnerability management are part of your benefit and risk management activities. It is preferable to have immutable infrastructures and deploy workloads in verified known good states. Where that is not viable, patching in place is the remaining option. 

 Updating machine images, container images, or Lambda [custom runtimes and additional libraries](https://docs.aws.amazon.com/lambda/latest/dg/security-configuration.html) to remove vulnerabilities are part of patch management. You should manage updates to [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) (AMIs) for Linux or Windows Server images using [EC2 Image Builder](https://aws.amazon.com/image-builder/). You can use [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) with your existing pipeline to [manage Amazon ECS images](https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_on_ECS.html) and [manage Amazon EKS images](https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_on_EKS.html). AWS Lambda includes [version](https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html) management features. 

 Patching should not be performed on production systems without first testing in a safe environment. Patches should only be applied if they support an operational or business outcome. On AWS, you can use [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) to automate the process of patching managed systems and schedule the activity using [AWS Systems Manager Maintenance Windows](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html). 

 **Common anti-patterns:** 
+  You are given a mandate to apply all new security patches within two hours resulting in multiple outages due to application incompatibility with patches. 
+  An unpatched library results in unintended consequences as unknown parties use vulnerabilities within it to access your workload. 
+  You patch the developer environments automatically without notifying the developers. You receive multiple complaints from the developers that their environment cease to operate as expected. 
+  You have not patched the commercial off-the-self software on a persistent instance. When you have an issue with the software and contact the vendor, they notify you that version is not supported and you will have to patch to a specific level to receive any assistance. 
+  A recently released patch for the encryption software you used has significant performance improvements. Your unpatched system has performance issues that remain in place as a result of not patching. 

 **Benefits of establishing this best practice:** By establishing a patch management process, including your criteria for patching and methodology for distribution across your environments, you will be able to realize their benefits and control their impact. This will encourage the adoption of desired features and capabilities, the removal of issues, and sustained compliance with governance. Implement patch management systems and automation to reduce the level of effort to deploy patches and limit errors caused by manual processes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Patch management: Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant with governance policy and vendor support requirements. In immutable systems, deploy with the appropriate patch set to achieve the desired result. Automate the patch management mechanism to reduce the elapsed time to patch, to reduce errors caused by manual processes, and reduce the level of effort to patch. 
  +  [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 

 **Related videos:** 
+  [CI/CD for Serverless Applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
+  [Design with Ops in Mind](https://youtu.be/uh19jfW7hw4) 

   **Related examples:** 
+  [Well-Architected Labs – Inventory and Patch Management](https://wellarchitectedlabs.com/operational-excellence/100_labs/100_inventory_patch_management/) 

# OPS05-BP06 Share design standards
<a name="ops_dev_integ_share_design_stds"></a>

Share best practices across teams to increase awareness and maximize the benefits of development efforts. Document them and keep them up to date as your architecture evolves. If shared standards are enforced in your organization, it’s critical that mechanisms exist to request additions, changes, and exceptions to standards. Without this option, standards become a constraint on innovation. 

 **Desired outcome:** 
+  Design standards are shared across teams in your organizations. 
+  They are documented and kept up to date as best practices evolve. 

 **Common anti-patterns:** 
+ Two development teams have each created a user authentication service. Your users must maintain a separate set of credentials for each part of the system they want to access. 
+ Each team manages their own infrastructure. A new compliance requirement forces a change to your infrastructure and each team implements it in a different way.

 **Benefits of establishing this best practice:** 
+  Using shared standards supports the adoption of best practices and to maximizes the benefits of development efforts. 
+  Documenting and updating design standards keeps your organization up to date with best practices and security and compliance requirements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Share existing best practices, design standards, checklists, operating procedures, guidance, and governance requirements across teams. Have procedures to request changes, additions, and exceptions to design standards to support improvement and innovation. Make teams are aware of published content. Have a mechanism to keep design standards up to date as new best practices emerge. 

 **Customer example** 

 AnyCompany Retail has a cross-functional architecture team that creates software architecture patterns. This team builds the architecture with compliance and governance built in. Teams that adopt these shared standards get the benefits of having compliance and governance built in. They can quickly build on top of the design standard. The architecture team meets quarterly to evaluate architecture patterns and update them if necessary. 

 **Implementation steps** 

1.  Identify a cross-functional team that will own developing and updating design standards. This team will work with stakeholders across your organization to develop design standards, operating procedures, checklists, guidance, and governance requirements. Document the design standards and share them within your organization. 

   1.  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) can be used to create portfolios representing design standards using infrastructure as code. You can share portfolios across accounts. 

1.  Have a mechanism in place to keep design standards up to date as new best practices are identified. 

1.  If design standards are centrally enforced, have a process to request changes, updates, and exemptions. 

 **Level of effort for the implementation plan:** Medium. Developing a process to create and share design standards can take coordination and cooperation with stakeholders across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements influence design standards. 
+  [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) - Compliance is a vital input in creating design standards. 
+  [OPS07-BP02 Ensure a consistent review of operational readiness](ops_ready_to_support_const_orr.md) - Operational readiness checklists are a mechanism to implement design standards when designing your workload. 
+  [OPS11-BP01 Have a process for continuous improvement](ops_evolve_ops_process_cont_imp.md) - Updating design standards is a part of continuous improvement. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - As part of your knowledge management practice, document and share design standards. 

 **Related documents:** 
+ [ Automate AWS Backups with AWS Service Catalog](https://aws.amazon.com/blogs/mt/automate-aws-backups-with-aws-service-catalog/)
+ [AWS Service Catalog Account Factory-Enhanced ](https://aws.amazon.com/blogs/mt/aws-service-catalog-account-factory-enhanced/)
+ [ How Expedia Group built Database as a Service (DBaaS) offering using AWS Service Catalog](https://aws.amazon.com/blogs/mt/how-expedia-group-built-database-as-a-service-dbaas-offering-using-aws-service-catalog/)
+ [ Maintain visibility over the use of cloud architecture patterns ](https://aws.amazon.com/blogs/architecture/maintain-visibility-over-the-use-of-cloud-architecture-patterns/)
+ [ Simplify sharing your AWS Service Catalog portfolios in an AWS Organizations setup ](https://aws.amazon.com/blogs/mt/simplify-sharing-your-aws-service-catalog-portfolios-in-an-aws-organizations-setup/)

 **Related videos:** 
+ [AWS Service Catalog – Getting Started ](https://www.youtube.com/watch?v=A9kKy6WhqVA)
+ [AWS re:Invent 2020: Manage your AWS Service Catalog portfolios like an expert ](https://www.youtube.com/watch?v=lVfXkWHAtR8)

 **Related examples:** 
+ [AWS Service Catalog Reference Architecture ](https://github.com/aws-samples/aws-service-catalog-reference-architectures)
+ [AWS Service Catalog Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/d40750d7-a330-49be-9945-cde864610de9/en-US)

 **Related services:** 
+  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) 

# OPS05-BP07 Implement practices to improve code quality
<a name="ops_dev_integ_code_quality"></a>

Implement practices to improve code quality and minimize defects. Some examples include test-driven development, code reviews, standards adoption, and pair programming. Incorporate these practices into your continuous integration and delivery process. 

 **Desired outcome:** 
+  Your organization uses best practices like code reviews or pair programming to improve code quality. 
+  Developers and operators adopt code quality best practices as part of the software development lifecycle. 

 **Common anti-patterns:** 
+ You commit code to the main branch of your application without a code review. The change automatically deploys to production and causes an outage.
+  A new application is developed without any unit, end-to-end, or integration tests. There is no way to test the application before deployment. 
+  Your teams make manual changes in production to address defects. Changes do not go through testing or code reviews and are not captured or logged through continuous integration and delivery processes. 

 **Benefits of establishing this best practice:** 
+  By adopting practices to improve code quality, you can help minimize issues introduced to production. 
+  Code quality increases using best practices like pair programming and code reviews. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement practices to improve code quality to minimize defects before they are deployed. Use practices like test-driven development, code reviews, and pair programming to increase the quality of your development. 

 **Customer example** 

 AnyCompany Retail adopts several practices to improve code quality. They have adopted test-driven development as the standard for writing applications. For some new features, they will have developers pair program together during a sprint. Every pull request goes through a code review by a senior developer before being integrated and deployed. 

 **Implementation steps** 

1.  Adopt code quality practices like test-driven development, code reviews, and pair programming into your continuous integration and delivery process. Use these techniques to improve software quality. 

   1.  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) can provide programming recommendations for Java and Python code using machine learning. 

   1.  You can create shared development environments with [AWS Cloud9](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) where you can collaborate on developing code. 

 **Level of effort for the implementation plan:** Medium. There are many ways of implementing this best practice, but getting organizational adoption may be challenging. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP06 Share design standards](ops_dev_integ_share_design_stds.md) - You can share design standards as part of your code quality practice. 

 **Related documents:** 
+ [ Agile Software Guide ](https://martinfowler.com/agile.html)
+ [My CI/CD pipeline is my release captain](https://aws.amazon.com/builders-library/cicd-pipeline/)
+ [ Automate code reviews with Amazon CodeGuru Reviewer ](https://aws.amazon.com/blogs/devops/automate-code-reviews-with-amazon-codeguru-reviewer/)
+ [ Adopt a test-driven development approach ](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html)
+ [ How DevFactory builds better applications with Amazon CodeGuru ](https://aws.amazon.com/blogs/machine-learning/how-devfactory-builds-better-applications-with-amazon-codeguru/)
+ [ On Pair Programming ](https://martinfowler.com/articles/on-pair-programming.html)
+ [ RENGA Inc. automates code reviews with Amazon CodeGuru ](https://aws.amazon.com/blogs/machine-learning/renga-inc-automates-code-reviews-with-amazon-codeguru/)
+ [ The Art of Agile Development: Test-Driven Development ](http://www.jamesshore.com/v2/books/aoad1/test_driven_development)
+ [ Why code reviews matter (and actually save time\$1) ](https://www.atlassian.com/agile/software-development/code-reviews)

 **Related videos:** 
+ [AWS re:Invent 2020: Continuous improvement of code quality with Amazon CodeGuru ](https://www.youtube.com/watch?v=iX1i35H1OVw)
+ [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development ](https://www.youtube.com/watch?v=1R7G_wcyd3s)

 **Related services:** 
+ [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html)
+ [ Amazon CodeGuru Profiler ](https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html)
+  [AWS Cloud9](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) 

# OPS05-BP08 Use multiple environments
<a name="ops_dev_integ_multi_env"></a>

 Use multiple environments to experiment, develop, and test your workload. Use increasing levels of controls as environments approach production to gain confidence your workload will operate as intended when deployed. 

 **Common anti-patterns:** 
+  You are performing development in a shared development environment and another developer overwrites your code changes. 
+  The restrictive security controls on your shared development environment are preventing you from experimenting with new services and features. 
+  You perform load testing on your production systems and cause an outage for your users. 
+  A critical error resulting in data loss has occurred in production. In your production environment, you attempt to recreate the conditions that lead to the data loss so that you can identify how it happened and prevent it from happening again. To prevent further data loss during testing, you are forced to make the application unavailable to your users. 
+  You are operating a multi-tenant service and are unable to support a customer request for a dedicated environment. 
+  You may not always test, but when you do it’s in production. 
+  You believe that the simplicity of a single environment overrides the scope of impact of changes within the environment. 

 **Benefits of establishing this best practice:** By deploying multiple environments you can support multiple simultaneous development, testing, and production environments without creating conflicts between developers or user communities. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use multiple environments: Provide developers sandbox environments with minimized controls to aid in experimentation. Provide individual development environments to help work in parallel, increasing development agility. Implement more rigorous controls in the environments approaching production to allow developers to innovate. Use infrastructure as code and configuration management systems to deploy environments that are configured consistent with the controls present in production to ensure systems operate as expected when deployed. When environments are not in use, turn them off to avoid costs associated with idle resources (for example, development systems on evenings and weekends). Deploy production equivalent environments when load testing to improve valid results. 
  +  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
  +  [How do I stop and start Amazon EC2 instances at regular intervals using AWS Lambda?](https://aws.amazon.com/premiumsupport/knowledge-center/start-stop-lambda-cloudwatch/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [How do I stop and start Amazon EC2 instances at regular intervals using AWS Lambda?](https://aws.amazon.com/premiumsupport/knowledge-center/start-stop-lambda-cloudwatch/) 
+  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 

# OPS05-BP09 Make frequent, small, reversible changes
<a name="ops_dev_integ_freq_sm_rev_chg"></a>

 Frequent, small, and reversible changes reduce the scope and impact of a change. This eases troubleshooting, helps with faster remediation, and provides the option to roll back a change. 

 **Common anti-patterns:** 
+  You deploy a new version of your application quarterly. 
+  You frequently make changes to your database schema. 
+  You perform manual in-place updates, overwriting existing installations and configurations. 

 **Benefits of establishing this best practice:** You recognize benefits from development efforts faster by deploying small changes frequently. When the changes are small, it is much easier to identify if they have unintended consequences. When the changes are reversible, there is less risk to implementing the change as recovery is simplified. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Make frequent, small, reversible changes: Frequent, small, and reversible changes reduce the scope and impact of a change. This eases troubleshooting, helps with faster remediation, and provides the option to roll back a change. It also increases the rate at which you can deliver value to the business. 

# OPS05-BP10 Fully automate integration and deployment
<a name="ops_dev_integ_auto_integ_deploy"></a>

 Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes. 

 Apply metadata using [Resource Tags](https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html) and [AWS Resource Groups](https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html) following a consistent [tagging strategy](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/) to aid in identification of your resources. Tag your resources for organization, cost accounting, access controls, and targeting the run of automated operations activities. 

 **Common anti-patterns:** 
+  On Friday you, finish authoring the new code for your feature branch. On Monday, after running your code quality test scripts and each of your unit tests scripts, you will check in your code for the next scheduled release. 
+  You are assigned to code a fix for a critical issue impacting a large number of customers in production. After testing the fix, you commit your code and email change management to request approval to deploy it to production. 

 **Benefits of establishing this best practice:** By implementing automated build and deployment management systems, you reduce errors caused by manual processes and reduce the effort to deploy changes helping your team members to focus on delivering business value. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use build and deployment management systems: Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, and reduces the level of effort. 
  +  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
  +  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
  +  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
  +  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
  +  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
+  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
+  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 

# OPS 6. How do you mitigate deployment risks?
<a name="ops-06"></a>

 Adopt approaches that provide fast feedback on quality and achieve rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes. 

**Topics**
+ [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md)
+ [OPS06-BP02 Test and validate changes](ops_mit_deploy_risks_test_val_chg.md)
+ [OPS06-BP03 Use deployment management systems](ops_mit_deploy_risks_deploy_mgmt_sys.md)
+ [OPS06-BP04 Test using limited deployments](ops_mit_deploy_risks_test_limited_deploy.md)
+ [OPS06-BP05 Deploy using parallel environments](ops_mit_deploy_risks_deploy_to_parallel_env.md)
+ [OPS06-BP06 Deploy frequent, small, reversible changes](ops_mit_deploy_risks_freq_sm_rev_chg.md)
+ [OPS06-BP07 Fully automate integration and deployment](ops_mit_deploy_risks_auto_integ_deploy.md)
+ [OPS06-BP08 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md)

# OPS06-BP01 Plan for unsuccessful changes
<a name="ops_mit_deploy_risks_plan_for_unsucessful_changes"></a>

 Plan to revert to a known good state, or remediate in the production environment if a change does not have the desired outcome. This preparation reduces recovery time through faster responses. 

 **Common anti-patterns:** 
+  You performed a deployment and your application has become unstable but there appear to be active users on the system. You have to decide whether to roll back the change and impact the active users or wait to roll back the change knowing the users may be impacted regardless. 
+  After making a routine change, your new environments are accessible but one of your subnets has become unreachable. You have to decide whether to roll back everything or try to fix the inaccessible subnet. While you are making that determination, the subnet remains unreachable. 

 **Benefits of establishing this best practice:** Having a plan in place reduces the mean time to recover (MTTR) from unsuccessful changes, reducing the impact to your end users. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Plan for unsuccessful changes: Plan to revert to a known good state (that is, roll back the change), or remediate in the production environment (that is, roll forward the change) if a change does not have the desired outcome. When you identify changes that you cannot roll back if unsuccessful, apply due diligence prior to committing the change. 

# OPS06-BP02 Test and validate changes
<a name="ops_mit_deploy_risks_test_val_chg"></a>

 Test changes and validate the results at all lifecycle stages to confirm new features and minimize the risk and impact of failed deployments. 

 On AWS, you can create temporary parallel environments to lower the risk, effort, and cost of experimentation and testing. Automate the deployment of these environments using [AWS CloudFormation](https://aws.amazon.com/cloudformation/) to ensure consistent implementations of your temporary environments. 

 **Common anti-patterns:** 
+  You deploy a cool new feature to your application. It doesn't work. You don't know. 
+  You update your certificates. You accidentally install the certificates to the wrong components. You don't know. 

 **Benefits of establishing this best practice:** By testing and validating changes following deployment you are able to identify issues early providing an opportunity to mitigate the impact on your customers. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Test and validate changes: Test changes and validate the results at all lifecycle stages (for example, development, test, and production), to confirm new features and minimize the risk and impact of failed deployments. 
  +  [AWS Cloud9](https://aws.amazon.com/cloud9/) 
  +  [What is AWS Cloud9?](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) 
  +  [How to test and debug AWS CodeDeploy locally before you ship your code](https://aws.amazon.com/blogs/devops/how-to-test-and-debug-aws-codedeploy-locally-before-you-ship-your-code/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Cloud9](https://aws.amazon.com/cloud9/) 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [How to test and debug AWS CodeDeploy locally before you ship your code](https://aws.amazon.com/blogs/devops/how-to-test-and-debug-aws-codedeploy-locally-before-you-ship-your-code/) 
+  [What is AWS Cloud9?](https://docs.aws.amazon.com/cloud9/latest/user-guide/welcome.html) 

# OPS06-BP03 Use deployment management systems
<a name="ops_mit_deploy_risks_deploy_mgmt_sys"></a>

 Use deployment management systems to track and implement change. This reduces errors caused by manual processes and reduces the effort to deploy changes. 

 In AWS, you can build Continuous Integration/Continuous Deployment (CI/CD) pipelines using services such as [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) (for example, AWS CodeCommit, [AWS CodeBuild](https://aws.amazon.com/codebuild/), [AWS CodePipeline](https://aws.amazon.com/codepipeline/), [AWS CodeDeploy](https://aws.amazon.com/codedeploy/), and [AWS CodeStar](https://aws.amazon.com/codestar/)). 

 **Common anti-patterns:** 
+  You manually deploy updates to the application servers across your fleet and a number of servers become unresponsive due to update errors. 
+  You manually deploy to your application server fleet over the course of many hours. The inconsistency in versions during the change causes unexpected behaviors. 

 **Benefits of establishing this best practice:** Adopting deployment management systems reduces the level of effort to deploy changes, and the frequency of errors caused by manual procedures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use deployment management systems: Use deployment management systems to track and implement change. This will reduce errors caused by manual processes, and reduce the level of effort to deploy changes. Automate the integration and deployment pipeline from code check-in through testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, and further reduces the level of effort. 
  +  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
  +  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
  +  [What is AWS Elastic Beanstalk?](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) 
  +  [What is Amazon API Gateway?](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
+  [What is AWS Elastic Beanstalk?](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) 
+  [What is Amazon API Gateway?](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html) 

 **Related videos:** 
+  [Deep Dive on Advanced Continuous Delivery Techniques Using AWS](https://www.youtube.com/watch?v=Lrrgd0Kemhw) 
+  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 

# OPS06-BP04 Test using limited deployments
<a name="ops_mit_deploy_risks_test_limited_deploy"></a>

 Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments. 

 **Common anti-patterns:** 
+  You deploy an unsuccessful change to all of production all at once. You don't know. 

 **Benefits of establishing this best practice:** By testing and validating changes following limited deployment you are able to identify issues early with minimal impact on your customers providing an opportunity to further mitigate the impact on your customers. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Test using limited deployments: Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments. 
  +  [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
  +  [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html) 
  +  [Set up an API Gateway canary release deployment](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html) 
  +  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 
  +  [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
+  [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html) 
+  [Set up an API Gateway canary release deployment](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html) 
+  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 
+  [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html) 

# OPS06-BP05 Deploy using parallel environments
<a name="ops_mit_deploy_risks_deploy_to_parallel_env"></a>

 Implement changes onto parallel environments, and then transition over to the new environment. Maintain the prior environment until there is confirmation of successful deployment. Doing so minimizes recovery time by permitting rollback to the previous environment. 

 **Common anti-patterns:** 
+  You perform a mutable deployment by modifying your existing systems. After discovering that the change was unsuccessful, you are forced to modify the systems again to restore the old version extending your time to recovery. 
+  During a maintenance window, you decommission the old environment and then start building your new environment. Many hours into the procedure, you discover unrecoverable issues with the deployment. While extremely tired, you are forced to find the previous deployment procedures and start rebuilding the old environment. 

 **Benefits of establishing this best practice:** By using parallel environments, you can pre-deploy the new environment and transition over to them when desired. If the new environment is not successful, you can recover quickly by transitioning back to your original environment. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Deploy using parallel environments: Implement changes onto parallel environments, and transition or cut over to the new environment. Maintain the prior environment until there is confirmation of successful deployment. This minimizes recovery time by permitting rollback to the previous environment. For example, use immutable infrastructures with blue/green deployments. 
  +  [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html) 
  +  [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html) 
  +  [Set up an API Gateway canary release deployment](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html) 
  +  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
+  [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html) 
+  [Set up an API Gateway canary release deployment](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html) 
+  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 
+  [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html) 

 **Related videos:** 
+  [Deep Dive on Advanced Continuous Delivery Techniques Using AWS](https://www.youtube.com/watch?v=Lrrgd0Kemhw) 

# OPS06-BP06 Deploy frequent, small, reversible changes
<a name="ops_mit_deploy_risks_freq_sm_rev_chg"></a>

 Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change. 

 **Common anti-patterns:** 
+  You deploy a new version of your application quarterly. 
+  You frequently make changes to your database schema. 
+  You perform manual in-place updates, overwriting existing installations and configurations. 

 **Benefits of establishing this best practice:** You recognize benefits from development efforts faster by deploying small changes frequently. When the changes are small it is much easier to identify if they have unintended consequences. When the changes are reversible there is less risk to implementing the change as recovery is simplified. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Deploy frequent, small, reversible changes: Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change. 

# OPS06-BP07 Fully automate integration and deployment
<a name="ops_mit_deploy_risks_auto_integ_deploy"></a>

 Automate build, deployment, and testing of the workload. This reduces errors cause by manual processes and reduces the effort to deploy changes. 

 Apply metadata using [Resource Tags](https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html) and [AWS Resource Groups](https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html) following a consistent [tagging strategy](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/) to aid in identification of your resources. Tag your resources for organization, cost accounting, access controls, and targeting the run of automated operations activities. 

 **Common anti-patterns:** 
+  On Friday, you finish authoring the new code for your feature branch. On Monday, after running your code quality test scripts and each of your unit tests scripts, you will check in your code for the next scheduled release. 
+  You are assigned to code a fix for a critical issue impacting a large number of customers in production. After testing the fix, you commit your code and email change management to request approval to deploy it to production. 

 **Benefits of establishing this best practice:** By implementing automated build and deployment management systems you reduce errors caused by manual processes and reduce the effort to deploy changes helping your team members to focus on delivering business value. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Use build and deployment management systems: Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, and reduces the level of effort. 
  +  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
  +  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
  +  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
  +  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
  +  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 
  +  [Deep Dive on Advanced Continuous Delivery Techniques Using AWS](https://www.youtube.com/watch?v=Lrrgd0Kemhw) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html) 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Continuous integration best practices for software development](https://www.youtube.com/watch?v=GEPJ7Lo346A) 
+  [Deep Dive on Advanced Continuous Delivery Techniques Using AWS](https://www.youtube.com/watch?v=Lrrgd0Kemhw) 
+  [Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services](https://www.youtube.com/watch?v=Wx-ain8UryM) 
+  [Slalom: CI/CD for serverless applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 

# OPS06-BP08 Automate testing and rollback
<a name="ops_mit_deploy_risks_auto_testing_and_rollback"></a>

 Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes. 

 **Common anti-patterns:** 
+  You deploy changes to your workload. After your see that the change is complete, you start post deployment testing. After you see that they are complete, you realize that your workload is inoperable and customers are disconnected. You then begin rolling back to the previous version. After an extended time to detect the issue, the time to recover is extended by your manual redeployment. 

 **Benefits of establishing this best practice:** By testing and validating changes following deployment, you are able to identify issues immediately. By automatically rolling back to the previous version, the impact on your customers is minimized. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate testing and rollback: Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes. For example, perform detailed synthetic user transactions following deployment, verify the results, and roll back on failure. 
  +  [Redeploy and roll back a deployment with AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployments-rollback-and-redeploy.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Redeploy and roll back a deployment with AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployments-rollback-and-redeploy.html) 

# OPS 7. How do you know that you are ready to support a workload?
<a name="ops-07"></a>

 Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand the operational risks related to your workload. 

**Topics**
+ [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md)
+ [OPS07-BP02 Ensure a consistent review of operational readiness](ops_ready_to_support_const_orr.md)
+ [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md)
+ [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md)
+ [OPS07-BP05 Make informed decisions to deploy systems and changes](ops_ready_to_support_informed_deploy_decisions.md)
+ [OPS07-BP06 Create support plans for production workloads](ops_ready_to_support_enable_support_plans.md)

# OPS07-BP01 Ensure personnel capability
<a name="ops_ready_to_support_personnel_capability"></a>

Have a mechanism to validate that you have the appropriate number of trained personnel to support the workload. They must be trained on the platform and services that make up your workload. Provide them with the knowledge necessary to operate the workload. You must have enough trained personnel to support the normal operation of the workload and troubleshoot any incidents that occur. Have enough personnel so that you can rotate during on-call and vacations to avoid burnout. 

 **Desired outcome:** 
+  There are enough trained personnel to support the workload at times when the workload is available. 
+  You provide training for your personnel on the software and services that make up your workload. 

 **Common anti-patterns:** 
+ Deploying a workload without team members trained to operate the platform and services in use. 
+  Not having enough personnel to support on-call rotations or personnel taking time off. 

 **Benefits of establishing this best practice:** 
+  Having skilled team members helps effective support of your workload. 
+  With enough team members, you can support the workload and on-call rotations while decreasing the risk of burnout. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Validate that there are sufficient trained personnel to support the workload. Verify that you have enough team members to cover normal operational activities, including on-call rotations. 

 **Customer example** 

 AnyCompany Retail makes sure that teams supporting the workload are properly staffed and trained. They have enough engineers to support an on-call rotation. Personnel get training on the software and platform that the workload is built on and are encouraged to earn certifications. There are enough personnel so that people can take time off while still supporting the workload and the on-call rotation. 

 **Implementation steps** 

1.  Assign an adequate number of personnel to operate and support your workload, including on-call duties. 

1.  Train your personnel on the software and platforms that compose your workload. 

   1.  [AWS Training and Certification](https://aws.amazon.com/training/) has a library of courses about AWS. They provide free and paid courses, online and in-person. 

   1.  [AWS hosts events and webinars](https://aws.amazon.com/events/) where you learn from AWS experts. 

1.  Regularly evaluate team size and skills as operating conditions and the workload change. Adjust team size and skills to match operational requirements. 

 **Level of effort for the implementation plan:** High. Hiring and training a team to support a workload can take significant effort but has substantial long-term benefits. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - Team members must have the information necessary to operate and support the workload. Knowledge management is the key to providing that. 

 **Related documents:** 
+  [AWS Events and Webinars](https://aws.amazon.com/events/) 
+  [AWS Training and Certification](https://aws.amazon.com/training/) 

# OPS07-BP02 Ensure a consistent review of operational readiness
<a name="ops_ready_to_support_const_orr"></a>

Use Operational Readiness Reviews (ORRs) to validate that you can operate your workload. ORR is a mechanism developed at Amazon to validate that teams can safely operate their workloads. An ORR is a review and inspection process using a checklist of requirements. An ORR is a self-service experience that teams use to certify their workloads. ORRs include best practices from lessons learned from our years of building software. 

 An ORR checklist is composed of architectural recommendations, operational process, event management, and release quality. Our Correction of Error (CoE) process is a major driver of these items. Your own post-incident analysis should drive the evolution of your own ORR. An ORR is not only about following best practices but preventing the recurrence of events that you’ve seen before. Lastly, security, governance, and compliance requirements can also be included in an ORR. 

 Run ORRs before a workload launches to general availability and then throughout the software development lifecycle. Running the ORR before launch increases your ability to operate the workload safely. Periodically re-run your ORR on the workload to catch any drift from best practices. You can have ORR checklists for new services launches and ORRs for periodic reviews. This helps keep you up to date on new best practices that arise and incorporate lessons learned from post-incident analysis. As your use of the cloud matures, you can build ORR requirements into your architecture as defaults. 

 **Desired outcome:**  You have an ORR checklist with best practices for your organization. ORRs are conducted before workloads launch. ORRs are run periodically over the course of the workload lifecycle. 

 **Common anti-patterns:** 
+ You launch a workload without knowing if you can operate it. 
+ Governance and security requirements are not included in certifying a workload for launch. 
+ Workloads are not re-evaluated periodically. 
+ Workloads launch without required procedures in place. 
+ You see repetition of the same root cause failures in multiple workloads. 

 **Benefits of establishing this best practice:** 
+  Your workloads include architecture, process, and management best practices. 
+  Lessons learned are incorporated into your ORR process. 
+  Required procedures are in place when workloads launch. 
+  ORRs are run throughout the software lifecycle of your workloads. 

 **Level of risk if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 An ORR is two things: a process and a checklist. Your ORR process should be adopted by your organization and supported by an executive sponsor. At a minimum, ORRs must be conducted before a workload launches to general availability. Run the ORR throughout the software development lifecycle to keep it up to date with best practices or new requirements. The ORR checklist should include configuration items, security and governance requirements, and best practices from your organization. Over time, you can use services, such as [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html), [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html), and [AWS Control Tower Guardrails](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html), to build best practices from the ORR into guardrails for automatic detection of best practices. 

 **Customer example** 

 After several production incidents, AnyCompany Retail decided to implement an ORR process. They built a checklist composed of best practices, governance and compliance requirements, and lessons learned from outages. New workloads conduct ORRs before they launch. Every workload conducts a yearly ORR with a subset of best practices to incorporate new best practices and requirements that are added to the ORR checklist. Over time, AnyCompany Retail used [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to detect some best practices, speeding up the ORR process. 

 **Implementation steps** 

 To learn more about ORRs, read the [Operational Readiness Reviews (ORR) whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html). It provides detailed information on the history of the ORR process, how to build your own ORR practice, and how to develop your ORR checklist. The following steps are an abbreviated version of that document. For an in-depth understanding of what ORRs are and how to build your own, we recommend reading that whitepaper. 

1. Gather the key stakeholders together, including representatives from security, operations, and development. 

1. Have each stakeholder provide at least one requirement. For the first iteration, try to limit the number of items to thirty or less. 
   +  [Appendix B: Example ORR questions](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/appendix-b-example-orr-questions.html) from the Operational Readiness Reviews (ORR) whitepaper contains sample questions that you can use to get started. 

1. Collect your requirements into a spreadsheet. 
   + You can use [custom lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) in the [AWS Well-Architected Tool](https://console.aws.amazon.com/wellarchiected/) to develop your ORR and share them across your accounts and AWS Organization. 

1. Identify one workload to conduct the ORR on. A pre-launch workload or an internal workload is ideal. 

1. Run through the ORR checklist and take note of any discoveries made. Discoveries might not be ok if a mitigation is in place. For any discovery that lacks a mitigation, add those to your backlog of items and implement them before launch. 

1. Continue to add best practices and requirements to your ORR checklist over time. 

 Support customers with Enterprise Support can request the [Operational Readiness Review Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. The workshop is an interactive *working backwards* session to develop your own ORR checklist. 

 **Level of effort for the implementation plan:** High. Adopting an ORR practice in your organization requires executive sponsorship and stakeholder buy-in. Build and update the checklist with inputs from across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) – Governance requirements are a natural fit for an ORR checklist. 
+ [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) – Compliance requirements are sometimes included in an ORR checklist. Other times they are a separate process. 
+ [OPS03-BP07 Resource teams appropriately](ops_org_culture_team_res_appro.md) – Team capability is a good candidate for an ORR requirement. 
+ [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) – A rollback or rollforward plan must be established before you launch your workload. 
+ [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) – To support a workload you must have the required personnel. 
+ [SEC01-BP03 Identify and validate control objectives](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_securely_operate_control_objectives.html) – Security control objectives make excellent ORR requirements. 
+ [REL13-BP01 Define recovery objectives for downtime and data loss](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_planning_for_recovery_objective_defined_recovery.html) – Disaster recovery plans are a good ORR requirement. 
+ [COST02-BP01 Develop policies based on your organization requirements](https://docs.aws.amazon.com/wellarchitected/latest/framework/cost_govern_usage_policies.html) – Cost management policies are good to include in your ORR checklist. 

 **Related documents:** 
+  [AWS Control Tower - Guardrails in AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html) 
+  [AWS Well-Architected Tool - Custom Lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) 
+  [Operational Readiness Review Template by Adrian Hornsby](https://medium.com/the-cloud-architect/operational-readiness-review-template-e23a4bfd8d79) 
+  [Operational Readiness Reviews (ORR) Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html) 

 **Related videos:** 
+  [AWS Supports You \$1 Building an Effective Operational Readiness Review (ORR)](https://www.youtube.com/watch?v=Keo6zWMQqS8) 

 **Related examples:** 
+  [Sample Operational Readiness Review (ORR) Lens](https://github.com/aws-samples/custom-lens-wa-sample/tree/main/ORR-Lens) 

 **Related services:** 
+  [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) 
+  [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html) 
+  [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html) 
+  [AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html) 

# OPS07-BP03 Use runbooks to perform procedures
<a name="ops_ready_to_support_use_runbooks"></a>

 A *runbook* is a documented process to achieve a specific outcome. Runbooks consist of a series of steps that someone follows to get something done. Runbooks have been used in operations going back to the early days of aviation. In cloud operations, we use runbooks to reduce risk and achieve desired outcomes. At its simplest, a runbook is a checklist to complete a task. 

 Runbooks are an essential part of operating your workload. From onboarding a new team member to deploying a major release, runbooks are the codified processes that provide consistent outcomes no matter who uses them. Runbooks should be published in a central location and updated as the process evolves, as updating runbooks is a key component of a change management process. They should also include guidance on error handling, tools, permissions, exceptions, and escalations in case a problem occurs. 

 As your organization matures, begin automating runbooks. Start with runbooks that are short and frequently used. Use scripting languages to automate steps or make steps easier to perform. As you automate the first few runbooks, you’ll dedicate time to automating more complex runbooks. Over time, most of your runbooks should be automated in some way. 

 **Desired outcome:** Your team has a collection of step-by-step guides for performing workload tasks. The runbooks contain the desired outcome, necessary tools and permissions, and instructions for error handling. They are stored in a central location and updated frequently. 

 **Common anti-patterns:** 
+  Relying on memory to complete each step of a process. 
+  Manually deploying changes without a checklist. 
+  Different team members performing the same process but with different steps or outcomes. 
+  Letting runbooks drift out of sync with system changes and automation. 

 **Benefits of establishing this best practice:** 
+  Reducing error rates for manual tasks. 
+  Operations are performed in a consistent manner. 
+  New team members can start performing tasks sooner. 
+  Runbooks can be automated to reduce toil. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Runbooks can take several forms depending on the maturity level of your organization. At a minimum, they should consist of a step-by-step text document. The desired outcome should be clearly indicated. Clearly document necessary special permissions or tools. Provide detailed guidance on error handling and escalations in case something goes wrong. List the runbook owner and publish it in a central location. Once your runbook is documented, validate it by having someone else on your team run it. As procedures evolve, update your runbooks in accordance with your change management process. 

 Your text runbooks should be automated as your organization matures. Using services like [AWS Systems Manager automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), you can transform flat text into automations that can be run against your workload. These automations can be run in response to events, reducing the operational burden to maintain your workload. 

 **Customer example** 

 AnyCompany Retail must perform database schema updates during software deployments. The Cloud Operations Team worked with the Database Administration Team to build a runbook for manually deploying these changes. The runbook listed each step in the process in checklist form. It included a section on error handling in case something went wrong. They published the runbook on their internal wiki along with their other runbooks. The Cloud Operations Team plans to automate the runbook in a future sprint. 

## Implementation steps
<a name="implementation-steps"></a>

 If you don’t have an existing document repository, a version control repository is a great place to start building your runbook library. You can build your runbooks using Markdown. We have provided an example runbook template that you can use to start building runbooks. 

```
# Runbook Title
## Runbook Info
| Runbook ID | Description | Tools Used | Special Permissions | Runbook Author | Last Updated | Escalation POC | 
|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this runbook for? What is the desired outcome? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name |
## Steps
1. Step one
2. Step two
```

1.  If you don’t have an existing documentation repository or wiki, create a new version control repository in your version control system. 

1.  Identify a process that does not have a runbook. An ideal process is one that is conducted semiregularly, short in number of steps, and has low impact failures. 

1.  In your document repository, create a new draft Markdown document using the template. Fill in `Runbook Title` and the required fields under `Runbook Info`. 

1.  Starting with the first step, fill in the `Steps` portion of the runbook. 

1.  Give the runbook to a team member. Have them use the runbook to validate the steps. If something is missing or needs clarity, update the runbook. 

1.  Publish the runbook to your internal documentation store. Once published, tell your team and other stakeholders. 

1.  Over time, you’ll build a library of runbooks. As that library grows, start working to automate runbooks. 

 **Level of effort for the implementation plan:** Low. The minimum standard for a runbook is a step-by-step text guide. Automating runbooks can increase the implementation effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md): Runbooks should have an owner in charge of maintaining them. 
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Runbooks and playbooks are like each other with one key difference: a runbook has a desired outcome. In many cases runbooks are initiated once a playbook has identified a root cause. 
+  [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md): Runbooks are a part of a good event, incident, and problem management practice. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Runbooks and playbooks should be used to respond to alerts. Over time these reactions should be automated. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md): Maintaining runbooks is a key part of knowledge management. 

 **Related documents:** 
+ [Achieving Operational Excellence using automated playbook and runbook](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/) 
+ [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+ [Migration playbook for AWS large migrations - Task 4: Improving your migration runbooks](https://docs.aws.amazon.com/prescriptive-guidance/latest/large-migration-migration-playbook/task-four-migration-runbooks.html) 
+ [Use AWS Systems Manager Automation runbooks to resolve operational tasks](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/) 

 **Related videos:** 
+  [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response (SEC318-R1)](https://www.youtube.com/watch?v=E1NaYN_fJUo) 
+  [How to automate IT Operations on AWS \$1 Amazon Web Services](https://www.youtube.com/watch?v=GuWj_mlyTug) 
+  [Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE) 

 **Related examples:** 
+  [AWS Systems Manager: Automation walkthroughs](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html) 
+  [AWS Systems Manager: Restore a root volume from the latest snapshot runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-document-sample-restore.html)
+  [Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake](https://catalog.us-east-1.prod.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US) 
+  [Gitlab - Runbooks](https://gitlab.com/gitlab-com/runbooks) 
+  [Rubix - A Python library for building runbooks in Jupyter Notebooks](https://github.com/Nurtch/rubix) 
+  [Using Document Builder to create a custom runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html) 
+  [Well-Architected Labs: Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/) 

 **Related services:** 
+  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 

# OPS07-BP04 Use playbooks to investigate issues
<a name="ops_ready_to_support_use_playbooks"></a>

 Playbooks are step-by-step guides used to investigate an incident. When incidents happen, playbooks are used to investigate, scope impact, and identify a root cause. Playbooks are used for a variety of scenarios, from failed deployments to security incidents. In many cases, playbooks identify the root cause that a runbook is used to mitigate. Playbooks are an essential component of your organization's incident response plans. 

 A good playbook has several key features. It guides the user, step by step, through the process of discovery. Thinking outside-in, what steps should someone follow to diagnose an incident? Clearly define in the playbook if special tools or elevated permissions are needed in the playbook. Having a communication plan to update stakeholders on the status of the investigation is a key component. In situations where a root cause can’t be identified, the playbook should have an escalation plan. If the root cause is identified, the playbook should point to a runbook that describes how to resolve it. Playbooks should be stored centrally and regularly maintained. If playbooks are used for specific alerts, provide your team with pointers to the playbook within the alert. 

 As your organization matures, automate your playbooks. Start with playbooks that cover low-risk incidents. Use scripting to automate the discovery steps. Make sure that you have companion runbooks to mitigate common root causes. 

 **Desired outcome:** Your organization has playbooks for common incidents. The playbooks are stored in a central location and available to your team members. Playbooks are updated frequently. For any known root causes, companion runbooks are built. 

 **Common anti-patterns:** 
+  There is no standard way to investigate an incident. 
+  Team members rely on muscle memory or institutional knowledge to troubleshoot a failed deployment. 
+  New team members learn how to investigate issues through trial and error. 
+  Best practices for investigating issues are not shared across teams. 

 **Benefits of establishing this best practice:** 
+  Playbooks boost your efforts to mitigate incidents. 
+  Different team members can use the same playbook to identify a root cause in a consistent manner. 
+  Known root causes can have runbooks developed for them, speeding up recovery time. 
+  Playbooks help team members to start contributing sooner. 
+  Teams can scale their processes with repeatable playbooks. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 How you build and use playbooks depends on the maturity of your organization. If you are new to the cloud, build playbooks in text form in a central document repository. As your organization matures, playbooks can become semi-automated with scripting languages like Python. These scripts can be run inside a Jupyter notebook to speed up discovery. Advanced organizations have fully automated playbooks for common issues that are auto-remediated with runbooks. 

 Start building your playbooks by listing common incidents that happen to your workload. Choose playbooks for incidents that are low risk and where the root cause has been narrowed down to a few issues to start. After you have playbooks for simpler scenarios, move on to the higher risk scenarios or scenarios where the root cause is not well known. 

 Your text playbooks should be automated as your organization matures. Using services like [AWS Systems Manager Automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), flat text can be transformed into automations. These automations can be run against your workload to speed up investigations. These automations can be activated in response to events, reducing the mean time to discover and resolve incidents. 

 Customers can use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to respond to incidents. This service provides a single interface to triage incidents, inform stakeholders during discovery and mitigation, and collaborate throughout the incident. It uses AWS Systems Manager Automations to speed up detection and recovery. 

 **Customer example** 

 A production incident impacted AnyCompany Retail. The on-call engineer used a playbook to investigate the issue. As they progressed through the steps, they kept the key stakeholders, identified in the playbook, up to date. The engineer identified the root cause as a race condition in a backend service. Using a runbook, the engineer relaunched the service, bringing AnyCompany Retail back online. 

## Implementation steps
<a name="implementation-steps"></a>

 If you don’t have an existing document repository, we suggest creating a version control repository for your playbook library. You can build your playbooks using Markdown, which is compatible with most playbook automation systems. If you are starting from scratch, use the following example playbook template. 

```
# Playbook Title
## Playbook Info
| Playbook ID | Description | Tools Used | Special Permissions | Playbook Author | Last Updated | Escalation POC | Stakeholders | Communication Plan |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this playbook for? What incident is it used for? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name | Stakeholder Name | How will updates be communicated during the investigation? |
## Steps
1. Step one
2. Step two
```

1.  If you don’t have an existing document repository or wiki, create a new version control repository for your playbooks in your version control system. 

1.  Identify a common issue that requires investigation. This should be a scenario where the root cause is limited to a few issues and resolution is low risk. 

1.  Using the Markdown template, fill in the `Playbook Name` section and the fields under `Playbook Info`. 

1.  Fill in the troubleshooting steps. Be as clear as possible on what actions to perform or what areas you should investigate. 

1.  Give a team member the playbook and have them go through it to validate it. If there’s anything missing or something isn’t clear, update the playbook. 

1.  Publish your playbook in your document repository and inform your team and any stakeholders. 

1.  This playbook library will grow as you add more playbooks. Once you have several playbooks, start automating them using tools like AWS Systems Manager Automations to keep automation and playbooks in sync. 

 **Level of effort for the implementation plan:** Low. Your playbooks should be text documents stored in a central location. More mature organizations will move towards automating playbooks. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md): Playbooks should have an owner in charge of maintaining them. 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Runbooks and playbooks are similar, but with one key difference: a runbook has a desired outcome. In many cases, runbooks are used once a playbook has identified a root cause. 
+  [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md): Playbooks are a part of good event, incident, and problem management practice. 
+  [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md): Runbooks and playbooks should be used to respond to alerts. Over time, these reactions should be automated. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md): Maintaining playbooks is a key part of knowledge management. 

 **Related documents:** 
+ [ Achieving Operational Excellence using automated playbook and runbook ](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/)
+  [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+ [ Use AWS Systems Manager Automation runbooks to resolve operational tasks ](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/)

 **Related videos:** 
+ [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response (SEC318-R1) ](https://www.youtube.com/watch?v=E1NaYN_fJUo)
+ [AWS Systems Manager Incident Manager - AWS Virtual Workshops ](https://www.youtube.com/watch?v=KNOc0DxuBSY)
+ [ Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE)

 **Related examples:** 
+ [AWS Customer Playbook Framework ](https://github.com/aws-samples/aws-customer-playbook-framework)
+ [AWS Systems Manager: Automation walkthroughs ](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html)
+ [ Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake ](https://catalog.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US)
+ [ Rubix – A Python library for building runbooks in Jupyter Notebooks ](https://github.com/Nurtch/rubix)
+ [ Using Document Builder to create a custom runbook ](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html)
+ [ Well-Architected Labs: Automating operations with Playbooks and Runbooks ](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/)
+ [ Well-Architected Labs: Incident response playbook with Jupyter ](https://www.wellarchitectedlabs.com/security/300_labs/300_incident_response_playbook_with_jupyter-aws_iam/)

 **Related services:** 
+ [AWS Systems Manager Automation ](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html)
+ [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html)

# OPS07-BP05 Make informed decisions to deploy systems and changes
<a name="ops_ready_to_support_informed_deploy_decisions"></a>

Have processes in place for successful and unsuccessful changes to your workload. A pre-mortem is an exercise where a team simulates a failure to develop mitigation strategies. Use pre-mortems to anticipate failure and create procedures where appropriate. Evaluate the benefits and risks of deploying changes to your workload. Verify that all changes comply with governance. 

 **Desired outcome:** 
+  You make informed decisions when deploying changes to your workload. 
+  Changes comply with governance. 

 **Common anti-patterns:** 
+ Deploying a change to our workload without a process to handle a failed deployment.
+ Making changes to your production environment that are out of compliance with governance requirements.
+ Deploying a new version of your workload without establishing a baseline for resource utilization.

 **Benefits of establishing this best practice:** 
+  You are prepared for unsuccessful changes to your workload. 
+  Changes to your workload are compliant with governance policies. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use pre-mortems to develop processes for unsuccessful changes. Document your processes for unsuccessful changes. Ensure that all changes comply with governance. Evaluate the benefits and risks to deploying changes to your workload. 

 **Customer example** 

 AnyCompany Retail regularly conducts pre-mortems to validate their processes for unsuccessful changes. They document their processes in a shared Wiki and update it frequently. All changes comply with governance requirements. 

 **Implementation steps** 

1.  Make informed decisions when deploying changes to your workload. Establish and review criteria for a successful deployment. Develop scenarios or criteria that would initiate a rollback of a change. Weigh the benefits of deploying changes against the risks of an unsuccessful change. 

1.  Verify that all changes comply with governance policies. 

1.  Use pre-mortems to plan for unsuccessful changes and document mitigation strategies. Run a table-top exercise to model an unsuccessful change and validate roll-back procedures. 

 **Level of effort for the implementation plan:** Moderate. Implementing a practice of pre-mortems requires coordination and effort from stakeholders across your organization 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements are a key factor in determining whether to deploy a change. 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) - Establish plans to mitigate a failed deployment and use pre-mortems to validate them. 
+  [OPS06-BP02 Test and validate changes](ops_mit_deploy_risks_test_val_chg.md) - Every software change should be properly tested before deployment in order to reduce defects in production. 
+  [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) - Having enough trained personnel to support the workload is essential to making an informed decision to deploy a system change. 

 **Related documents:** 
+ [ Amazon Web Services: Risk and Compliance ](https://docs.aws.amazon.com/whitepapers/latest/aws-risk-and-compliance/welcome.html)
+ [AWS Shared Responsibility Model ](https://aws.amazon.com/compliance/shared-responsibility-model/)
+ [ Governance in the AWS Cloud: The Right Balance Between Agility and Safety ](https://aws.amazon.com/blogs/apn/governance-in-the-aws-cloud-the-right-balance-between-agility-and-safety/)

# OPS07-BP06 Create support plans for production workloads
<a name="ops_ready_to_support_enable_support_plans"></a>

 Enable support for any software and services that your production workload relies on. Select an appropriate support level to meet your production service-level needs. Support plans for these dependencies are necessary in case there is a service disruption or software issue. Document support plans and how to request support for all service and software vendors. Implement mechanisms that verify that support points of contacts are kept up to date. 

 **Desired outcome:** 
+  Implement support plans for software and services that production workloads rely on. 
+  Choose an appropriate support plan based on service-level needs. 
+  Document the support plans, support levels, and how to request support. 

 **Common anti-patterns:** 
+  You have no support plan for a critical software vendor. Your workload is impacted by them and you can do nothing to expedite a fix or get timely updates from the vendor. 
+  A developer that was the primary point of contact for a software vendor left the company. You are not able to reach the vendor support directly. You must spend time researching and navigating generic contact systems, increasing the time required to respond when needed. 
+  A production outage occurs with a software vendor. There is no documentation on how to file a support case. 

 **Benefits of establishing this best practice:** 
+  With the appropriate support level, you are able to get a response in the time frame necessary to meet service-level needs. 
+  As a supported customer you can escalate if there are production issues. 
+  Software and services vendors can assist in troubleshooting during an incident. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Enable support plans for any software and services vendors that your production workload relies on. Set up appropriate support plans to meet service-level needs. For AWS customers, this means activating AWS Business Support or greater on any accounts where you have production workloads. Meet with support vendors on a regular cadence to get updates about support offerings, processes, and contacts. Document how to request support from software and services vendors, including how to escalate if there is an outage. Implement mechanisms to keep support contacts up to date. 

 **Customer example** 

 At AnyCompany Retail, all commercial software and services dependencies have support plans. For example, they have AWS Enterprise Support activated on all accounts with production workloads. Any developer can raise a support case when there is an issue. There is a wiki page with information on how to request support, whom to notify, and best practices for expediting a case. 

 **Implementation steps** 

1.  Work with stakeholders in your organization to identify software and services vendors that your workload relies on. Document these dependencies. 

1.  Determine service-level needs for your workload. Select a support plan that aligns with them. 

1.  For commercial software and services, establish a support plan with the vendors. 

   1.  Subscribing to AWS Business Support or greater for all production accounts provides faster response time from AWS Support and strongly recommended. If you don’t have premium support, you must have an action plan to handle issues, which require help from AWS Support. AWS Support provides a mix of tools and technology, people, and programs designed to proactively help you optimize performance, lower costs, and innovate faster. AWS Business Support provides additional benefits, including access to AWS Trusted Advisor and AWS Personal Health Dashboard and faster response times. 

1.  Document the support plan in your knowledge management tool. Include how to request support, who to notify if a support case is filed, and how to escalate during an incident. A wiki is a good mechanism to allow anyone to make necessary updates to documentation when they become aware of changes to support processes or contacts. 

 **Level of effort for the implementation plan:** Low. Most software and services vendors offer opt-in support plans. Documenting and sharing support best practices on your knowledge management system verifies that your team knows what to do when there is a production issue. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) 

 **Related documents:** 
+ [AWS Support Plans ](https://docs.aws.amazon.com/awssupport/latest/user/aws-support-plans.html)

 **Related services:** 
+ [AWS Business Support ](https://aws.amazon.com/premiumsupport/plans/business/)
+ [AWS Enterprise Support ](https://aws.amazon.com/premiumsupport/plans/enterprise/)