# OPS 10. How do you manage workload and operations events?
<a name="ops-10"></a>

 Prepare and validate procedures for responding to events to minimize their disruption to your workload. 

**Topics**
+ [OPS10-BP01 Use a process for event, incident, and problem management](ops_event_response_event_incident_problem_process.md)
+ [OPS10-BP02 Have a process per alert](ops_event_response_process_per_alert.md)
+ [OPS10-BP03 Prioritize operational events based on business impact](ops_event_response_prioritize_events.md)
+ [OPS10-BP04 Define escalation paths](ops_event_response_define_escalation_paths.md)
+ [OPS10-BP05 Define a customer communication plan for outages](ops_event_response_push_notify.md)
+ [OPS10-BP06 Communicate status through dashboards](ops_event_response_dashboards.md)
+ [OPS10-BP07 Automate responses to events](ops_event_response_auto_event_response.md)

# OPS10-BP01 Use a process for event, incident, and problem management
<a name="ops_event_response_event_incident_problem_process"></a>

Your organization has processes to handle events, incidents, and problems. *Events* are things that occur in your workload but may not need intervention. *Incidents* are events that require intervention. *Problems* are recurring events that require intervention or cannot be resolved. You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately.

When incidents and problems happen to your workload, you need processes to handle them. How will you communicate the status of the event with stakeholders? Who oversees leading the response? What are the tools that you use to mitigate the event? These are examples of some of the questions you need answer to have a solid response process. 

Processes must be documented in a central location and available to anyone involved in your workload. If you don’t have a central wiki or document store, a version control repository can be used. You’ll keep these plans up to date as your processes evolve. 

Problems are candidates for automation. These events take time away from your ability to innovate. Start with building a repeatable process to mitigate the problem. Over time, focus on automating the mitigation or fixing the underlying issue. This frees up time to devote to making improvements in your workload. 

**Desired outcome:** Your organization has a process to handle events, incidents, and problems. These processes are documented and stored in a central location. They are updated as processes change. 

**Common anti-patterns:** 
+  An incident happens on the weekend and the on-call engineer doesn’t know what to do. 
+  A customer sends you an email that the application is down. You reboot the server to fix it. This happens frequently. 
+  There is an incident with multiple teams working independently to try to solve it. 
+  Deployments happen in your workload without being recorded. 

 **Benefits of establishing this best practice:** 
+  You have an audit trail of events in your workload. 
+  Your time to recover from an incident is decreased. 
+  Team members can resolve incidents and problems in a consistent manner. 
+  There is a more consolidated effort when investigating an incident. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed. 

 **Customer example** 

AnyCompany Retail has a portion of their internal wiki devoted to processes for event, incident, and problem management. All events are sent to [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html). Problems are identified as OpsItems in [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) and prioritized to fix, reducing undifferentiated labor. As processes change, they’re updated in their internal wiki. They use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to manage incidents and coordinate mitigation efforts. 

## Implementation steps
<a name="implementation-steps"></a>

1.  Events 
   +  Track events that happen in your workload, even if no human intervention is required. 
   +  Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching. 
   +  You can use services like [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) or [Amazon Simple Notification Service](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) to generate custom events for tracking. 

1.  Incidents 
   +  Start by defining the communication plan for incidents. What stakeholders must be informed? How will you keep them in the loop? Who oversees coordinating efforts? We recommend standing up an internal chat channel for communication and coordination. 
   +  Define escalation paths for the teams that support your workload, especially if the team doesn’t have an on-call rotation. Based on your support level, you can also file a case with Support. 
   +  Create a playbook to investigate the incident. This should include the communication plan and detailed investigation steps. Include checking the [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) in your investigation. 
   +  Document your incident response plan. Communicate the incident management plan so internal and external customers understand the rules of engagement and what is expected of them. Train your team members on how to use it. 
   +  Customers can use [Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to set up and manage their incident response plan. 
   +  Enterprise Support customers can request the [Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement. 

1.  Problems 
   +  Problems must be identified and tracked in your ITSM system. 
   +  Identify all known problems and prioritize them by effort to fix and impact to workload.   
![\[Action priority matrix for prioritizing problems.\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/impact-effort-chart.png)
   +  Solve problems that are high impact and low effort first. Once those are solved, move on to problems to that fall into the low impact low effort quadrant. 
   +  You can use [Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) to identify these problems, attach runbooks to them, and track them. 

**Level of effort for the implementation plan:** Medium. You need both a process and tools to implement this best practice. Document your processes and make them available to anyone associated with the workload. Update them frequently. You have a process for managing problems and mitigating them or fixing them. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md): Known problems need an associated runbook so that mitigation efforts are consistent.
+  [OPS07-BP04 Use playbooks to investigate issues](ops_ready_to_support_use_playbooks.md): Incidents must be investigated using playbooks. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md): Always conduct a postmortem after you recover from an incident. 

 **Related documents:** 
+  [Atlassian - Incident management in the age of DevOps](https://www.atlassian.com/incident-management/devops) 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+  [Incident Management in the Age of DevOps and SRE](https://www.infoq.com/presentations/incident-management-devops-sre/) 
+  [PagerDuty - What is Incident Management?](https://www.pagerduty.com/resources/learn/what-is-incident-management/) 

 **Related videos:** 
+  [AWS re:Invent 2020: Incident management in a distributed organization](https://www.youtube.com/watch?v=tyS1YDhMVos) 
+  [AWS re:Invent 2021 - Building next-gen applications with event-driven architectures](https://www.youtube.com/watch?v=U5GZNt0iMZY) 
+  [AWS Supports You \$1 Exploring the Incident Management Tabletop Exercise](https://www.youtube.com/watch?v=0m8sGDx-pRM) 
+  [AWS Systems Manager Incident Manager - AWS Virtual Workshops](https://www.youtube.com/watch?v=KNOc0DxuBSY) 
+  [AWS What's Next ft. Incident Manager \$1 AWS Events](https://www.youtube.com/watch?v=uZL-z7cII3k) 

 **Related examples:** 
+  [AWS Management and Governance Tools Workshop - OpsCenter](https://mng.workshop.aws/ssm/capability_hands-on_labs/opscenter.html) 
+  [AWS Proactive Services – Incident Management Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/#Operational_Workshops_and_Deep_Dives) 
+  [Building an event-driven application with Amazon EventBridge](https://aws.amazon.com/blogs/compute/building-an-event-driven-application-with-amazon-eventbridge/) 
+  [Building event-driven architectures on AWS](https://catalog.us-east-1.prod.workshops.aws/workshops/63320e83-6abc-493d-83d8-f822584fb3cb/en-US/) 

 **Related services:** 
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) 
+  [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) 
+  [AWS Health Dashboard](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 
+  [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html) 

# OPS10-BP02 Have a process per alert
<a name="ops_event_response_process_per_alert"></a>

 Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert. This ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications. 

 **Common anti-patterns:** 
+  Your monitoring system presents you a stream of approved connections along with other messages. The volume of messages is so large that you miss periodic error messages that require your intervention. 
+  You receive an alert that the website is down. There is no defined process for when this happens. You are forced to take an ad hoc approach to diagnose and resolve the issue. Developing this process as you go extends the time to recovery. 

 **Benefits of establishing this best practice:** By alerting only when action is required, you prevent low value alerts from concealing high value alerts. By having a process for every actionable alert, you create a consistent and prompt response to events in your environment. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, individual, team, or role) accountable for successful completion. Performance of the response may be automated or conducted by another team but the owner is accountable for ensuring the process delivers the expected outcomes. By having these processes, you ensure effective and prompt responses to operations events and you can prevent actionable events from being obscured by less valuable notifications. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable to ensure that the automatic scaling rules and limits are appropriate for workload needs. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

# OPS10-BP03 Prioritize operational events based on business impact
<a name="ops_event_response_prioritize_events"></a>

 Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, or damage to reputation or trust. 

 **Common anti-patterns:** 
+  You receive a support request to add a printer configuration for a user. While working on the issue, you receive a support request stating that your retail site is down. After completing the printer configuration for your user, you start work on the website issue. 
+  You get notified that both your retail website and your payroll system are down. You don't know which one should get priority. 

 **Benefits of establishing this best practice:** Prioritizing responses to the incidents with the greatest impact on the business notifies your management of that impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. Impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust. 

# OPS10-BP04 Define escalation paths
<a name="ops_event_response_define_escalation_paths"></a>

 Define escalation paths in your runbooks and playbooks, including what initiates escalation, and procedures for escalation. Specifically identify owners for each action to ensure effective and prompt responses to operations events. 

 Identify when a human decision is required before an action is taken. Work with decision makers to have that decision made in advance, and the action preapproved, so that MTTR is not extended waiting for a response. 

 **Common anti-patterns:** 
+  Your retail site is down. You don't understand the runbook for recovering the site. You start calling colleagues hoping that someone will be able to help you. 
+  You receive a support case for an unreachable application. You don't have permissions to administer the system. You don't know who does. You attempt to contact the system owner that opened the case and there is no response. You have no contacts for the system and your colleagues are not familiar with it. 

 **Benefits of establishing this best practice:** By defining escalations, what initiates the escalation, and procedures for escalation you provide the systematic addition of resources to an incident at an appropriate rate for the impact. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Define escalation paths: Define escalation paths in your runbooks and playbooks, including what starts escalation, and procedures for escalation. For example, escalation of an issue from support engineers to senior support engineers when runbooks cannot resolve the issue, or when a predefined period of time has elapsed. Another example of an appropriate escalation path is from senior support engineers to the development team for a workload when the playbooks are unable to identify a path to remediation, or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties. For example, a network connectivity provider or a software vendor. Escalations can include identified authorized decision makers for impacted systems. 

# OPS10-BP05 Define a customer communication plan for outages
<a name="ops_event_response_push_notify"></a>

 Define and test a communication plan for system outages that you can rely on to keep your customers and stakeholders informed during outages. Communicate directly with your users both when the services they use are impacted and when services return to normal. 

 **Desired outcome:** 
+  You have a communication plan for situations ranging from scheduled maintenance to large unexpected failures, including invocation of disaster recovery plans. 
+  In your communications, you provide clear and transparent information about systems issues to help customers avoid second guessing the performance of their systems. 
+  You use custom error messages and status pages to reduce the spike in help desk requests and keep users informed. 
+  The communication plan is regularly tested to verify that it will perform as intended when a real outage occurs. 

 **Common anti-patterns:** 
+ A workload outage occurs but you have no communication plan. Users overwhelm your trouble ticket system with requests because they have no information on the outage.
+ You send an email notification to your users during an outage. It doesn’t contain a timeline for restoration of service so users cannot plan around the outage.
+ There is a communication plan for outages but it has never been tested. An outage occurs and the communication plan fails because a critical step was missed that could have been caught in testing.
+  During an outage, you send a notification to users with too many technical details and information under your AWS NDA. 

 **Benefits of establishing this best practice:** 
+  Maintaining communication during outages ensures that customers are provided with visibility of progress on issues and estimated time to resolution. 
+  Developing a well-defined communications plan verifies that your customers and end users are well informed so they can take required additional steps to mitigate the impact of outages. 
+  With proper communications and increased awareness of planned and unplanned outages, you can improve customer satisfaction, limit unintended reactions, and drive customer retention. 
+  Timely and transparent system outage communication builds confidence and establishes trust needed to maintain relationships between you and your customers. 
+  A proven communication strategy during an outage or crisis reduces speculation and gossip that could hinder your ability to recover. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Communication plans that keep your customers informed during outages are holistic and cover multiple interfaces including customer facing error pages, custom API error messages, system status banners, and health status pages. If your system includes registered users, you can communicate over messaging channels such as email, SMS or push notifications to send personalized message content to your customers. 

 **Customer communication tools** 

 As a first line of defense, web and mobile applications should provide friendly and informative error messages during an outage as well as have the ability to redirect traffic to a status page. [Amazon CloudFront](https://aws.amazon.com/cloudfront/) is a fully managed content delivery network (CDN) that includes capabilities to define and serve custom error content. Custom error pages in CloudFront are a good first layer of customer messaging for component level outages. CloudFront can also simplify managing and activating a status page to intercept all requests during planned or unplanned outages. 

 Custom API error messages can help detect and reduce impact when outages are isolated to discrete services. [Amazon API Gateway](https://aws.amazon.com/api-gateway/) allows you to configure custom responses for your REST APIs. This allows you to provide clear and meaningful messaging to API consumers when API Gateway is not able to reach backend services. Custom messages can also be used to support outage banner content and notifications when a particular system feature is degraded due to service tier outages. 

 Direct messaging is the most personalized type of customer messaging. [Amazon Pinpoint](https://aws.amazon.com/pinpoint/) is a managed service for scalable multichannel communications. Amazon Pinpoint allows you to build campaigns that can broadcast messages widely across your impacted customer base over SMS, email, voice, push notifications, or custom channels you define. When you manage messaging with Amazon Pinpoint, message campaigns are well defined, testable, and can be intelligently applied to targeted customer segments. Once established, campaigns can be scheduled or started by events and they can easily be tested. 

 **Customer example** 

 When the workload is impaired, AnyCompany Retail sends out an email notification to their users. The email describes what business functionality is impaired and provides a realistic estimate of when service will be restored. In addition, they have a status page that shows real-time information about the health of their workload. The communication plan is tested in a development environment twice per year to validate that it is effective. 

 **Implementation steps** 

1.  Determine the communication channels for your messaging strategy. Consider the architectural aspects of your application and determine the best strategy for delivering feedback to your customers. This could include one or more of the guidance strategies outlined including error and status pages, custom API error responses, or direct messaging. 

1.  Design status pages for your application. If you’ve determined that status or custom error pages are suitable for your customers, you’ll need to design your content and messaging for those pages. Error pages explain to users why an application is not available, when it may become available again, and what they can do in the meantime. If your application uses Amazon CloudFront you can serve [custom error responses](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/GeneratingCustomErrorResponses.html) or use Lambda at Edge to [translate errors](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-examples.html#lambda-examples-update-error-status-examples) and rewrite page content. CloudFront also makes it possible to swap destinations from your application content to a static [Amazon S3](https://aws.amazon.com/s3/) content origin containing your maintenance or outage status page . 

1.  Design the correct set of API error statuses for your service. Error messages produced by API Gateway when it can’t reach backend services, as well as service tier exceptions, may not contain friendly messages suitable for display to end users. Without having to make code changes to your backend services, you can configure API Gateway [custom error responses](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html) to map HTTP response codes to curated API error messages. 

1.  Design messaging from a business perspective so that it is relevant to end users for your system and does not contain technical details. Consider your audience and align your messaging. For example, you may steer internal users towards a workaround or manual process that leverages alternate systems. External users may be asked to wait until the system is restored, or subscribe to updates to receive a notification once the system is restored. Define approved messaging for multiple scenarios including unexpected outages, planned maintenance, and partial system failures where a particular feature may be degraded or unavailable. 

1.  Templatize and automate your customer messaging. Once you have established your message content, you can use [Amazon Pinpoint](https://docs.aws.amazon.com/pinpoint/latest/developerguide/welcome.html) or other tools to automate your messaging campaign. With Amazon Pinpoint you can create customer target segments for specific affected users and transform messages into templates. Review the [Amazon Pinpoint tutorial](https://docs.aws.amazon.com/pinpoint/latest/developerguide/tutorials.html) to get an understanding of how-to setup a messaging campaign. 

1.  Avoiding tightly coupling messaging capabilities to your customer facing system. Your messaging strategy should not have hard dependencies on system data stores or services to verify that you can successfully send messages when you experience outages. Consider building the ability to send messages from more than [one Availability Zone or Region](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_fault_isolation_multiaz_region_system.html) for messaging availability. If you are using AWS services to send messages, leverage data plane operations over [control plane operation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_avoid_control_plane.html) to invoke your messaging. 

 **Level of effort for the implementation plan:** High. Developing a communication plan, and the mechanisms to send it, can require a significant effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS07-BP03 Use runbooks to perform procedures](ops_ready_to_support_use_runbooks.md) - Your communication plan should have a runbook associated with it so that your personnel know how to respond. 
+  [OPS11-BP02 Perform post-incident analysis](ops_evolve_ops_perform_rca_process.md) - After an outage, conduct post-incident analysis to identify mechanisms to prevent another outage. 

 **Related documents:** 
+ [ Error Handling Patterns in Amazon API Gateway and AWS Lambda](https://aws.amazon.com/blogs/compute/error-handling-patterns-in-amazon-api-gateway-and-aws-lambda/)
+ [ Amazon API Gateway responses ](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gatewayResponse-definition.html#supported-gateway-response-types)

 **Related examples:** 
+ [AWS Health Dashboard ](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/)
+ [ Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region ](https://aws.amazon.com/message/12721/)

 **Related services:** 
+ [AWS Support](https://aws.amazon.com/premiumsupport/)
+ [AWS Customer Agreement ](https://aws.amazon.com/agreement/)
+ [ Amazon CloudFront ](https://aws.amazon.com/cloudfront/)
+ [ Amazon API Gateway ](https://aws.amazon.com/api-gateway/)
+ [ Amazon Pinpoint ](https://aws.amazon.com/pinpoint/)
+ [ Amazon S3 ](https://aws.amazon.com/s3/)

# OPS10-BP06 Communicate status through dashboards
<a name="ops_event_response_dashboards"></a>

 Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. 

 You can create dashboards using [Amazon CloudWatch Dashboards](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) on customizable home pages in the CloudWatch console. Using business intelligence services such as [Quick](https://aws.amazon.com/quicksight/) you can create and publish interactive dashboards of your workload and operational health (for example, order rates, connected users, and transaction times). Create Dashboards that present system and business-level views of your metrics. 

 **Common anti-patterns:** 
+  Upon request, you run a report on the current utilization of your application for management. 
+  During an incident, you are contacted every twenty minutes by a concerned system owner wanting to know if it is fixed yet. 

 **Benefits of establishing this best practice:** By creating dashboards, you create self-service access to information helping your customers to informed themselves and determine if they need to take action. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. Providing a self-service option for status information reduces the disruption of fielding requests for status by the operations team. Examples include Amazon CloudWatch dashboards, and AWS Health Dashboard. 
  +  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Quick](https://aws.amazon.com/quicksight/) 
+  [CloudWatch dashboards create and use customized metrics views](https://aws.amazon.com/blogs/aws/cloudwatch-dashboards-create-use-customized-metrics-views/) 

# OPS10-BP07 Automate responses to events
<a name="ops_event_response_auto_event_response"></a>

 Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 

 There are multiple ways to automate runbook and playbook actions on AWS. To respond to an event from a state change in your AWS resources, or from your own custom events, you should create [CloudWatch Events rules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) to initiate responses through CloudWatch targets (for example, Lambda functions, Amazon Simple Notification Service (Amazon SNS) topics, Amazon ECS tasks, and AWS Systems Manager Automation). 

 To respond to a metric that crosses a threshold for a resource (for example, wait time), you should create [CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) to perform one or more actions using Amazon EC2 actions, Auto Scaling actions, or to send a notification to an Amazon SNS topic. If you need to perform custom actions in response to an alarm, invoke Lambda through an Amazon SNS notification. Use Amazon SNS to publish event notifications and escalation messages to keep people informed. 

 AWS also supports third-party systems through the AWS service APIs and SDKs. There are a number of monitoring tools provided by AWS Partners and third parties that allow for monitoring, notifications, and responses. Some of these tools include New Relic, Splunk, Loggly, SumoLogic, and Datadog. 

 You should keep critical manual procedures available for use when automated procedures fail 

 **Common anti-patterns:** 
+  A developer checks in their code. This event could have been used to start a build and then perform testing but instead nothing happens. 
+  Your application logs a specific error before it stops working. The procedure to restart the application is well understood and could be scripted. You could use the log event to invoke a script and restart the application. Instead, when the error happens at 3am Sunday morning, you are woken up as the on-call resource responsible to fix the system. 

 **Benefits of establishing this best practice:** By using automated responses to events, you reduce the time to respond and limit the introduction of errors from manual activities. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses. 
  +  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 
  +  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
  +  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
  +  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon CloudWatch Features](https://aws.amazon.com/cloudwatch/features/) 
+  [CloudWatch Events event examples from supported services](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html) 
+  [Creating a CloudWatch Events rule that starts on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) 
+  [Creating a CloudWatch Events rule that starts on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html) 
+  [What is Amazon CloudWatch Events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) 

 **Related videos:** 
+  [Build a Monitoring Plan](https://www.youtube.com/watch?v=OMmiGETJpfU) 

 **Related examples:**