Event management (AIOps) - AWS Cloud Adoption Framework: Operations Perspective

This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

Event management (AIOps)

Detect events, assess their potential impact, and determine the appropriate control action.

An event is an observation of an action, occurrence, or change of state. Events can be planned or unplanned and they can originate internally or externally to the workload. For example, an advertising promotion for a retail site is a planned event that originates externally to the workload. In contrast, a component failure is an unplanned event that originates internally to the workload. Events that require a response are called incidents. By defining the right measurements and thresholds for normal operating conditions, you can detect an event when a threshold is breached.
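As a minimal sketch of threshold-based detection, the following Python function reports an event only after several consecutive samples exceed a threshold. The names `Sample`, `detect_breaches`, and `datapoints_to_alarm` are illustrative assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Sample:
    timestamp: int  # seconds since an arbitrary start (illustrative)
    value: float    # measured value, for example CPU utilization in percent


def detect_breaches(samples: Iterable[Sample], threshold: float,
                    datapoints_to_alarm: int = 3) -> List[Sample]:
    """Return the samples at which a sustained breach is first detected.

    A breach is reported once `datapoints_to_alarm` consecutive samples
    exceed `threshold`, which filters out single-sample spikes.
    """
    detected = []
    streak = 0
    for sample in samples:
        if sample.value > threshold:
            streak += 1
            if streak == datapoints_to_alarm:
                detected.append(sample)
        else:
            streak = 0  # back within normal operating conditions
    return detected
```

Requiring a run of breaching samples trades a slightly later detection for fewer false alarms; the right setting depends on the workload.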

Using advanced machine learning techniques, you can reduce operational incidents and increase service quality. Artificial Intelligence for IT Operations (AIOps) can help by grouping related incidents, predicting incidents before they happen, and classifying new incidents and insights.

Start

Efficient and effective management of planned and unplanned operational events is required to achieve operational excellence. Application classification determines the criticality of workloads based on business impact and customer experience, simplifying event handling and prioritization of concurrent events.

Establish fine-grained alerts and thresholds with differing actions based on criticality. For example, an application that serves a critical user flow, such as publishing documents, should have rapid escalation, while slow disk saturation on a less critical application might be addressed during business hours.
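One way to express criticality-based handling is a routing table keyed by application classification. The sketch below is hypothetical: the tiers, action names, and escalation times are assumptions chosen for illustration and would come from your own classification scheme.

```python
from enum import Enum
from typing import Dict, Optional


class Criticality(Enum):
    CRITICAL = "critical"
    STANDARD = "standard"
    LOW = "low"


# Hypothetical policy table: action and escalation delay per criticality tier.
ROUTING_POLICY: Dict[Criticality, Dict[str, Optional[object]]] = {
    Criticality.CRITICAL: {"action": "page_on_call", "escalate_after_minutes": 5},
    Criticality.STANDARD: {"action": "notify_team_channel", "escalate_after_minutes": 60},
    Criticality.LOW: {"action": "ticket_for_business_hours", "escalate_after_minutes": None},
}


def route_alert(application: str, criticality: Criticality, alert_name: str) -> dict:
    """Combine an alert with the handling policy for its application's tier."""
    policy = ROUTING_POLICY[criticality]
    return {
        "application": application,
        "alert": alert_name,
        "action": policy["action"],
        "escalate_after_minutes": policy["escalate_after_minutes"],
    }
```

For example, `route_alert("document-publishing", Criticality.CRITICAL, "PublishFailures")` selects the paging action, while a disk saturation alert on a low-criticality application becomes a ticket.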

Modern applications, such as those running on microservices architectures, generate large volumes of data in the form of metrics, logs, and events. Use Amazon CloudWatch to collect, access, and correlate this data on a single platform from across all your AWS resources, applications, and services that run on AWS and on-premises servers, helping you break down data silos so you can easily gain system-wide visibility and quickly resolve issues.

CloudWatch simplifies the collection of technical metrics as it natively integrates with more than 70 AWS services (including Amazon EC2, AWS Lambda, Amazon ECS, Amazon EKS, Amazon DynamoDB, Amazon S3) and automatically publishes detailed 1-minute metrics and custom metrics with up to one-second granularity so you can dive deep into your logs for additional context. You can also use CloudWatch in hybrid cloud architectures by using the CloudWatch agent or API to monitor your on-premises resources.
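As a hedged sketch of publishing a custom metric at one-second granularity, the following Python code builds a datum for the CloudWatch `PutMetricData` API with `StorageResolution` set to 1. The helper names, namespace, metric name, and dimension are assumptions for illustration; the call itself requires the boto3 package and AWS credentials.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_high_resolution_datum(metric_name: str, value: float,
                                unit: str = "Milliseconds",
                                dimensions: Optional[Dict[str, str]] = None,
                                timestamp: Optional[datetime] = None) -> dict:
    """Build one MetricData entry stored at 1-second resolution."""
    return {
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v}
                       for k, v in sorted((dimensions or {}).items())],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        "StorageResolution": 1,  # 1 = high resolution, 60 = standard
    }


def publish(namespace: str, datum: dict, client=None):
    """Send the datum to CloudWatch; a client can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("cloudwatch")
    return client.put_metric_data(Namespace=namespace, MetricData=[datum])
```

A call such as `publish("MyApp/Checkout", build_high_resolution_datum("Latency", 12.5, dimensions={"Service": "checkout"}))` would publish a single data point; the namespace and names here are placeholders.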

Visibility into your AWS account activity is a key aspect of monitoring, security and operational best practices. AWS CloudTrail is an AWS service that helps you enable governance, compliance, and operational and risk auditing of your AWS account. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail. Events include actions taken in the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDKs and APIs.

CloudWatch and CloudTrail enable you to explore, download, archive, analyze and visualize your events, and respond to account activity across your AWS infrastructure. You can identify who or what took which action, what resources were acted upon, when the event occurred, and other details to help you analyze and respond to activity in your AWS account.
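To illustrate answering "who did what, to which resource, and when", the sketch below summarizes a single CloudTrail event record that has already been loaded as a Python dictionary. The function name is an assumption, and because not every record carries every field, missing fields fall back to "unknown".

```python
def summarize_cloudtrail_record(record: dict) -> dict:
    """Extract who acted, what they did, on which resources, and when."""
    identity = record.get("userIdentity", {})
    # Prefer the caller ARN; fall back to the invoking service or identity type.
    who = identity.get("arn") or identity.get("invokedBy") or identity.get("type", "unknown")
    resources = [r.get("ARN", "unknown") for r in record.get("resources", [])]
    return {
        "who": who,
        "action": "{}:{}".format(record.get("eventSource", "unknown"),
                                 record.get("eventName", "unknown")),
        "resources": resources,
        "when": record.get("eventTime", "unknown"),
        "source_ip": record.get("sourceIPAddress", "unknown"),
    }
```

A summary like this can be attached to an alert or ticket so that responders do not need to read the full JSON record during an investigation.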

Advance

Once workloads have been designed to provide the information necessary to understand their internal state (metrics, logs, and traces), ensure that the correct events are being propagated. Track events at the correct level of granularity and implement a mechanism to review and update thresholds, limits, and event handling rules. Centrally collect and store events of interest to help teams view, investigate, and resolve them. A centralized and standardized experience improves the timely detection, management, and remediation of events and lowers the barrier to onboarding new operations engineers responsible for managing them.

Business and operational metrics derived from desired business outcomes enable you to understand the health of your workload, prioritize operations activities, and respond to events. Establishing metric baselines helps to improve operations, investigation, and intervention. Use established runbooks for well-understood events, and use playbooks to aid in investigation and resolution of issues. Prioritize responses to events based on their business and customer impact. Ensure that, if an alert is raised in response to an event, there is an associated process to be followed, with a specifically identified owner. Define, in advance, the personnel required to resolve an event and include escalation triggers to engage additional personnel as it becomes necessary, based on urgency and impact. Identify and engage individuals with the authority to decide on the course of action where there will be a business impact from an event response not previously addressed or authorized.
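The requirement that every alert has an owner, a process, and an escalation path can be checked mechanically. The following is a minimal sketch under the assumption that alert definitions are kept as structured data; `AlertDefinition` and `find_governance_gaps` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class AlertDefinition:
    name: str
    owner: Optional[str] = None        # specifically identified owner
    runbook: Optional[str] = None      # process to follow when the alert fires
    escalation_contacts: List[str] = field(default_factory=list)


def find_governance_gaps(alerts: Iterable[AlertDefinition]) -> Dict[str, List[str]]:
    """Map each incomplete alert name to the list of fields it is missing."""
    gaps: Dict[str, List[str]] = {}
    for alert in alerts:
        missing = []
        if not alert.owner:
            missing.append("owner")
        if not alert.runbook:
            missing.append("runbook")
        if not alert.escalation_contacts:
            missing.append("escalation_contacts")
        if missing:
            gaps[alert.name] = missing
    return gaps
```

Running a check like this during reviews of alert configuration helps surface alerts that would fire without anyone being responsible for the response.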

You can use CloudWatch anomaly detection to detect anomalous behavior in your environments. When you enable anomaly detection for a metric, CloudWatch applies machine learning (ML) algorithms to the metric's past data to create a model of the metric's expected values. The model assesses both trends and hourly, daily, and weekly patterns of the metric. Use CloudWatch to set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
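As a sketch of alarming on an anomaly detection band, the code below builds parameters for the CloudWatch `PutMetricAlarm` API using the `ANOMALY_DETECTION_BAND` metric math function. The helper names, default period, and band width are assumptions for illustration; creating the alarm requires boto3 and AWS credentials.

```python
from typing import List, Optional


def build_anomaly_alarm(alarm_name: str, namespace: str, metric_name: str,
                        band_width: float = 2, period: int = 300,
                        evaluation_periods: int = 3,
                        alarm_actions: Optional[List[str]] = None) -> dict:
    """Build PutMetricAlarm parameters that fire when the metric leaves the band."""
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "ThresholdMetricId": "ad1",  # the band expression acts as the threshold
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "ad1",
                "Expression": "ANOMALY_DETECTION_BAND(m1, {})".format(band_width),
                "Label": "{} (expected)".format(metric_name),
                "ReturnData": True,
            },
        ],
        "AlarmActions": list(alarm_actions or []),
    }


def create_alarm(params: dict, client=None):
    """Create the alarm; a client can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("cloudwatch")
    return client.put_metric_alarm(**params)
```

A wider band (a larger second argument) tolerates more deviation from the model's expected values before the alarm fires.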

Several AWS services publish CloudWatch metrics that can be used to gain system-wide visibility into resource utilization, application performance, and operational health. However, in distributed systems that use these services, your application telemetry should also capture enough information to enable situational awareness. Instrumentation requires explicit code that records how long tasks take, how often certain code paths are executed, metadata about what the task was working on, and what parts of the tasks succeeded or failed. Further, it may be important to follow the flow of a request using a trace ID as the request enters the system and passes through various systems before it is fulfilled. AWS X-Ray makes it easy to analyze the behavior of your distributed applications by providing request tracing, exception collection, and profiling capabilities.
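The kind of explicit instrumentation described above can be sketched with a small context manager that records duration, metadata, outcome, and a trace ID for each task. This is a plain-Python illustration, not the AWS X-Ray SDK; `timed_task` and the record fields are assumed names.

```python
import time
import uuid
from contextlib import contextmanager
from typing import List, Optional


@contextmanager
def timed_task(name: str, records: List[dict],
               trace_id: Optional[str] = None, **metadata):
    """Record how long a task took, what it worked on, and whether it failed."""
    record = {
        "task": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # reuse the caller's ID if given
        "metadata": metadata,
        "status": "succeeded",
    }
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = type(exc).__name__
        raise  # instrumentation must not swallow the failure
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        records.append(record)
```

Passing the outer record's `trace_id` into nested calls lets all records for one request be correlated later, which is the same idea a tracing service applies across process boundaries.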

AWS Systems Manager OpsCenter provides a central location to view, investigate, and resolve OpsItems related to AWS resources. OpsCenter is integrated with Amazon EventBridge and CloudWatch and designed to reduce mean time to resolution for issues impacting AWS resources. OpsCenter aggregates information from AWS Config, CloudTrail logs, and Amazon EventBridge events, so you don't have to navigate across multiple console pages during your investigation.
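OpsItems can also be created programmatically. The sketch below builds parameters for the Systems Manager `CreateOpsItem` API; the helper names, the default severity, and the operational data key are assumptions for illustration, and the call requires boto3 and AWS credentials.

```python
from typing import Dict, Optional


def build_ops_item(title: str, description: str, source: str,
                   severity: str = "3",
                   operational_data: Optional[Dict[str, str]] = None) -> dict:
    """Build CreateOpsItem parameters with searchable operational data."""
    return {
        "Title": title,
        "Description": description,
        "Source": source,
        "Severity": severity,
        "OperationalData": {
            key: {"Value": str(value), "Type": "SearchableString"}
            for key, value in (operational_data or {}).items()
        },
    }


def create_ops_item(params: dict, client=None):
    """Create the OpsItem; a client can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("ssm")
    return client.create_ops_item(**params)
```

Attaching context such as the originating alarm name as operational data keeps the information responders need in one place.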

Excel

Customers seeking to improve business goals such as availability, mean time to detect (MTTD), and mean time to resolution (MTTR) are often challenged with identifying the correct KPIs. Production applications can experience a wide variety of issues, and proactively identifying all potential operational problems is a time-consuming and challenging task. This challenge is increasingly common in modern microservice-based architectures with distributed and decoupled components. Metrics and logs need to be gathered from workloads, and humans then need to assess them and form hypotheses about potential operational problems and resolutions. Knowing which metrics to measure and the purpose they serve, and implementing alert governance to separate signal from noise, can help reduce alert fatigue for operators. Too much noise can cause alerts to be missed or ignored, or responses to be delayed.
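One simple form of alert governance is suppressing repeats of the same alert within a time window. The sketch below is a hypothetical illustration; the function name, the five-minute default, and the use of a string key to identify an alert are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def suppress_duplicates(alerts: Iterable[Tuple[int, str]],
                        window_seconds: int = 300) -> List[Tuple[int, str]]:
    """Keep an alert only if the same key was not emitted within the window.

    `alerts` is an iterable of (timestamp_seconds, key) pairs sorted by
    timestamp, where `key` identifies the alert (for example, alarm name).
    """
    last_emitted: Dict[str, int] = {}
    kept: List[Tuple[int, str]] = []
    for timestamp, key in alerts:
        previous = last_emitted.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            kept.append((timestamp, key))
            last_emitted[key] = timestamp
    return kept
```

The window is measured from the last alert that was actually emitted, so a continuously firing alert is still surfaced once per window rather than silenced indefinitely.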

Amazon DevOps Guru provides ML-powered insights that make it easier for developers and operators to automatically detect issues and improve application availability and performance. Amazon DevOps Guru, which can be enabled at the AWS account or CloudFormation stack level, can detect issues by correlating metric anomalies, traces, changes, and log events triggered by incidents. The service produces insights, which are collections of identified anomalies containing observations, recommendations, and analytical data that you can use to improve your operational performance.

Each insight contains a list of the metrics and events that were used to identify the unusual behavior, as well as specific recommendations that can help you improve the performance of your application. These insights can be surfaced directly within OpsCenter dashboards (as OpsItems) and integrated into a workflow for team visibility; you can also perform actions such as running runbooks (Systems Manager automation documents). Examples of automatically detected operational issues include missing or misconfigured alarms, early warning signs of resource exhaustion, and code and configuration changes that could lead to outages. DevOps Guru insights can notify engineers using Amazon SNS and perform customized actions using AWS Lambda functions for further workflow integration, such as with ticketing systems or AWS Config.
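As a hedged sketch of the SNS-to-Lambda integration, the handler below unwraps SNS records and turns each notification into a ticket payload. The `Records[].Sns.Message` envelope is the standard shape of SNS events delivered to Lambda; the message field names (`InsightId`, `InsightSeverity`, `InsightDescription`) and the ticket format are assumptions that should be checked against the notifications your account actually receives.

```python
import json
from typing import List


def handler(event: dict, context=None) -> List[dict]:
    """Convert SNS-delivered insight notifications into ticket payloads."""
    tickets = []
    for record in event.get("Records", []):
        # SNS delivers the notification body as a JSON-encoded string.
        message = json.loads(record["Sns"]["Message"])
        tickets.append({
            "external_id": message.get("InsightId", "unknown"),
            "priority": message.get("InsightSeverity", "unknown").lower(),
            "summary": message.get("InsightDescription", "No description provided"),
        })
    # A real integration would send each payload to the ticketing system here.
    return tickets
```

Returning the payloads instead of sending them keeps the sketch free of any particular ticketing system's API, which would be specific to your environment.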