This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
Event management (AIOps)
Detect events, assess their potential impact, and determine the appropriate control action.
An event is an observation of an action, occurrence, or change of state. Events can be planned or unplanned, and they can originate internally or externally to the workload.
Using advanced machine learning techniques, you can reduce operational incidents and increase service quality. Artificial Intelligence for IT Operations (AIOps) can help by grouping related incidents, predicting incidents before they happen, and classifying new incidents and insights.
Start
Efficient and effective management of planned and unplanned operational events is required to achieve operational excellence.
Establish fine-grained alerts and thresholds with differing actions based on criticality. For example, an application serving a critical user flow, such as publishing documents, should have rapid escalations, while a less impactful issue, such as slow disk saturation on a less critical application, might be addressed during business hours.
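A minimal sketch of criticality-based routing follows. The tier names, action names, and the `Alert` type are all hypothetical and only illustrate the idea of mapping criticality to a different response; they are not part of any AWS API.

```python
from dataclasses import dataclass

# Hypothetical criticality tiers and actions; replace with your own escalation policy.
ESCALATION_POLICY = {
    "critical": "page_on_call",          # rapid escalation
    "high": "notify_team_channel",
    "low": "ticket_for_business_hours",  # addressed during business hours
}

# Assumption: alerts with an unrecognized criticality are surfaced to the team
# rather than silently deferred.
DEFAULT_ACTION = "notify_team_channel"


@dataclass(frozen=True)
class Alert:
    workload: str
    signal: str
    criticality: str


def route_alert(alert: Alert) -> str:
    """Return the action configured for the alert's criticality tier."""
    return ESCALATION_POLICY.get(alert.criticality, DEFAULT_ACTION)


publishing = Alert("document-publishing", "error_rate", "critical")
reporting = Alert("batch-reports", "disk_saturation", "low")
print(route_alert(publishing), route_alert(reporting))
```

Keeping the policy in one table makes it straightforward to review and adjust thresholds and actions as workloads change.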
Modern applications, such as those running on microservices architectures, generate large volumes of data in the form of metrics, logs, and events. Use Amazon CloudWatch to collect, access, and correlate this data on a single platform from across all your AWS resources, applications, and services that run on AWS and on-premises servers, helping you break down data silos so you can easily gain system-wide visibility and quickly resolve issues.
CloudWatch simplifies the collection of technical metrics as it natively integrates with more than 70 AWS services (including Amazon EC2, AWS Lambda, Amazon ECS, Amazon EKS, Amazon DynamoDB, Amazon S3) and automatically publishes detailed 1-minute metrics and custom metrics with up to one-second granularity so you can dive deep into your logs for additional context. You can also use CloudWatch in hybrid cloud architectures by using the CloudWatch agent or API to monitor your on-premises resources.
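As an illustration of publishing a custom metric at one-second granularity, the sketch below builds a datum in the shape accepted by the CloudWatch `PutMetricData` API, where `StorageResolution` set to 1 marks a high-resolution metric. The helper names, the namespace, and the metric name are assumptions made for the example; the client is passed in so the code does not depend on a particular SDK setup.

```python
from datetime import datetime, timezone


def build_high_resolution_datum(name, value, unit="Milliseconds", dimensions=None):
    """Build one MetricData entry with 1-second storage resolution."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        "StorageResolution": 1,  # 1 = high resolution, 60 = standard resolution
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


def publish(cloudwatch_client, namespace, datum):
    """Send the datum using a client that exposes put_metric_data (for example, boto3)."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical metric name and dimension values.
datum = build_high_resolution_datum(
    "PublishLatency", 182, dimensions={"Service": "document-publishing"}
)
```

With boto3 installed and credentials configured, this could be sent with `publish(boto3.client("cloudwatch"), "MyApp", datum)`, where `MyApp` is a placeholder namespace.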
Visibility into your AWS account activity is a key aspect of monitoring, security, and operational best practices. AWS CloudTrail is an AWS service that helps you enable governance, compliance, and operational and risk auditing of your AWS account. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail. Events include actions taken in the AWS Management Console.
CloudWatch and CloudTrail enable you to explore, download, archive, analyze and visualize your events, and respond to account activity across your AWS infrastructure. You can identify who or what took which action, what resources were acted upon, when the event occurred, and other details to help you analyze and respond to activity in your AWS account.
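To make the "who, what, which resource, when" questions concrete, here is a small sketch that summarizes a single CloudTrail event record. It relies on common record fields (`userIdentity`, `eventSource`, `eventName`, `eventTime`, `sourceIPAddress`, and the optional `resources` list); the function name and the sample values are hypothetical.

```python
def summarize_cloudtrail_record(record: dict) -> dict:
    """Extract who did what, to which resources, and when from one event record."""
    identity = record.get("userIdentity") or {}
    resources = record.get("resources") or []  # not every event lists resources
    return {
        "who": identity.get("arn") or identity.get("type", "unknown"),
        "action": f'{record.get("eventSource", "unknown")}:{record.get("eventName", "unknown")}',
        "when": record.get("eventTime"),
        "resources": [r["ARN"] for r in resources if r.get("ARN")],
        "source_ip": record.get("sourceIPAddress"),
    }


# Hypothetical record, trimmed to the fields used above.
sample = {
    "eventTime": "2023-01-01T12:00:00Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "DeleteBucket",
    "sourceIPAddress": "203.0.113.10",
    "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/example"},
    "resources": [{"ARN": "arn:aws:s3:::example-bucket", "type": "AWS::S3::Bucket"}],
}
summary = summarize_cloudtrail_record(sample)
```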
Advance
Once workloads have been designed to provide the information necessary to understand their internal state (metrics, logs, and traces), you should ensure that the correct events are being propagated. Track events at the correct level of granularity and implement a mechanism to review and update thresholds, limits, and event handling rules. Centrally collect and store events of interest to help teams view, investigate, and resolve them. A centralized and standardized experience improves timely management, detection, and remediation, and lowers the barrier to onboarding new operations engineers responsible for managing events.
Business and operational metrics derived from desired business outcomes enable you to understand the health of your workload, prioritize operations activities, and respond to events. Establishing metric baselines helps to improve operations, investigation, and intervention. Use established runbooks.
You can use CloudWatch anomaly detection to detect anomalous behavior in your environments. When you enable anomaly detection for a metric, CloudWatch applies machine learning (ML) algorithms to the metric's past data to create a model of the metric's expected values. The model assesses both trends and hourly, daily, and weekly patterns of the metric. Use CloudWatch to set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
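One way to alarm on the expected-value model is a CloudWatch alarm whose threshold is an `ANOMALY_DETECTION_BAND` metric math expression. The sketch below only builds the keyword arguments for a `PutMetricAlarm` request; the function name, the default band width of 2, the evaluation settings, and the metric names are assumptions for illustration and should be checked against the CloudWatch API reference before use.

```python
def anomaly_alarm_kwargs(alarm_name, namespace, metric_name,
                         band_width=2, period=300, stat="Average",
                         evaluation_periods=3):
    """Build PutMetricAlarm arguments that alarm when a metric leaves its expected band."""
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric is above the upper or below the lower edge of the band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical alarm, namespace, and metric names.
kwargs = anomaly_alarm_kwargs("publish-latency-anomaly", "MyApp", "PublishLatency")
```

With boto3, these arguments could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. A wider band produces fewer, higher-confidence alarms.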
Several AWS services publish CloudWatch metrics that can be used to gain system-wide visibility into resource utilization, application performance, and operational health. However, with distributed systems that use these services, your application telemetry should also capture information that enables situational awareness. Instrumentation requires explicit code that records how long tasks take, how often certain code paths are executed, metadata about what the task was working on, and what parts of the tasks succeeded or failed. Further, it may be important to follow the flow of a request using a trace ID, as the request enters the system and passes through various systems before it is fulfilled. AWS X-Ray makes it easy to analyze the behavior of your distributed applications by providing request tracing, exception collection, and profiling capabilities.
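The kind of explicit instrumentation described above can be sketched with a small context manager that records duration, outcome, metadata, and a trace ID for each task. This is a standard-library illustration only, not the AWS X-Ray SDK; the `instrumented` helper, the in-memory `RECORDS` sink, and the task names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

RECORDS = []  # in-memory sink for the example; a real system would emit logs or metrics


@contextmanager
def instrumented(task, trace_id=None, **metadata):
    """Record how long a task took, whether it succeeded, and what it worked on."""
    record = {
        "task": task,
        "trace_id": trace_id or uuid.uuid4().hex,  # reuse the caller's ID to follow a request
        "metadata": metadata,
    }
    start = time.perf_counter()
    try:
        yield record
        record["outcome"] = "success"
    except Exception as exc:
        record["outcome"] = "failure"
        record["error"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        RECORDS.append(record)


# The outer task creates a trace ID; the inner task reuses it.
with instrumented("handle_request", document_id="doc-1") as outer:
    with instrumented("render_document", trace_id=outer["trace_id"]):
        pass
```

Passing the same trace ID through each step is what allows records from different components to be stitched back into a single request flow.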
AWS Systems Manager OpsCenter provides a central location to view, investigate, and resolve OpsItems related to AWS resources. OpsCenter is integrated with Amazon EventBridge and CloudWatch and designed to reduce mean time to resolution for issues impacting AWS resources. OpsCenter aggregates information from AWS Config, CloudTrail logs, and Amazon EventBridge events, so you don't have to navigate across multiple console pages during your investigation.
Excel
Customers seeking to improve outcomes such as availability, mean time to detect (MTTD), and mean time to resolution (MTTR) are often challenged with identifying the correct KPIs. Production applications can experience a wide variety of issues, and proactively identifying all potential operational problems is a time-consuming and challenging task. This challenge is increasingly common in modern microservice-based architectures with distributed and decoupled components. Metrics and logs need to be gathered from workloads, and humans then need to assess them and hypothesize about potential operational problems and resolutions. Knowing which metrics to measure and the purpose they serve, and implementing alert governance to separate signal from noise, can help reduce alert fatigue for operators. Too much noise can lead to alerts being missed or ignored, or to responses being delayed.
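As a worked example of two such KPIs, the sketch below computes MTTD and MTTR from incident timestamps. It assumes MTTD is the mean of (detected minus started) and MTTR is the mean of (resolved minus started); some teams measure MTTR from detection instead, so treat these definitions, the field names, and the sample data as illustrative.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Mean elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Hypothetical incident log.
incidents = [
    {
        "started": datetime(2023, 1, 1, 10, 0),
        "detected": datetime(2023, 1, 1, 10, 5),
        "resolved": datetime(2023, 1, 1, 10, 35),
    },
    {
        "started": datetime(2023, 1, 2, 12, 0),
        "detected": datetime(2023, 1, 2, 12, 15),
        "resolved": datetime(2023, 1, 2, 13, 5),
    },
]

mttd = mean_minutes(incidents, "started", "detected")  # (5 + 15) / 2 = 10 minutes
mttr = mean_minutes(incidents, "started", "resolved")  # (35 + 65) / 2 = 50 minutes
```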
Amazon DevOps Guru provides ML-powered insights that help developers and operators automatically detect issues, so they can improve application availability and performance. Amazon DevOps Guru, which can be enabled at the AWS account or CloudFormation stack level, can detect issues by correlating metric anomalies, traces, changes, and log events triggered by incidents. The service produces insights, which are collections of identified anomalies containing observations, recommendations, and analytical data that you can use to improve your operational performance.
Each insight contains a list of the metrics and events that were used to identify the unusual behavior, as well as specific recommendations that can help you improve the performance of your application. These insights can be surfaced directly within OpsCenter dashboards (as OpsItems) and integrated into a workflow for team visibility; you can also perform actions such as running runbooks (Systems Manager automation documents). Examples of automatically detected operational issues include missing or misconfigured alarms, early warning signs of resource exhaustion, and code and configuration changes that could lead to outages. DevOps Guru insights can notify engineers using SNS and perform customized actions using Lambda functions for further workflow integration, such as with ticketing systems or AWS Config.
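A Lambda function subscribed to the SNS topic could forward each notification to a ticketing system, roughly as sketched below. The SNS-to-Lambda envelope (`Records[].Sns.Message`) is standard, but the payload field names read here (`InsightId`, `InsightSeverity`) and the `open_ticket` helper are assumptions; verify the actual notification schema before relying on them.

```python
import json


def extract_notifications(event: dict) -> list:
    """Decode the JSON message carried by each SNS record in a Lambda event."""
    notifications = []
    for record in event.get("Records", []):
        raw = record.get("Sns", {}).get("Message", "")
        try:
            body = json.loads(raw)
        except json.JSONDecodeError:
            body = None
        # Keep non-JSON or non-object messages so nothing is silently dropped.
        notifications.append(body if isinstance(body, dict) else {"raw": raw})
    return notifications


def open_ticket(notification: dict) -> dict:
    """Hypothetical stand-in for a ticketing-system call."""
    return {
        "title": f'Insight {notification.get("InsightId", "unknown")}',
        "severity": notification.get("InsightSeverity", "unknown"),
    }


def lambda_handler(event, context):
    """Entry point: open one ticket per notification."""
    return {"tickets": [open_ticket(n) for n in extract_notifications(event)]}


# Hypothetical event with one well-formed and one malformed message.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({"InsightId": "abc123", "InsightSeverity": "high"})}},
        {"Sns": {"Message": "not json"}},
    ]
}
result = lambda_handler(sample_event, None)
```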