
Incident management with Incident Detection and Response - AWS Incident Detection and Response User Guide

Incident management with Incident Detection and Response

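AWS Incident Detection and Response provides proactive monitoring and incident management 24 hours a day, 7 days a week, delivered by a designated team of incident managers. The following diagram outlines the standard incident management process after an application alarm triggers an incident: alarm generation, AWS incident manager engagement, incident resolution, and post-incident review.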

Diagram: standard incident management process
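  1. Alarm generation: Alarms triggered on your workload are pushed to AWS Incident Detection and Response through Amazon EventBridge. AWS Incident Detection and Response automatically pulls up the runbook associated with your alarm and notifies an incident manager. If a critical incident occurs on your workload that is not detected by the alarms that AWS Incident Detection and Response monitors, you can create a support case to request an incident response. For more information about requesting an incident response, see Request an Incident Response.

  2. AWS incident manager engagement: The incident manager responds to the alarm and engages you on a conference call or by the other means specified in the runbook. The incident manager verifies the health of AWS services to determine whether the alarm relates to a problem with the AWS services that the workload uses, and advises you on the status of the underlying services. If required, the incident manager creates a case on your behalf and engages the appropriate AWS experts for support. Because AWS Incident Detection and Response monitors AWS services specifically for your applications, it might determine that an incident is related to an AWS service problem before an AWS service event is declared. In that case, the incident manager advises you on the status of the AWS service, triggers the AWS service event incident management workflow, and follows up with the service team on resolution. This information gives you the opportunity to implement your recovery plans or workarounds early and reduce the impact of the AWS service event.

    Sometimes an alarm triggers and then recovers quickly. In that case, the incident manager sends case correspondence stating that the alarm has recovered, but does not engage you. However, if the alarm triggers multiple times within 15 minutes, the incident manager engages you as directed by the runbook, even if the alarm has recovered.

  3. Incident resolution: The incident manager coordinates the incident across the required AWS teams and makes sure that you stay engaged with the right AWS experts until the incident is mitigated or resolved.

  4. Post-incident review (on request): After an incident, AWS Incident Detection and Response performs a post-incident review at your request and generates a post-incident report. The post-incident report includes a description of the problem, the impact of the incident, the teams that were engaged, and the workarounds or actions taken to mitigate or resolve the incident. It might also contain information on how to reduce the likelihood of recurrence, or on how to improve the handling of a similar incident in the future. The post-incident report is not a root cause analysis (RCA). You can request an RCA in addition to the post-incident report. An example post-incident report is provided below.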
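The quick-recovery rule in step 2 can be sketched in a few lines of Python. This is only an illustration, not the service's actual logic: the 15-minute window comes from the text above, but the function name `should_engage`, the reading of "multiple" as two or more triggers, and the assumption that an unrecovered alarm always leads to engagement are all hypothetical simplifications.

```python
from datetime import datetime, timedelta, timezone

# The 15-minute window is taken from the text; treating "multiple" as
# two or more triggers is an assumption made for this sketch.
WINDOW = timedelta(minutes=15)
MIN_TRIGGERS = 2


def should_engage(trigger_times, recovered, window=WINDOW, min_triggers=MIN_TRIGGERS):
    """Return True if, under this simplified model, the customer is engaged.

    trigger_times: datetimes at which the alarm entered the ALARM state.
    recovered: whether the alarm has since returned to a healthy state.
    """
    # Simplification: an alarm that has not recovered always leads to engagement.
    if not recovered:
        return True
    times = sorted(trigger_times)
    # A recovered alarm leads to engagement only if at least `min_triggers`
    # triggers fall inside a single span of length `window`.
    for i in range(len(times) - min_triggers + 1):
        if times[i + min_triggers - 1] - times[i] <= window:
            return True
    return False


if __name__ == "__main__":
    t0 = datetime(2023, 2, 4, 3, 25, tzinfo=timezone.utc)
    # One trigger that recovered: correspondence only, no engagement.
    print(should_engage([t0], recovered=True))
    # Two triggers ten minutes apart: engagement even though the alarm recovered.
    print(should_engage([t0, t0 + timedelta(minutes=10)], recovered=True))
```

Your runbook defines the real engagement criteria; the sketch only makes the difference between a single transient alarm and a repeatedly triggering one concrete.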

Important

The following report template is provided for reference only.

Post Incident Report Template

Post Incident Report - 0000000123
Customer: Example Customer
AWS Support case ID(s): 0000000000
Customer internal case ID (if provided): 1234567890
Incident start: 2023-02-04T03:25:00 UTC
Incident resolved: 2023-02-04T04:27:00 UTC
Total Incident time: 1:02:00
Source Alarm ARN: arn:aws:cloudwatch:us-east-1:000000000000:alarm:alarm-prod-workload-impaired-useast1-P95

Problem Statement:
Outlines impact to end users and operational infrastructure impact.

Starting at 2023-02-04T03:25:00 UTC, the customer experienced a large-scale outage of their workload that lasted one hour and two minutes and spanned all Availability Zones where the application is deployed. During the impact, end users were unable to connect to the workload's Application Load Balancers (ALBs), which service inbound communications to the application.

Incident Summary:
Summary of the incident in chronological order and steps taken by AWS Incident Managers to direct the incident to a path to mitigation.

At 2023-02-04T03:25:00 UTC, the workload impairments alarm triggered a critical incident for the workload. AWS Incident Detection and Response Managers responded to the alarm, checking AWS service health and following the steps outlined in the workload's runbook.

At 2023-02-04T03:28:00 UTC, per the runbook, the alarm had not recovered, and the Incident Management team sent the engagement email to the customer's Site Reliability Engineering (SRE) team, created a troubleshooting bridge, and opened an AWS Support case on behalf of the customer.

At 2023-02-04T03:32:00 UTC, the customer's SRE team and AWS Support Engineering joined the bridge. The Incident Manager confirmed there was no ongoing AWS impact to services the workload depends on. The investigation shifted to the specific resources in the customer account.

At 2023-02-04T03:45:00 UTC, the Cloud Support Engineer discovered that a sudden increase in traffic volume was causing a drop in connections. The customer confirmed that the ALB was newly provisioned to handle an increase in workload traffic for an ongoing promotional event.

At 2023-02-04T03:56:00 UTC, the customer instituted back-off and retry logic. The Incident Manager worked with the Cloud Support Engineer to raise an escalation to a higher support level to quickly scale the ALB per the runbook.

At 2023-02-04T04:05:00 UTC, the ALB support team initiated scaling activities. The back-off and retry logic yielded mild recovery, but timeouts were still being seen for some clients.

By 2023-02-04T04:15:00 UTC, scaling activities were complete and metrics and alarms had returned to pre-incident levels. Connection timeouts subsided.

At 2023-02-04T04:27:00 UTC, per the runbook, the call was spun down after 10 minutes of recovery monitoring. Full mitigation was agreed upon between AWS and the customer.

Mitigation:
Describes what was done to mitigate the issue. NOTE: this is not a Root Cause Analysis (RCA).

Back-off and retries yielded mild recovery. Full mitigation happened after escalation to the ALB support team (per the runbook) to scale the newly provisioned ALB.

Follow up action items (if any):
Action items to be reviewed with your Technical Account Manager (TAM), if required.

Review alarm thresholds to engage AWS Incident Detection and Response closer to the time of impact.
Work with AWS Support and the TAM team to ensure newly created ALBs are pre-scaled to accommodate expected spikes in workload traffic.
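As a small worked example of the header fields in the sample report, the following Python sketch derives the total incident time from the start and resolved timestamps and splits the source alarm ARN into its parts. The helper names (`parse_utc`, `format_duration`, `parse_alarm_arn`) are hypothetical and exist only for this illustration; the timestamp layout and ARN are copied from the sample report above.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SS UTC' timestamp as written in the sample report."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S UTC").replace(tzinfo=timezone.utc)


def format_duration(start, end):
    """Format the elapsed time between two datetimes as h:mm:ss."""
    total = int((end - start).total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def parse_alarm_arn(arn):
    """Split a CloudWatch alarm ARN of the form
    arn:partition:cloudwatch:region:account-id:alarm:alarm-name."""
    parts = arn.split(":", 6)  # the alarm name is kept whole as the last part
    if len(parts) != 7 or parts[0] != "arn" or parts[5] != "alarm":
        raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
    return {
        "partition": parts[1],
        "service": parts[2],
        "region": parts[3],
        "account_id": parts[4],
        "alarm_name": parts[6],
    }


if __name__ == "__main__":
    start = parse_utc("2023-02-04T03:25:00 UTC")
    resolved = parse_utc("2023-02-04T04:27:00 UTC")
    print(format_duration(start, resolved))  # prints 1:02:00

    arn = ("arn:aws:cloudwatch:us-east-1:000000000000:"
           "alarm:alarm-prod-workload-impaired-useast1-P95")
    print(parse_alarm_arn(arn)["region"])  # prints us-east-1
```

The computed duration matches the "Total Incident time" field in the sample report, which is one way to sanity-check a report's header against its timeline.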