Incident management with Incident Detection and Response
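AWS Incident Detection and Response provides 24x7 proactive monitoring and incident management delivered by a designated team of incident managers. The following diagram outlines the standard incident management process when an application alarm triggers an incident, including alarm generation, AWS Incident Manager engagement, incident resolution, and post-incident review.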
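Alarm generation: Alarms triggered on your workloads are pushed through Amazon EventBridge to AWS Incident Detection and Response. AWS Incident Detection and Response automatically pulls up the runbook associated with your alarm and notifies an incident manager. If a critical incident occurs on your workload and the alarms monitored by AWS Incident Detection and Response do not detect it, you can create a support case to request an incident response. For more information about requesting an incident response, see Request an Incident Response.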
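As an illustration of the alarm-generation step, the sketch below builds an EventBridge event pattern that selects CloudWatch alarm transitions into the ALARM state. The `source`, `detail-type`, `resources`, and `detail.state.value` fields follow the standard CloudWatch alarm state change event. The helper names (`alarm_event_pattern`, `matches`) and the idea that onboarding uses exactly this pattern are assumptions made for illustration, and the matcher is a simplified local stand-in for EventBridge matching, not the service's implementation.

```python
import json

# Example alarm ARN taken from the sample report later in this section.
SAMPLE_ALARM_ARN = (
    "arn:aws:cloudwatch:us-east-1:000000000000:"
    "alarm:alarm-prod-workload-impaired-useast1-P95"
)


def alarm_event_pattern(alarm_arns):
    """Build an event pattern matching ALARM transitions for the given alarms."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "resources": list(alarm_arns),
        "detail": {"state": {"value": ["ALARM"]}},
    }


def matches(pattern, event):
    """Simplified matcher (assumption): supports only exact-value lists and
    nested objects, which is all the pattern above uses."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # An array field matches if any of its elements is allowed.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = alarm_event_pattern([SAMPLE_ALARM_ARN])
print(json.dumps(pattern, indent=2))
```

The printed JSON is the kind of document you would supply as a rule's event pattern. Which alarms to include, and where the rule sends matching events, depends on how your workload was onboarded.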
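AWS Incident Manager engagement: The incident manager responds to the alarm and engages you on a conference call, or as otherwise specified in the runbook. The incident manager verifies the health of AWS services to determine whether the alarm is related to an issue with the AWS services that the workload uses, and advises you on the status of the underlying services. If required, the incident manager creates a case on your behalf and engages the appropriate AWS experts for support. Because AWS Incident Detection and Response monitors AWS services specifically for your applications, it might determine that an incident is related to an AWS service issue before an AWS service event is declared. In this scenario, the incident manager advises you on the status of the AWS service, triggers the AWS service event incident management workflow, and follows up with the service team on the resolution. The information provided lets you implement your recovery plans or workarounds early to mitigate the impact of the AWS service event.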
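Sometimes an alarm triggers and recovers quickly. In this case, the incident manager sends a case correspondence stating that the alarm has recovered, but does not contact you. However, if the alarm triggers more than once within 15 minutes, the incident manager contacts you as instructed in your runbook, even if the alarm has recovered.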
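The re-trigger rule above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, its parameters, and the inclusive treatment of the 15-minute boundary are hypothetical, and the actual engagement behavior is determined by your runbook.

```python
from datetime import datetime, timedelta

# Window from the text: more than one trigger within 15 minutes.
RETRIGGER_WINDOW = timedelta(minutes=15)


def should_contact_customer(trigger_times, currently_in_alarm,
                            window=RETRIGGER_WINDOW):
    """Return True if the customer should be contacted.

    trigger_times: datetimes at which the alarm entered the ALARM state.
    currently_in_alarm: True if the alarm has not recovered.
    """
    if currently_in_alarm:
        # An alarm that has not recovered always leads to engagement.
        return True
    times = sorted(trigger_times)
    # Recovered alarm: contact only if two consecutive triggers fall inside
    # the window (boundary treated as inclusive, an assumption).
    return any(later - earlier <= window
               for earlier, later in zip(times, times[1:]))


first = datetime(2023, 2, 4, 3, 25)
print(should_contact_customer([first], currently_in_alarm=False))
print(should_contact_customer([first, first + timedelta(minutes=10)],
                              currently_in_alarm=False))
```

A single trigger that recovers results only in a case correspondence, while two triggers ten minutes apart result in contact even though the alarm has recovered.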
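Incident resolution: The incident manager coordinates the incident across the required AWS teams and makes sure that you remain engaged with the appropriate AWS experts until the incident is mitigated or resolved.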
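Post-incident review (if requested): After an incident, AWS Incident Detection and Response can perform a post-incident review at your request and generate a post-incident report. The post-incident report includes a description of the issue, the impact, the teams that were engaged, and the workarounds or actions taken to mitigate or resolve the incident. The report might contain information that you can use to reduce the likelihood of the incident recurring, or to improve the management of similar incidents in the future. The post-incident report is not a root cause analysis (RCA). You can request an RCA in addition to the post-incident report. The following section provides an example of a post-incident report.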
Important
The following report template is an example only.
Post Incident Report Template

Post Incident Report - 0000000123
Customer: Example Customer
AWS Support case ID(s): 0000000000
Customer internal case ID (if provided): 1234567890
Incident start: 2023-02-04T03:25:00 UTC
Incident resolved: 2023-02-04T04:27:00 UTC
Total Incident time: 1:02:00 (h:mm:ss)
Source Alarm ARN: arn:aws:cloudwatch:us-east-1:000000000000:alarm:alarm-prod-workload-impaired-useast1-P95

Problem Statement:
Outlines impact to end users and operational infrastructure impact.
Starting at 2023-02-04T03:25:00 UTC, the customer experienced a large-scale outage of their workload that lasted one hour and two minutes and spanned all Availability Zones where the application is deployed. During impact, end users were unable to connect to the workload's Application Load Balancers (ALBs), which service inbound communications to the application.

Incident Summary:
Summary of the incident in chronological order and steps taken by AWS Incident Managers to direct the incident to a path to mitigation.
At 2023-02-04T03:25:00 UTC, the workload impairments alarm triggered a critical incident for the workload. AWS Incident Detection and Response Managers responded to the alarm, checking AWS service health and the steps outlined in the workload’s runbook.
At 2023-02-04T03:28:00 UTC, per the runbook, the alarm had not recovered and the Incident Management team sent the engagement email to the customer’s Site Reliability Engineering (SRE) team, created a troubleshooting bridge, and opened an AWS Support case on behalf of the customer.
At 2023-02-04T03:32:00 UTC, the customer’s SRE team and AWS Support Engineering joined the bridge. The Incident Manager confirmed there was no ongoing AWS impact to services the workload depends on. The investigation shifted to the specific resources in the customer account.
At 2023-02-04T03:45:00 UTC, the Cloud Support Engineer discovered a sudden increase in traffic volume was causing a drop in connections. The customer confirmed this ALB was newly provisioned to handle an increase in workload traffic for an ongoing promotional event.
At 2023-02-04T03:56:00 UTC, the customer instituted back-off and retry logic. The Incident Manager worked with the Cloud Support Engineer to raise an escalation to a higher support level to quickly scale the ALB per the runbook.
At 2023-02-04T04:05:00 UTC, the ALB support team initiated scaling activities. The back-off/retry logic yielded mild recovery, but timeouts were still being seen for some clients.
By 2023-02-04T04:15:00 UTC, scaling activities had completed and metrics/alarms returned to pre-incident levels. Connection timeouts subsided.
At 2023-02-04T04:27:00 UTC, per the runbook, the call was spun down after 10 minutes of recovery monitoring. Full mitigation was agreed upon between AWS and the customer.

Mitigation:
Describes what was done to mitigate the issue. NOTE: this is not a Root Cause Analysis (RCA).
Back-off and retries yielded mild recovery. Full mitigation happened after escalation to the ALB support team (per runbook) to scale the newly provisioned ALB.

Follow up action items (if any):
Action items to be reviewed with your Technical Account Manager (TAM), if required.
Review alarm thresholds to engage AWS Incident Detection and Response closer to the time of impact.
Work with AWS Support and the TAM team to ensure newly created ALBs are pre-scaled to accommodate expected spikes in workload traffic.
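The header fields of the example report can be cross-checked with a few lines of Python. The sketch below derives the total incident time from the start and resolved timestamps, and splits the source alarm ARN into its components. The function names are hypothetical, and the timestamp format is assumed from the example report, not from any published report schema.

```python
from datetime import datetime, timezone

# Timestamp layout as it appears in the example report (assumed fixed).
REPORT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S UTC"


def parse_report_time(stamp):
    """Parse a report timestamp such as '2023-02-04T03:25:00 UTC'."""
    return datetime.strptime(stamp, REPORT_TIME_FORMAT).replace(
        tzinfo=timezone.utc)


def incident_duration(start, resolved):
    """Return the elapsed time between two report timestamps."""
    return parse_report_time(resolved) - parse_report_time(start)


def parse_alarm_arn(arn):
    """Split a CloudWatch alarm ARN into its colon-separated components."""
    parts = arn.split(":", 6)
    if len(parts) != 7 or parts[0] != "arn" or parts[5] != "alarm":
        raise ValueError(f"not a CloudWatch alarm ARN: {arn}")
    return {
        "partition": parts[1],
        "service": parts[2],
        "region": parts[3],
        "account_id": parts[4],
        "alarm_name": parts[6],
    }


duration = incident_duration("2023-02-04T03:25:00 UTC",
                             "2023-02-04T04:27:00 UTC")
print(duration)  # 1:02:00

arn = ("arn:aws:cloudwatch:us-east-1:000000000000:"
       "alarm:alarm-prod-workload-impaired-useast1-P95")
print(parse_alarm_arn(arn)["region"])  # us-east-1
```

The computed duration of 1:02:00 agrees with the "one hour and two minutes" stated in the problem statement.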