Workload onboarding and alarm ingestion questionnaires in Incident Detection and Response (exception path)
Note
If you can't use the IDR CLI to onboard your workload, use the following questionnaires for workload and alarm onboarding.
This topic provides the questionnaires you need to complete when onboarding a workload to AWS Incident Detection and Response and when configuring alarms to ingest into the service. The workload onboarding questionnaire covers general information about your workload, its architecture details, and contacts for incident response. In the alarm ingestion questionnaire, you specify the critical alarms that trigger incident creation in Incident Detection and Response for your workload, as well as runbook information on who to contact and what actions to take. Properly completing these questionnaires is a key step in setting up monitoring and incident response processes for your AWS workloads.
Download the Workload onboarding questionnaire:
Download the Alarm ingestion questionnaire:
Workload onboarding questionnaire - General questions
| Question | Example Response |
|---|---|
| Enterprise Name | Amazon Inc. |
| Name of this workload (include any abbreviations) | Amazon Retail Operations (ARO) |
| Primary end user and the function of this workload. | This workload is an e-commerce application that allows end users to purchase various items. This workload is the primary revenue generator for our business. |
Workload onboarding questionnaire - Architecture questions
| Question | Example Response |
|---|---|
| A list of AWS resource tags used to define resources that are part of this workload. AWS uses these tags to identify this workload's resources to expedite support during incidents. Note: Tags are case sensitive. If you provide multiple tags, all resources used by this workload must have the same tags. | appName: Optimax, environment: Production |
| A list of AWS service(s) utilized by this workload, and the AWS account(s) and AWS Region(s) that they are in. | AWS services: Route 53, ALB, ECS, ... Accounts: 123456789101, 123456789102, ... Regions: US-EAST-1, US-WEST-2, ... |
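Because tags are case sensitive and every resource in the workload must carry the same tags, it can help to check your inventory before submitting the questionnaire. The following is a minimal sketch, not an AWS-provided tool: the function name `resources_missing_tags` and the sample inventory are hypothetical, and in practice you would populate the inventory from your own tooling.

```python
from typing import Dict, List

# Example tag set taken from the questionnaire's sample response.
REQUIRED_TAGS = {"appName": "Optimax", "environment": "Production"}


def resources_missing_tags(
    resources: Dict[str, Dict[str, str]],
    required: Dict[str, str],
) -> List[str]:
    """Return the IDs of resources that lack any required tag.

    Keys and values are compared exactly, because tags are case sensitive.
    """
    missing = []
    for resource_id, tags in resources.items():
        if any(tags.get(key) != value for key, value in required.items()):
            missing.append(resource_id)
    return sorted(missing)


# Hypothetical inventory: resource identifier -> tags found on that resource.
inventory = {
    "ecs-service-a": {"appName": "Optimax", "environment": "Production"},
    "alb-frontend": {"appName": "Optimax", "environment": "production"},  # wrong case
    "r53-zone": {"appName": "Optimax"},  # environment tag is absent
}

print(resources_missing_tags(inventory, REQUIRED_TAGS))
```

Any resource the check reports would not be matched by the tags you list, so fix its tags before onboarding.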
Alarm ingestion questionnaire - Overview
In the alarm ingestion questionnaire, you specify the critical alarms for your workload that you want to engage AWS Incident Detection and Response, as well as the contacts you want an Incident Management Engineer to engage when these alarms trigger.
The Alarm Ingestion Questionnaire is divided into the following sections:
Contact section: First, specify the primary contact(s) to be included on the Support Case created with AWS Incident Detection and Response when an alarm triggers, as well as your preferred conferencing application for incident bridges. If no bridge preference is provided, AWS Incident Detection and Response will create an incident bridge during incidents. Next, specify escalation contacts and time intervals to engage them when primary contacts are unreachable. Finally, list any contacts who should receive regular incident status updates through the support case for the duration of an incident.
Alarm matrix: List the set of alarms that will engage AWS Incident Detection and Response when triggered. See the "Critical Alarm Criteria" defined by AWS Incident Detection and Response when selecting alarms for onboarding. For more information, see Alarm definition.
Amazon CloudWatch Alarms (leave this section blank if you don't have Amazon CloudWatch alarms)
Third party APM alarms (leave this section blank if you don't have Third party APM alarms)
EventBridge EventBus ARN: This is the ARN of the custom event bus that you created in Ingest Alarms from APMs with direct EventBridge integration or Ingest alarms from APMs without direct integration with EventBridge.
Alarm Identifiers: Share the account number, region, and name of the APM alarm.
Alarm ingestion questionnaire - Runbook questions
| Question | Example Response |
|---|---|
| AWS engages workload contacts through the Support case. Who is the primary contact when an alarm triggers for this workload? Specify your preferred conferencing application and AWS will request these details during an incident. Note: If a preferred conferencing application isn't provided, then AWS will reach out during an incident and provide a Chime bridge for you to join. | Application Team app@example.com +61 2 3456 7890 |
| If the primary contact is unavailable during an incident, please provide escalation contacts and timeline in the preferred communication order. | 1. After 10 minutes, if no response from Primary Contact, engage: John Smith - Application Supervisor john.smith@example.com +61 2 3456 7890 2. After 10 minutes, if no response from John Smith, contact: Jane Smith - Operations Manager jane.smith@example.com +61 2 3456 7890 |
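To sanity-check an escalation timeline before you submit it, you can model the steps and compute when each contact would be engaged. This is an illustrative sketch only: the `EscalationStep` class and `engagement_schedule` function are hypothetical names, and the sketch assumes the primary contact is engaged at minute 0 and that each wait interval starts when the previous contact was engaged.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EscalationStep:
    contact: str       # who to engage at this step
    wait_minutes: int  # minutes to wait on the previous contact before engaging


def engagement_schedule(
    primary: str, steps: List[EscalationStep]
) -> List[Tuple[int, str]]:
    """Return (minutes after first engagement, contact) pairs in order."""
    schedule = [(0, primary)]
    elapsed = 0
    for step in steps:
        elapsed += step.wait_minutes
        schedule.append((elapsed, step.contact))
    return schedule


# Contacts and intervals from the example response in the table above.
steps = [
    EscalationStep("John Smith - Application Supervisor", 10),
    EscalationStep("Jane Smith - Operations Manager", 10),
]

for minutes, contact in engagement_schedule("Application Team", steps):
    print(f"T+{minutes} min: {contact}")
```

With the example response, the second escalation contact is reached 20 minutes after the primary contact was first engaged, which is worth confirming against your own response-time expectations.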
Alarm matrix
Provide the following information to identify the set of alarms that will engage AWS Incident Detection and Response to create incidents on behalf of your workload. Once engineers from AWS Incident Detection and Response have reviewed your alarms, additional onboarding steps will be delivered.
AWS Incident Detection and Response Critical alarm criteria:
AWS Incident Detection and Response alarms should only enter "Alarm" state upon significant business impact to the monitored workload (loss of revenue/degraded customer experience) that requires immediate operator attention.
AWS Incident Detection and Response alarms must also engage your resolvers for the workload at the same time as, or before, they engage AWS Incident Detection and Response. AWS Incident Managers collaborate with your resolvers in the mitigation process, and do not serve as first-line responders who then escalate to you.
AWS Incident Detection and Response alarm thresholds must be set to an appropriate threshold and duration so that an investigation is warranted any time an alarm fires. If an alarm is moving between the "Alarm" and "OK" states, sufficient impact is occurring to warrant operator response and attention.
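As one way to express "an appropriate threshold and duration", the sketch below builds parameters for the CloudWatch `PutMetricAlarm` API for an ALB target 5XX alarm. The threshold, period, and datapoint values are illustrative assumptions, not AWS recommendations, and the helper name `critical_alarm_params`, the load balancer value, and the SNS topic ARN are hypothetical. The alarm action notifies your own resolvers, in line with the criterion that they are engaged no later than AWS.

```python
from typing import Any, Dict


def critical_alarm_params(
    alarm_name: str, load_balancer: str, resolver_topic_arn: str
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,                # one-minute datapoints
        "EvaluationPeriods": 5,      # look at the last 5 datapoints
        "DatapointsToAlarm": 3,      # require 3 breaches to reduce flapping
        "Threshold": 50.0,           # illustrative value; tune to business impact
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [resolver_topic_arn],  # pages your resolvers directly
    }


params = critical_alarm_params(
    "ALB_5xx_Target_Response",
    "app/example-alb/0123456789abcdef",                    # hypothetical
    "arn:aws:sns:us-east-1:123456789012:resolver-paging",  # hypothetical
)
print(params["AlarmName"], params["Threshold"])

# To create the alarm, assuming boto3 is installed and credentials are set:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring several breaching datapoints within the evaluation window is one common way to keep an alarm from entering the "Alarm" state on a brief spike that needs no operator attention.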
AWS Incident Detection and Response Policy for criteria violations:
These criteria can only be evaluated on a case-by-case basis as events occur. The Incident Management team works with your technical account managers (TAMs) to adjust alarms and, in rare cases, disable monitoring if it is suspected that customer alarms do not adhere to these criteria and are regularly engaging the Incident Management team unnecessarily.
Important
Provide group distribution email addresses when supplying contact addresses, so that you can control recipient additions and deletions without runbook updates.
Provide the contact phone number for your site reliability engineering (SRE) team if you would like the AWS Incident Detection and Response team to call them after sending an initial engagement email.
| CloudWatch alarm ARN | Primary contact for this alarm. (If different from workload primary contact) | Specify the most relevant AWS service for this alarm to engage the right engineer. Enter N/A if not needed. |
|---|---|---|
| Example: | Example: Sam Smith - Application Manager sam.smith@example.com +61 2 3456 7890 | Example: ECS |
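Before listing CloudWatch alarm ARNs, you may want to confirm that each one is well formed and points at the account and Region you expect. The sketch below relies on the standard ARN layout `arn:<partition>:cloudwatch:<region>:<account-id>:alarm:<alarm-name>`. The names `AlarmRef` and `parse_alarm_arn` are hypothetical, and the sample ARN is constructed for illustration from the example values used elsewhere in this topic.

```python
from typing import NamedTuple


class AlarmRef(NamedTuple):
    region: str
    account_id: str
    alarm_name: str


def parse_alarm_arn(arn: str) -> AlarmRef:
    """Split a CloudWatch alarm ARN into Region, account ID, and alarm name."""
    # Limit the split so that colons inside the alarm name are preserved.
    parts = arn.split(":", 6)
    if (
        len(parts) != 7
        or parts[0] != "arn"
        or parts[2] != "cloudwatch"
        or parts[5] != "alarm"
    ):
        raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
    region, account_id, alarm_name = parts[3], parts[4], parts[6]
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"account ID must be 12 digits: {account_id!r}")
    if not region or not alarm_name:
        raise ValueError(f"Region and alarm name are required: {arn!r}")
    return AlarmRef(region, account_id, alarm_name)


# Illustrative ARN only; substitute the ARNs of your own alarms.
sample = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ALB_5xx_Target_Response"
print(parse_alarm_arn(sample))
```

Running each ARN through a check like this catches copy-and-paste truncation and account or Region mix-ups before the questionnaire is reviewed.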
| EventBridge Event Bus ARN (This is created as part of the third-party APM integration to route alerts to AWS Incident Detection and Response.) |
|---|
| Example: (There will be an event bus per Account/Region combination) |
| Alarm Identifier | What does this metric represent? Why is this alarm important? | Primary contact for this alarm. (If different from workload primary contact) | Specify the most relevant AWS service for this alarm to engage the right engineer. Enter N/A if not needed. |
|---|---|---|---|
| Example: ALB_5xx_Target_Response Account ID: 123456789012 Region: us-east-1 | Example: This metric represents transaction responses from the targets behind the ALB. If 5XX errors exceed the threshold, it represents a critical failure to process business transactions. | Example: Sam Smith - Application Manager sam.smith@example.com +61 2 3456 7890 | Example: ECS |