

# Manage workloads in Incident Detection and Response
<a name="manage-workloads-idr"></a>

A key part of effective incident management is having the right processes and procedures in place to onboard, test, and maintain your monitored workloads. This section covers the essential steps: developing comprehensive runbooks and response plans to guide your teams through incidents, testing and validating new workloads before onboarding, requesting changes to update workload monitoring, suppressing alarms during planned activity, and offboarding workloads when required.

**Topics**
+ [Develop runbooks and response plans](idr-workloads-dev-runbook.md)
+ [Test onboarded workloads](idr-workloads-testing.md)
+ [Request changes to a workload](idr-workloads-change-request.md)
+ [Suppress alarms](idr-workloads-suppress-alarms.md)
+ [Offboard a workload](idr-workloads-offboard.md)

# Develop runbooks and response plans for responding to an incident in Incident Detection and Response
<a name="idr-workloads-dev-runbook"></a>

Incident Detection and Response uses information captured from your onboarding questionnaire to develop runbooks and response plans for managing incidents that affect your workloads. Runbooks document the steps that Incident Managers take when responding to an incident. Response plans are AWS Systems Manager (SSM) document templates used to trigger incidents, and each response plan is mapped to at least one of your workloads. The incident management team creates these templates from the information that you provide during [workload discovery](idr-gs-discovery.md). To learn more about SSM documents, see [AWS Systems Manager Documents](https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-ssm-docs.html). To learn more about Incident Manager, see [What Is AWS Systems Manager Incident Manager?](https://docs.aws.amazon.com/incident-manager/latest/userguide/index.html)

**Key outputs:**
+ Completion of your workload definition on AWS Incident Detection and Response.
+ Completion of your alarm, runbook, and response plan definitions on AWS Incident Detection and Response.

You can also download an AWS Incident Detection and Response Runbook example: [aws-idr-runbook-example.zip](samples/aws-idr-runbook-example.zip).

Example runbook:

```
Runbook template for AWS Incident Detection and Response
# Description
This document is intended for [CustomerName] [WorkloadName].  
  
[Insert short description of what the workload is intended for].

## Step: Priority
**Priority actions**
1. When a case is created with Incident Detection and Response, lock the case to yourself and verify the Customer Stakeholders on the case against *Engagement Plans - Initial Engagement*. 
2. Send the first correspondence on the support case to the customer as below. If there is no support case or if it is not possible to use the support case then backup communication details are listed in the steps that follow.

```
Hello,

This is <<Engineer's name>> from AWS Incident Detection and Response. An alarm has triggered for your workload <<application name>>. I am currently investigating and will update you in a few minutes after I have finished initial investigation.

Alarm Identifier - <insert CloudWatch Alarm ARN or APM Response Identifier>
```

**Compliance and regulatory requirements for the workload**   
<<e.g. The workload deals with patient health records which must be kept secured and confidential. Information not to be shared with any third parties.>>

**Actions required from Incident Detection and Response in complying**   
<<e.g. Incident Management Engineers must not share data with third parties.>>

## Step: Information
**Review of common information**

* This section provides a space for defining common information which may be needed through the life of the incident.  
* The target user of this information is the Incident Management Engineer and Operations Engineer.
* The following steps may reference this information to complete an action (for example, execute the "Initial Engagement" plan).

---
**Engagement plans**

Describe the engagement plans applicable to this runbook. This section contains only contact details. Engagement plans will be referenced in the step by step **Communication Plans**. 

* **Initial engagement** 

AWS Incident Detection and Response Team will add customer stakeholder addresses below to the Support Case. AWS Stakeholders are for additional stakeholders that may need to be made aware of any issues.  
When updating customer stakeholders details in this plan also update the Backup Mailto links. 
  * ***Customer Stakeholders***:  customeremail1; customeremail2; etc
  * ***AWS Stakeholders***: aws-idr-oncall@amazon.com; tam-team-email; etc.   
  * ***One Time Only Contacts***: [These are email contacts that are included on only the first communication. Remove these contacts after the first communication has gone out. These could be customer paging email addresses such as pager-duty that must not be paged for every correspondence]
  * ***Backup Mailto Impact Template***: <*Insert Impact Template Mailto Link here*>
    * Use the backup Mailto when communication over cases is not possible.   
  * ***Backup Mailto No Impact Template***: <*Insert No Impact Mailto Link here*>
    * Use the backup Mailto when communication over cases is not possible.  


* **Engagement Escalation**

AWS Incident Detection and Response will reach out to the following contacts when the contacts from the **Initial engagement** plan do not respond to incidents.   
For each Escalation Contact indicate if they must be added to the support case, phoned or both.  
  * ***First Escalation Contact***: [escalationEmailAddress#1] / [PhoneNumber] - Wait XX Minutes before escalating to this contact. 
    * [add Contact to Case / phone] this contact. 
  * ***Second Escalation Contact***: [escalationEmailAddress#2] / [PhoneNumber] - Wait XX Minutes before escalating to this contact. 
    * [add Contact to Case / phone] this contact. 
  * Etc;
---
**Communication plans**

Describe how the Incident Management Engineer communicates with designated stakeholders outside the incident call and communication channels.

* **Impact Communication plan**  
This plan is initiated when Incident Detection and Response has determined from step **Triage** that an alert indicates potential impact to a customer.  
Incident Detection and Response will request the customer to join the predetermined bridge (Chime Bridge/Customer Provided Bridge / Customer Static Bridge) as indicated in **Engagement plans - Incident call setup**.  
All backup email templates for use when cases can't be used are in **Engagement plans - Initial engagement**.  
  * 1 – Before sending the impact notification, verify then remove and/or add customer contacts from the Support Case CC based on the contacts listed in the **Initial engagement** Engagement plan. 
  * 2 – Send the engagement notification to the customer based on the following template:  
    (choose one and remove the rest)  
    ***Impact Template - Chime Bridge***  
    ```
    The following alarm has engaged AWS Incident Detection and Response to an Incident bridge: 
        Alarm Identifier - <insert CloudWatch Alarm ARN or APM Response Identifier>
        Alarm State Change Reason -  <insert state change reason>
        Alarm Start Time - <Example: 1 January 2023, 3:30 PM UTC>
    Please join the Chime Bridge below so we can start the steps outlined in your Runbook: 
        <insert Chime Meeting ID>
        <insert Link to Chime Bridge>
        International dial-in numbers: https://chime.aws/dialinnumbers/ 
    ```
   
    ***Impact Template - Customer Provided Bridge***
    ```
    The following alarm has engaged AWS Incident Detection and Response: 
        Alarm Identifier - <insert CloudWatch Alarm ARN or APM Response Identifier> 
        Alarm State Change Reason - <insert state change reason> 
        Alarm Start Time - <Example: 1 January 2023 3:30 PM UTC>
    Please respond with your internal bridge details so we can join and start the steps outlined in your Runbook. 
    ```  
    ***Impact Template - Customer Static Bridge***  
    ```
    The following alarm has engaged AWS Incident Detection and Response to an Incident bridge: 
        Alarm Identifier - <insert CloudWatch Alarm ARN or APM Response Identifier> 
        Alarm State Change Reason - <insert state change reason> 
        Alarm Start Time - <Example: 1 January 2023, 3:30 PM UTC> 
    Please join the Bridge below so we can start the steps outlined in your Runbook: 
        Conference Number: <insert conference number>
        Conference URL : <insert bridgeURL>
    ```
  * 3 - Set the Case to Pending Customer Action
  * 4 - Follow **Engagement Escalation** plan as mentioned above.
  * 5 - If the customer does not respond within 30 minutes, disengage and continue to monitor until the alarm recovers.
  
* **No Impact Communication plan**  
This plan is initiated when an alarm recovers before Incident Detection and Response has completed initial **Triage**.  
  * 1 - Before sending the no impact notification, verify then remove and/or add customer contacts from the Support Case CC based on the contacts listed in the **Engagement plans - Initial engagement** Engagement plan.
  * 2 - Send a no engagement notification to the customer based on the below template:  
  ***No Impact Template***  
  ```
  AWS Incident Detection and Response received an alarm that has recovered for your workload. 
      Alarm Identifier -  <insert CloudWatch Alarm ARN or APM Response Identifier>
      Alarm State Change Reason -  <insert state change reason>
      Alarm Start Time -  <Example: 1 January 2023, 3:30 PM UTC>
      Alarm End Time -  <Example: 1 January 2023, 3:35 PM UTC>
  This may indicate a brief customer impact that is currently not ongoing. 
  If there is an ongoing impact to your workload, please let us know and we will engage to assist. 
  ```
  * 3 - Put the case into Pending Customer Action. 
  * 4 - If the customer does not respond within 30 minutes, resolve the case. 
  
* **Updates**  
If AWS Incident Detection and Response is expected to provide regular updates to customer stakeholders, list those stakeholders here. Updates must be sent via the same support case.  
Remove this section if not needed. 
  * Update Cadence: Every XX minutes
  * External Update Stakeholders: customeremailaddress1; customeremailaddress2; etc
  * Internal Update Stakeholders: awsemailaddress1; awsemailaddress2; etc

---
**Application architecture overview**

This section provides an overview of the application/workload architecture for Incident Management Engineer and Operations Engineer awareness.

* **AWS Accounts and Regions with key services** - list of AWS accounts with regions supporting this application. Assists Engineers in assessing underlying infrastructure supporting the application.
    * 123456789012 
      * US-EAST-1 - brief desc as appropriate    
        * EC2 - brief desc as appropriate        
        * DynamoDB - brief desc as appropriate                    
        * etc.
      * US-WEST-1 - brief desc as appropriate              
      * etc.
    * another-account-etc.
    
* **Resource identification** - describe how engineers determine resource association with application
    * Resource groups: etc.
    * Tag key/value: AppId=123456

* **CloudWatch Dashboards** - list dashboards relevant to key metrics and services
  * 123456789012 
    * us-east-1 
      * some-dashboard-name
      * etc.
  * some-other-dashboard-name-in-current-acct
  
## Step: Triage
**Evaluate incident and impact**
This section provides instructions for triaging the incident to determine its impact and description, and to confirm that the correct runbook is being executed.

* **Evaluation of initial incident information**
  * 1 - Review Incident Alarm, noting time of first detected impact as well as the alarm start time. 
  * 2 - Identify which services in the customer application are seeing impact. 
  * 3 - Review AWS Service Health for services listed under **AWS Accounts and Regions with key services**.
  * 4 - Review any customer provided dashboards listed under **CloudWatch Dashboards**

---
* **Impact**  
Impact is determined when the customer's metrics do not recover or appear to be trending worse, or when there is an indication of AWS service impact. 
  * 1 – Start **Communication plans - Impact Communication plan**
  * 2 - Start **Engagement plans - Engagement Escalation** if no response is received from the **Initial Engagement** contacts.
  * 3 - Start **Communication plans - Updates** if specified in **Communication plans**

* **No Impact**  
No Impact is determined when the customer's alarm recovers before Triage is complete and there are no indications of AWS service impact or sustained impact on the customer's CloudWatch Dashboards. 
  * 1 - Start **Communication plans - No Impact Communication plan**
  
## Step: Investigate
**Investigation**  

  This section describes performing investigation of known and unknown symptoms.

**Known issue**
  * *List all known issues with the application and their standard actions here*

**Unknown issues** 
  * Investigate with the customer and AWS Premium Support.
  * Escalate internally as required.  
  
## Step: Mitigation
**Collaborate**  
* Communicate any changes or important information from the **Investigate** step to the members of the incident call.

**Implement mitigation**  
* *List customer failover plans, Disaster Recovery plans, and so on here for implementing mitigation.*
      
## Step: Recovery
**Monitor customer impact**
* Review metrics to confirm recovery.
* Ensure recovery is across all Availability Zones / Regions / Services
* Get confirmation from the customer that impact is over and the application has recovered.

**Identify action items**  
* Record key decisions and actions taken, including temporary mitigation that might have been implemented.
* Ensure outstanding action items have assigned owners.
* Close out any Communication plans that were opened during the incident with a final confirmation of recovery notification.
```

# Test onboarded workloads in Incident Detection and Response
<a name="idr-workloads-testing"></a>

**Note**  
The AWS Identity and Access Management user or role that you use for alarm testing must have `cloudwatch:SetAlarmState` permission.

The last step in the onboarding process is to perform a gameday for your new workload. After alarm ingestion completes, AWS Incident Detection and Response confirms a date and time of your choosing to start your gameday.

Your gameday serves two main purposes:
+ **Functional Validation:** Confirms that AWS Incident Detection and Response can correctly receive your alarm events, and that those events trigger the appropriate runbooks and any other desired actions, such as automatic case creation if you selected it during alarm ingestion.
+ **Simulation:** The gameday is an end-to-end simulation of what might happen during a real incident. AWS Incident Detection and Response follows your prescribed runbook steps to give you insight into how a real incident might unfold. The gameday is an opportunity for you to ask questions or refine instructions to improve the engagement.

During the alarm test, AWS Incident Detection and Response works with you to remediate any issues identified.

## CloudWatch alarms
<a name="idr-workloads-testing-cw"></a>

AWS Incident Detection and Response tests your Amazon CloudWatch alarms by monitoring the state change of each alarm. To trigger the state change, manually set the alarm to the `ALARM` state using the AWS Command Line Interface (AWS CLI). You can also access the AWS CLI from AWS CloudShell. AWS Incident Detection and Response provides you with a list of AWS CLI commands to use during testing.

To prevent unwanted actions, for example Amazon EC2 instance restarts, disable any CloudWatch alarm actions before you change the alarm state. You can re-enable CloudWatch alarm actions after the testing completes. To learn more about disabling or enabling alarm actions, see [DisableAlarmActions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DisableAlarmActions.html) and [EnableAlarmActions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_EnableAlarmActions.html) in the *Amazon CloudWatch API Reference*.

Example AWS CLI command to set an alarm state:

```
aws cloudwatch set-alarm-state --alarm-name "ExampleAlarm" --state-value ALARM --state-reason "Testing AWS Incident Detection and Response" --region us-east-1
```
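
The full test sequence described above (disable alarm actions, force the `ALARM` state, then re-enable alarm actions) can be sketched as a small Python helper that only assembles the AWS CLI commands and prints them for review. This is an illustrative sketch, not the command list that AWS Incident Detection and Response provides for your gameday; the function name `gameday_commands` and the sample alarm name are assumptions.

```python
import shlex


def gameday_commands(alarm_name: str, region: str) -> list[list[str]]:
    """Return AWS CLI commands (as argv lists) for one alarm test, in order."""
    base = ["aws", "cloudwatch"]
    target = ["--region", region]
    return [
        # 1. Stop alarm actions (for example, EC2 restarts) from firing.
        base + ["disable-alarm-actions", "--alarm-names", alarm_name] + target,
        # 2. Force the alarm into the ALARM state for the test.
        base + [
            "set-alarm-state",
            "--alarm-name", alarm_name,
            "--state-value", "ALARM",
            "--state-reason", "Testing AWS Incident Detection and Response",
        ] + target,
        # 3. Re-enable alarm actions after the test completes.
        base + ["enable-alarm-actions", "--alarm-names", alarm_name] + target,
    ]


if __name__ == "__main__":
    for argv in gameday_commands("ExampleAlarm", "us-east-1"):
        # shlex.join quotes arguments that contain spaces.
        print(shlex.join(argv))
```

The helper does not call AWS. It prints the commands so that you can review them and run each one at the agreed point in the gameday.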

To learn more about manually changing the state of CloudWatch alarms, see [SetAlarmState](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_SetAlarmState.html).

To learn more about the permissions required for CloudWatch API operations, see [Amazon CloudWatch permissions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/permissions-reference-cw.html).
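
As a minimal sketch of the `cloudwatch:SetAlarmState` permission mentioned at the start of this topic, the following Python snippet builds an IAM policy document scoped to specific alarms and prints it as JSON. The helper name `alarm_test_policy`, the statement ID, the account ID, and the alarm name are placeholders for illustration; confirm the final policy against the permissions reference before you attach it.

```python
import json


def alarm_test_policy(alarm_arns: list[str]) -> dict:
    """Build an IAM policy document that allows SetAlarmState on given alarms."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAlarmStateTesting",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["cloudwatch:SetAlarmState"],
                "Resource": alarm_arns,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and alarm name; replace with your own.
    arn = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ExampleAlarm"
    print(json.dumps(alarm_test_policy([arn]), indent=2))
```

If the same user or role also disables and re-enables alarm actions during the test, it likely needs the corresponding `cloudwatch:DisableAlarmActions` and `cloudwatch:EnableAlarmActions` permissions as well.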

## Third party APM alarms
<a name="idr-workloads-testing-third-party-alarms"></a>

Workloads that use a third party Application Performance Monitoring (APM) tool, such as Datadog, Splunk, New Relic, or Dynatrace, require different instructions to simulate an alarm. At the start of the gameday, AWS Incident Detection and Response requests that you temporarily change your alarm thresholds or comparison operators to force the alarm into the **ALARM** status. This status triggers a payload to AWS Incident Detection and Response.

## Key outputs
<a name="idr-workloads-testing-key-outputs"></a>

Key outputs:
+ Alarm ingestion is successful and your alarm configuration is correct.
+ Alarms are successfully created and received by AWS Incident Detection and Response.
+ A support case is created for your engagement and your prescribed contacts are notified.
+ AWS Incident Detection and Response can engage with you by your prescribed conference means.
+ All alarms and support cases generated as part of the gameday are resolved.
+ A Go-Live email is sent confirming your workload is now being monitored by AWS Incident Detection and Response.

# Request changes to an onboarded workload in Incident Detection and Response
<a name="idr-workloads-change-request"></a>

To request changes to an onboarded workload, complete the following steps to create a support case with AWS Incident Detection and Response.

1. Go to the [AWS Support Center](https://console.aws.amazon.com/support/home#/), and then select **Create case**, as shown in the following example:  
![\[AWS Support Center example.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/workload-change-request1.png)

1. Choose **Technical**.

1. For **Service**, choose **Incident Detection and Response**.

1. For **Category**, choose **Workload change request**.

1. For **Severity**, choose **General Guidance**.

1. Enter a **Subject** for this change. For example:

   AWS Incident Detection and Response - *workload\_name* 

1. Enter a **Description** for this change. For example, enter "This request is for changes to an existing workload onboarded into AWS Incident Detection and Response". Make sure that you include the following information in your request:
   + **Workload name:** Your workload name.
   + **Account ID(s):** ID1, ID2, ID3, and so on.
   + **Change details:** Enter the details for your requested change.

1. In the **Additional contacts - optional** section, enter any email IDs that you want to receive correspondence about this change.

   The following is an example of the **Additional contacts - optional** section.  
![\[Enter contacts in the highlighted Additional contacts - optional section.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/workload-change-request2.png)
**Important**  
Failure to add email IDs in the **Additional contacts - optional** section might delay the change process.

1. Choose **Submit**.

   After you submit the change request, you can add additional emails from your organization. To add emails, choose **Reply** in **Case details**, as shown in the following example:  
![\[The Details page showing the Reply button highlighted.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/workload-change-request3.png)

   Then, add the email IDs in the **Additional contacts - optional** section.

   The following is an example of the **Reply** page showing where you can enter additional emails.  
![\[The Reply page where you can add additional emails.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/workload-change-request4.png)

# Suppress alarms from engaging Incident Detection and Response
<a name="idr-workloads-suppress-alarms"></a>

You can control when your onboarded workload alarms engage AWS Incident Detection and Response by suppressing them temporarily or on a schedule. For example, you might temporarily suppress workload alarms during planned maintenance to prevent the alarms from engaging Incident Detection and Response. Or, you might suppress alarms on a schedule if you have daily reboot activity. You can suppress alarms at the alarm source, such as Amazon CloudWatch, or you can submit a workload change request.

**Topics**
+ [Suppress alarms at the alarm source](suppress-alarms-at-source.md)
+ [Submit a workload change request to suppress alarms](suppress-alarms-at-source-wcr.md)
+ [Tutorial: Use a metric math function to suppress an alarm](suppress-alarms-tutorial-suppress.md)
+ [Tutorial: Remove a metric math function to un-suppress an alarm](suppress-alarms-tutorial-unsuppress.md)

# Suppress alarms at the alarm source
<a name="suppress-alarms-at-source"></a>

Specify which alarms engage with Incident Detection and Response and when they do so by suppressing alarms at the alarm source.

**Topics**
+ [Use a metric math function to suppress a CloudWatch alarm](#suppress-alarms-at-source-cw)
+ [Remove a metric math function to un-suppress a CloudWatch alarm](#suppress-alarms-metric-math-unsuppress)
+ [Example metric math functions and associated use cases](#suppress-alarms-example-functions)
+ [Suppress alarms from a third party APM](#suppress-alarms-third-party-apm)

## Use a metric math function to suppress a CloudWatch alarm
<a name="suppress-alarms-at-source-cw"></a>

To suppress Incident Detection and Response monitoring of Amazon CloudWatch alarms, use a [metric math function](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) to stop CloudWatch alarms from entering the `ALARM` state during a designated window.

**Note**  
Disabling **Alarm actions** on a CloudWatch alarm doesn’t suppress monitoring of your alarms by Incident Detection and Response. Alarm state changes are ingested through Amazon EventBridge, not through CloudWatch alarm actions.

To use a metric math function to suppress a CloudWatch alarm, complete the following steps:

1. Sign in to the AWS Management Console and open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Alarms**, and then locate the alarm that you want to add the metric math function to.

1. Choose **Actions**, then select **Edit** to change the alarm.

1. Choose **Edit metric** to modify the metric for the alarm.

1. Choose **Add math**, **Start with empty expression**.

1. Enter your math expression, then choose **Apply**.

1. Deselect the existing metric that the alarm monitored.

1. Select the expression that you just created, and then choose **Select metric**.

1. Choose **Skip to Preview and create**.

1. Review your changes to make sure that your metric math function is applied as expected, and then choose **Update alarm**.

For a step-by-step example of suppressing a CloudWatch alarm with a metric math function, see [Tutorial: Use a metric math function to suppress an alarm](suppress-alarms-tutorial-suppress.md).

For more information on syntax and available functions, see [Metric math syntax and functions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#metric-math-syntax) in the *Amazon CloudWatch User Guide*. 
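
The console steps above leave the alarm watching an expression (`e1`) that wraps the original metric (`m1`). As a rough sketch of the same arrangement in API terms, the following Python snippet builds a `Metrics` list in the shape accepted by the CloudWatch `PutMetricAlarm` operation. It makes no AWS calls; the helper name `suppressed_metrics` and the sample namespace, dimensions, period, and statistic are assumptions to replace with your alarm's existing definition.

```python
def suppressed_metrics(metric_stat: dict, expression: str, label: str) -> list[dict]:
    """Build the Metrics list for an alarm that watches expression e1 over m1."""
    return [
        # m1: the original metric. It feeds the expression but is not alarmed on.
        {"Id": "m1", "MetricStat": metric_stat, "ReturnData": False},
        # e1: the metric math expression that the alarm evaluates.
        {"Id": "e1", "Expression": expression, "Label": label, "ReturnData": True},
    ]


if __name__ == "__main__":
    # Hypothetical metric definition; copy the real values from your alarm.
    original = {
        "Metric": {
            "Namespace": "AWS/ApplicationELB",
            "MetricName": "UnHealthyHostCount",
            "Dimensions": [
                {"Name": "TargetGroup", "Value": "targetgroup/example/0123456789abcdef"},
                {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"},
            ],
        },
        "Period": 60,
        "Stat": "Maximum",
    }
    metrics = suppressed_metrics(
        original,
        "IF((DAY(m1) == 2 && HOUR(m1) >= 1 && HOUR(m1) < 3), 0, m1)",
        "UnHealthyHostCount (suppressed Tue 01:00-03:00 UTC)",
    )
    for entry in metrics:
        print(entry["Id"], "ReturnData =", entry["ReturnData"])
```

Only one entry has `ReturnData` set to true, because an alarm evaluates a single time series. Here that series is the expression rather than the underlying metric.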

## Remove a metric math function to un-suppress a CloudWatch alarm
<a name="suppress-alarms-metric-math-unsuppress"></a>

Un-suppress a CloudWatch alarm by removing the metric math function. To remove a metric math function from an alarm, complete the following steps:

1. Sign in to the AWS Management Console and open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Alarms**, and then locate the alarm or alarms that you want to remove the metric math expression from.

1. In the metric math section, choose **Edit**.

1. To remove the metric from the alarm, choose **Edit** on the metric, and then choose the **x** button next to the metric math expression.

1. Select the original metric, then choose **Select metric**.

1. Choose **Skip to Preview and create**. 

1. Review your changes to make sure that the metric math function is removed and the alarm monitors the original metric, then choose **Update alarm**.
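
To double-check the result outside the console, one option is to inspect the alarm definition returned by the `describe-alarms` AWS CLI command. The following Python sketch applies a simple heuristic to one entry shaped like an item in that command's `MetricAlarms` output: it lists any `IF(...)` expressions still attached to the alarm. The function name `suppression_expressions` and the sample entries are hypothetical, and an `IF` expression can exist for reasons other than suppression, so treat a match as a prompt to look closer rather than as proof.

```python
def suppression_expressions(alarm: dict) -> list[str]:
    """Return IF(...) metric math expressions found on a describe-alarms entry."""
    return [
        query["Expression"]
        for query in alarm.get("Metrics", [])
        if query.get("Expression", "").lstrip().upper().startswith("IF(")
    ]


if __name__ == "__main__":
    # Hand-written sample entries shaped like items in "MetricAlarms".
    suppressed = {
        "AlarmName": "ExampleAlarm",
        "Metrics": [
            {"Id": "m1", "ReturnData": False},
            {
                "Id": "e1",
                "Expression": "IF((HOUR(m1) >= 11 && HOUR(m1) < 13), 0, m1)",
                "ReturnData": True,
            },
        ],
    }
    restored = {"AlarmName": "ExampleAlarm", "MetricName": "UnHealthyHostCount"}
    print(suppression_expressions(suppressed))  # one expression remains
    print(suppression_expressions(restored))    # empty list: nothing to remove
```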

## Example metric math functions and associated use cases
<a name="suppress-alarms-example-functions"></a>

The following table contains metric math function examples, along with associated use cases and an explanation of each metric component.


| Metric math function | Use case | Explanation | 
| --- | --- | --- | 
|  `IF((DAY(m1) == 2 && HOUR(m1) >= 1 && HOUR(m1) < 3), 0, m1)`  |  Suppress the alarm from 1:00 AM to 3:00 AM UTC every Tuesday by replacing real data points with 0 during this window.  |  `DAY(m1) == 2` matches Tuesday (Monday = 1, Sunday = 7). `HOUR(m1) >= 1 && HOUR(m1) < 3` matches 1:00 AM to 3:00 AM UTC. When both conditions are true, `IF` returns 0. Otherwise, it returns the original value (`m1`).  | 
|  `IF((HOUR(m1) >= 23 \|\| HOUR(m1) < 4), 0, m1)`  |  Suppress the alarm from 11:00 PM to 4:00 AM UTC daily by replacing real data points with 0 during this window.  |  `HOUR(m1) >= 23 \|\| HOUR(m1) < 4` matches 11:00 PM to 4:00 AM UTC. The OR operator is used because the window spans midnight. When the condition is true, `IF` returns 0. Otherwise, it returns the original value (`m1`).  | 
|  `IF((HOUR(m1) >= 11 && HOUR(m1) < 13), 0, m1)`  |  Suppress the alarm from 11:00 AM to 1:00 PM UTC daily by replacing real data points with 0 during this window.  |  `HOUR(m1) >= 11 && HOUR(m1) < 13` matches 11:00 AM to 1:00 PM UTC. When the condition is true, `IF` returns 0. Otherwise, it returns the original value (`m1`).  | 
|  `IF((DAY(m1) == 2 && HOUR(m1) >= 1 && HOUR(m1) < 3), 99, m1)`  |  Suppress the alarm from 1:00 AM to 3:00 AM UTC every Tuesday by replacing real data points with 99 during this window.  |  Same conditions as the first row, but `IF` returns 99 instead of 0. A high replacement value suits alarms that breach when the metric falls below the threshold.  | 
|  `IF((HOUR(m1) >= 23 \|\| HOUR(m1) < 4), 100, m1)`  |  Suppress the alarm from 11:00 PM to 4:00 AM UTC daily by replacing real data points with 100 during this window.  |  Same condition as the second row, but `IF` returns 100 instead of 0. A high replacement value suits alarms that breach when the metric falls below the threshold.  | 
|  `IF((HOUR(m1) >= 11 && HOUR(m1) < 13), 99, m1)`  |  Suppress the alarm from 11:00 AM to 1:00 PM UTC daily by replacing real data points with 99 during this window.  |  Same condition as the third row, but `IF` returns 99 instead of 0. A high replacement value suits alarms that breach when the metric falls below the threshold.  | 
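
To sanity-check a window before you apply it to an alarm, it can help to replay the same logic locally. The following Python sketch mirrors the `DAY`, `HOUR`, and `IF` behavior used in the table, including the wrap-around case where the window spans midnight. It is an approximation for reasoning about the expressions, not CloudWatch itself; the function names `in_window` and `suppress` are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def in_window(ts: datetime, start_hour: int, end_hour: int,
              day: Optional[int] = None) -> bool:
    """Mirror the table's conditions: DAY(m1) == day and an hour range in UTC."""
    ts = ts.astimezone(timezone.utc)
    if day is not None and ts.isoweekday() != day:  # Monday = 1, Sunday = 7
        return False
    if start_hour < end_hour:
        # Same-day window, such as HOUR >= 11 && HOUR < 13.
        return start_hour <= ts.hour < end_hour
    # Window that wraps past midnight, such as HOUR >= 23 || HOUR < 4.
    return ts.hour >= start_hour or ts.hour < end_hour


def suppress(value: float, ts: datetime, replacement: float, **window) -> float:
    """Python analogue of IF(condition, replacement, m1)."""
    return replacement if in_window(ts, **window) else value


if __name__ == "__main__":
    tuesday_2am = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)
    tuesday_3am = tuesday_2am.replace(hour=3)
    tuesday_11pm = tuesday_2am.replace(hour=23)
    # Inside the Tuesday 01:00-03:00 window: the value is replaced.
    print(suppress(5, tuesday_2am, 0, start_hour=1, end_hour=3, day=2))
    # 03:00 is excluded by HOUR(m1) < 3: the real value passes through.
    print(suppress(5, tuesday_3am, 0, start_hour=1, end_hour=3, day=2))
    # Daily 23:00-04:00 window with a high replacement for a success-rate alarm.
    print(suppress(92.0, tuesday_11pm, 100, start_hour=23, end_hour=4))
```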

## Suppress alarms from a third party APM
<a name="suppress-alarms-third-party-apm"></a>

Refer to your third party APM vendor’s documentation for instructions on how to suppress alarms. Examples of third party APM vendors are New Relic, Splunk, Dynatrace, Datadog, and SumoLogic.

# Submit a workload change request to suppress alarms
<a name="suppress-alarms-at-source-wcr"></a>

If you can’t suppress alarms at the source as described in the previous section, then submit a Workload Change Request to instruct Incident Detection and Response to manually suppress monitoring of some or all of your workload’s alarms.

For detailed instructions on how to create a Workload Change Request, see [Request changes to an onboarded workload in Incident Detection and Response](https://docs.aws.amazon.com/IDR/latest/userguide/idr-workloads-change-request.html). When raising a Workload Change Request to request suppression of your alarms, make sure that you provide the following required information:
+ **Workload name:** Your workload name.
+ **Account ID(s):** ID1, ID2, ID3, and so on.
+ **Change details:** Alarm Suppression
+ **Suppression start time:** Date, time, and time zone.
+ **Suppression end time:** Date, time, and time zone.
+ **Alarms to suppress:** A list of CloudWatch alarm ARNs or third party APM event identifiers to suppress.

After you create the alarm suppression Workload Change Request, you receive the following notifications from Incident Detection and Response:
+ Acknowledgement of your Workload Change Request.
+ Notification when alarms are suppressed.
+ Notification when alarms are re-enabled for monitoring.

# Tutorial: Use a metric math function to suppress an alarm
<a name="suppress-alarms-tutorial-suppress"></a>

The following tutorial walks you through how to suppress a CloudWatch alarm using metric math.

**Example scenario**

There's a planned activity that takes place between 1:00 and 3:00 AM UTC on the upcoming Tuesday. You want to create a CloudWatch metric math function that replaces the real data points during this time with 0 (a data point that falls below the set threshold). 

1. Assess the criteria that cause your alarm to trigger. The following screenshot provides an example of alarm criteria:  
![\[CloudWatch screen showing alarm details.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-assess-alarm-criteria.png)

   The alarm shown in the preceding screenshot monitors the `UnHealthyHostCount` metric for an Application Load Balancer target group. This alarm enters the `ALARM` state when the `UnHealthyHostCount` metric is greater than or equal to 3 for 5 out of 5 data points. The alarm treats missing data as bad (breaching the configured threshold).

1. Create the metric math function.

   In this example, the planned activity takes place between 1:00 and 3:00 AM UTC on the upcoming Tuesday. So, create a CloudWatch metric math function that replaces the real data points during this time with 0 (a data point that falls below the set threshold). 

   Note that the replacement value that you configure depends on your alarm configuration. For example, if you have an alarm that monitors HTTP success rate with a threshold of less than 98, then replace your real data points during the planned activity with a value above the configured threshold, such as 100. The following is the metric math function for the example scenario in this tutorial, which uses 0 as the replacement value.

   ```
   IF((DAY(m1) == 2 && HOUR(m1) >= 1 && HOUR(m1) < 3), 0, m1)
   ```

   The preceding metric math function contains the following elements:
   + **DAY(m1) == 2**: Ensures that it's Tuesday (Monday = 1, Sunday = 7).
   + **HOUR(m1) >= 1 && HOUR(m1) < 3**: Specifies the time range from 1 AM to 3 AM UTC.
   + **IF(condition, value\_if\_true, value\_if\_false)**: If the conditions are true, the function replaces the metric value with 0. Otherwise, the original value (m1) is returned.

   For additional information on syntax and available functions, see [Metric math syntax and functions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#metric-math-syntax) in the *Amazon CloudWatch User Guide*.

1. Sign in to the AWS Management Console and open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Alarms**, and then locate the alarm that you want to add the metric math function to.

1. In the metric math section, choose **Edit**.

1. Choose **Add math**, **Start with empty expression**.

1. Enter your math expression, and then choose **Apply**.

   The existing metric that the alarm monitors automatically becomes **m1** and your math expression is **e1**, as shown in the following example:  
![\[CloudWatch screen showing metric math expressions.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-expression.png)

1. (Optional) Edit the label of the metric math expression to help others understand its function and why it was created, as shown in the following example:  
![\[CloudWatch screen showing editing of a metric math expression label.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-edit-label.png)

1. Deselect **m1**, select **e1**, and then choose **Select metric**. This sets the alarm to monitor the math expression instead of the underlying metric directly.

1. Choose **Skip to Preview and create**.

1. Validate that the alarm is configured as expected, and then choose **Update alarm** to save the change.

In the preceding example, without the metric math function applied, the real `UnHealthyHostCount` metric would have been reported during the planned activity. This would have resulted in the CloudWatch alarm entering the `ALARM` state and engaging Incident Detection and Response, as shown in the following example:

![\[CloudWatch screen showing data points leading to an alarm state.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-example-alarm-state.png)


With the metric math function in place, the real data points are replaced with 0 during the activity, and the alarm remains in the `OK` state, suppressing Incident Detection and Response engagement. 

![\[CloudWatch screen showing data points with no alarm state.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-datapoints-no-alarm.png)
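
To check which data points an expression like the one in this tutorial would replace before you attach it to an alarm, you can mirror its logic locally. The following is a minimal sketch in Python, not part of CloudWatch. The function name `suppressed_value` and the sample data are hypothetical, and the sketch assumes that `DAY()` and `HOUR()` are evaluated in UTC with Monday = 1, as described earlier in this tutorial.

```python
from datetime import datetime, timedelta, timezone


def suppressed_value(timestamp, value, day=2, start_hour=1, end_hour=3, replacement=0):
    """Mirror IF((DAY(m1) == day && HOUR(m1) >= start_hour && HOUR(m1) < end_hour), replacement, m1)."""
    ts = timestamp.astimezone(timezone.utc)
    # isoweekday() numbers Monday as 1 and Sunday as 7, the same numbering as DAY().
    in_window = ts.isoweekday() == day and start_hour <= ts.hour < end_hour
    return replacement if in_window else value


# Hypothetical data: 4 unhealthy hosts reported every 30 minutes,
# starting at 00:00 UTC on Tuesday, 2 January 2024.
start = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
samples = [(start + timedelta(minutes=30 * i), 4) for i in range(8)]
results = [(ts.strftime("%H:%M"), suppressed_value(ts, v)) for ts, v in samples]

for label, value in results:
    print(label, value)
```

In this sketch, the samples from 01:00 through 02:30 are replaced with 0, and the samples before 01:00 and from 03:00 onward keep their real value. For an alarm that breaches when a metric drops below a threshold, pass a `replacement` above the threshold instead, such as 100 for the HTTP success rate example.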


# Tutorial: Remove a metric math function to un-suppress an alarm
<a name="suppress-alarms-tutorial-unsuppress"></a>

If you suppress a CloudWatch alarm for a one-time activity, remove the metric math function from the alarm after the activity completes to resume regular monitoring. If you suppress the alarm on a regular schedule (for example, for a weekly patching routine that reboots instances on the same day and time each week), leave the metric math function in place.

The following tutorial walks you through how to remove a metric math function to un-suppress a CloudWatch alarm.

1. Sign in to the AWS Management Console and open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Alarms**, and then locate the alarm that you want to remove the metric math function from.

1. In the metric math section, choose **Edit**.

1. To remove the suppression from the alarm, select the **x** button next to the metric math expression.  
![\[CloudWatch screen showing the x button to remove a metric math function.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-unsuppress.png)

1. Select the metric to resume monitoring of the real metric, and then choose **Select metric**.  
![\[CloudWatch screen showing the Select metric button.\]](http://docs.aws.amazon.com/IDR/latest/userguide/images/metric-math-unsuppress-2.png)

1. Choose **Skip to Preview and create**.

1. Validate that the alarm is configured as expected, and then choose **Update alarm** to save the change.
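
If you manage alarms programmatically instead of through the console, you can approximate both suppression tutorials with the CloudWatch `PutMetricAlarm` API. The following is a minimal sketch that uses the AWS SDK for Python (Boto3), not an official procedure. The helper names, alarm name, dimension values, label, period, and statistic are assumptions for illustration. The threshold, evaluation settings, and missing data treatment follow the example alarm described in the suppression tutorial.

```python
SUPPRESS_EXPRESSION = "IF((DAY(m1) == 2 && HOUR(m1) >= 1 && HOUR(m1) < 3), 0, m1)"


def build_alarm_params(alarm_name, dimensions, expression=None, alarm_actions=()):
    """Return PutMetricAlarm arguments; pass expression=None to monitor m1 directly."""
    queries = [{
        "Id": "m1",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApplicationELB",
                "MetricName": "UnHealthyHostCount",
                "Dimensions": dimensions,
            },
            "Period": 60,       # assumption: not stated in the tutorial
            "Stat": "Maximum",  # assumption: not stated in the tutorial
        },
        # The alarm watches whichever query has ReturnData set to True.
        "ReturnData": expression is None,
    }]
    if expression is not None:
        queries.append({
            "Id": "e1",
            "Expression": expression,
            "Label": "Suppressed during planned activity",  # hypothetical label
            "ReturnData": True,
        })
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 3,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "TreatMissingData": "breaching",
        "AlarmActions": list(alarm_actions),
        "Metrics": queries,
    }


def apply_alarm(params):
    """Send the configuration to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above runs without the SDK installed
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical alarm name and dimension values, for illustration only.
dims = [
    {"Name": "TargetGroup", "Value": "targetgroup/example/0123456789abcdef"},
    {"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"},
]
suppressed = build_alarm_params("example-unhealthy-hosts", dims, SUPPRESS_EXPRESSION)
restored = build_alarm_params("example-unhealthy-hosts", dims)
```

Calling `apply_alarm(suppressed)` corresponds to adding the metric math function, and `apply_alarm(restored)` corresponds to removing it. Because `PutMetricAlarm` overwrites the existing alarm configuration, include every setting that your alarm already uses (such as alarm actions and tags) before you apply either configuration.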

# Offboard a workload from Incident Detection and Response
<a name="idr-workloads-offboard"></a>

To offboard a workload from AWS Incident Detection and Response, create a new support case for each workload. When you create the support case, keep the following in mind:
+ To offboard a workload that's in a single AWS account, create the support case either from the workload's account or from your payer account.
+ To offboard a workload that spans multiple AWS accounts, create the support case from your **payer account**. In the body of the support case, list all account IDs to offboard.

**Important**  
If you create a support case to offboard a workload from the incorrect account, you might experience delays and requests for additional information before your workloads can be offboarded.

**Request to offboard a workload**

1. Go to the [AWS Support Center](https://console.aws.amazon.com/support/home#/), and then select **Create case**.

1. Choose **Technical**.

1. For **Service**, choose **Incident Detection and Response**.

1. For **Category**, choose **Workload Offboarding**.

1. For **Severity**, choose **General Guidance**.

1. Enter a **Subject** for this change. For example:

   [Offboard] AWS Incident Detection and Response - *workload\_name*

1. Enter a **Description** for this change. For example, enter "This request is for offboarding an existing workload onboarded into AWS Incident Detection and Response". Make sure that you include the following information in your request:
   + **Workload name:** Your workload name.
   + **Account ID(s):** ID1, ID2, ID3, and so on.
   + **Reason for offboarding:** Provide a reason for offboarding the workload.

1. In the **Additional contacts - optional** section, enter any email addresses that you want to receive correspondence about this offboarding request.

1. Choose **Submit**.
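
If you offboard several workloads, a small script can keep the subject and description of each support case consistent. The following is a minimal sketch in Python that only formats the text to paste into the **Subject** and **Description** fields. It does not create a support case. The function name `build_offboard_case` and the sample values are hypothetical, and the check assumes standard 12-digit AWS account IDs.

```python
import re


def build_offboard_case(workload_name, account_ids, reason):
    """Return (subject, description) text for an offboarding support case."""
    invalid = [a for a in account_ids if not re.fullmatch(r"\d{12}", a)]
    if invalid:
        raise ValueError(f"AWS account IDs must be 12 digits: {invalid}")
    subject = f"[Offboard] AWS Incident Detection and Response - {workload_name}"
    description = "\n".join([
        "This request is for offboarding an existing workload onboarded into "
        "AWS Incident Detection and Response.",
        f"Workload name: {workload_name}",
        f"Account ID(s): {', '.join(account_ids)}",
        f"Reason for offboarding: {reason}",
    ])
    return subject, description


# Hypothetical workload, account IDs, and reason.
subject, description = build_offboard_case(
    "example-workload", ["111122223333", "444455556666"], "Workload decommissioned"
)
print(subject)
print(description)
```

Create one case for each workload, as described at the start of this section, and call the helper once per workload.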