AWSSupport-TroubleshootCloudWatchAlarm
Description
The AWSSupport-TroubleshootCloudWatchAlarm runbook helps identify and
troubleshoot issues with misconfigured or problematic Amazon CloudWatch (CloudWatch) alarms. It uses
public AWS APIs and known alarm evaluation logic to detect delayed or missing datapoints
in the monitored metrics, which can lead to missed or delayed alarm actions. The runbook
provides a structured approach to investigating and resolving CloudWatch alarm
problems.
How does it work?
The runbook AWSSupport-TroubleshootCloudWatchAlarm performs the following
steps:
- Verifies the CloudWatch alarm details and checks that the value of the AlarmTriggerTimestamp parameter is within the last 2,592,000 seconds (30 days).
- Checks whether the alarm is based on a metric, on metric math, or is an Anomaly Detector alarm.
- Checks whether the alarm is in the INSUFFICIENT_DATA state.
- Checks whether the metric(s) used in the alarm match the metrics returned by ListMetrics.
- Verifies whether a metric was missing datapoint(s) at the given timestamp.
- Gets the most recent alarm history for the given timestamp.
- Checks whether the alarm did not trigger because of delayed or missing metric data.
- Checks whether the alarm's enabled action(s) were delivered.
- Generates a troubleshooting report combining all diagnostic results.
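The 30-day window in the first step can be checked locally before running the runbook. The following is a minimal sketch, not the runbook's own code: the function name `is_within_lookback` is hypothetical, and treating the boundary as inclusive and rejecting future timestamps are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

# 30 days expressed in seconds, the window stated in the first step.
MAX_AGE_SECONDS = 2_592_000


def is_within_lookback(timestamp: str, now: Optional[datetime] = None) -> bool:
    """Return True if a YYYY-MM-DDTHH:mm:ssZ timestamp lies in the last 30 days.

    Assumption: timestamps in the future are rejected, and the 30-day
    boundary itself is accepted.
    """
    now = now or datetime.now(timezone.utc)
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    age_seconds = (now - parsed).total_seconds()
    return 0 <= age_seconds <= MAX_AGE_SECONDS


if __name__ == "__main__":
    # Fixed reference time so the example is reproducible.
    reference = datetime(2024, 11, 1, tzinfo=timezone.utc)
    print(is_within_lookback("2024-10-29T09:04:00Z", reference))
```

Passing `now` explicitly keeps the check deterministic, which is convenient when testing against a known alarm incident time.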
Document type
Automation
Owner
Amazon
Platforms
Linux, macOS, Windows
Parameters
Required IAM permissions
The AutomationAssumeRole parameter requires the following actions to
use the runbook successfully.
- cloudwatch:DescribeAlarms
- cloudwatch:DescribeAlarmHistory
- cloudwatch:DescribeAnomalyDetectors
- cloudwatch:GetMetricData
- cloudwatch:GetMetricStatistics
- cloudwatch:ListMetrics
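As an illustration, the actions listed above could be collected into an IAM policy document like the one built below. This is a sketch under assumptions: the helper name `build_policy` is hypothetical, and `"Resource": "*"` is a simplification that you may want to narrow. The trust policy that lets Systems Manager assume the role is not covered here.

```python
import json

# The six actions listed in this section.
REQUIRED_ACTIONS = [
    "cloudwatch:DescribeAlarms",
    "cloudwatch:DescribeAlarmHistory",
    "cloudwatch:DescribeAnomalyDetectors",
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
    "cloudwatch:ListMetrics",
]


def build_policy(actions=REQUIRED_ACTIONS) -> dict:
    """Return an IAM policy document that allows the given actions.

    Assumption: all actions are granted on every resource ("*").
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to paste into the IAM console.
    print(json.dumps(build_policy(), indent=2))
```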
Instructions
Follow these steps to configure the automation:
- Navigate to AWSSupport-TroubleshootCloudWatchAlarm in Systems Manager under Documents.
- Select Execute automation.
- For the input parameters, enter the following:
  - AutomationAssumeRole (Optional):
    - Type: String
    - Description: (Optional) The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation uses the permissions of the user who starts this runbook.
  - CloudWatchMetricAlarmName (Required):
    - Type: String
    - Description: (Required) The name of the CloudWatch metric alarm to troubleshoot.
    - Allowed Pattern: ^[a-zA-Z0-9.:;,\\-_&() ]{1,255}$
  - AlarmTriggerTimestamp (Required):
    - Type: String
    - Description: (Required) The UTC timestamp at which the alarm issue occurred. The runbook uses it to look up the metric datapoints and alarm history relevant to the issue. The value must fall within the last 30 days and use the format YYYY-MM-DDTHH:mm:ssZ. Example: 2024-10-29T09:04:00Z
    - Allowed Pattern: ^(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2})Z$
- Select Execute.
- The automation initiates.
- The document performs the following steps:
  - VerifyRunbookInputs: Verifies the CloudWatch alarm details and checks that the value of the AlarmTriggerTimestamp parameter is within the last 2,592,000 seconds (30 days).
  - UpdateSSMDocumentInputChecksVariable: Updates the variable SSMDocumentInputChecks with the value SSMDocumentInputChecks from the VerifyRunbookInputs step.
  - BranchOnAlarmIsVerified: Branches on the result of verifying the runbook inputs AlarmTriggerTimestamp and CloudWatchAlarmName.
  - CheckMetricAlarmType: Checks whether the alarm is based on a metric, on metric math, or is an Anomaly Detector alarm.
  - CheckAlarmInInsufficientDataState: Checks whether the alarm is in the INSUFFICIENT_DATA state.
  - UpdateInsufficientDataChecksVariable: Updates the variable InsufficientDataChecks with the value InsufficientDataChecks from the CheckAlarmInInsufficientDataState step.
  - BranchOnAlarmHasInsufficientData: Branches on the AlarmHasInsufficientData value from the CheckAlarmInInsufficientDataState step. The default step is CheckMetricMismatch.
  - CheckMetricMismatch: Checks whether the metric(s) used in the alarm match the metrics returned by ListMetrics.
  - UpdateMetricMismatchChecksVariable: Updates the variable MetricMismatchChecks with the value MetricMismatchChecks from the CheckMetricMismatch step.
  - BranchOnMetricsMatched: Branches on the MetricsMatched value from the CheckMetricMismatch step. The default step is CheckMissingDatapoint.
  - CheckMissingDatapoint: Verifies whether a metric was missing datapoint(s) at the given timestamp.
  - UpdateMetricMissingDatapointsChecksVariable: Updates the variable MetricMissingDatapointsChecks with the value MetricMissingDatapointsChecks from the CheckMissingDatapoint step.
  - BranchOnMetricMissingDatapoint: Branches on the MetricMissingDatapoint value from the CheckMissingDatapoint step. The default step is GetAlarmHistoryDetails.
  - GetAlarmHistoryDetails: Gets the most recent alarm history for the given timestamp.
  - UpdateAlarmHistoryChecksVariable: Updates the variable AlarmHistoryChecks with the value AlarmHistoryChecks from the GetAlarmHistoryDetails step.
  - BranchOnAlarmHistoryFound: Branches on the AlarmHistoryFound value from the GetAlarmHistoryDetails step. The default step is CheckDelayedMetric.
  - CheckDelayedMetric: Checks whether the alarm did not trigger because of delayed or missing metric data.
  - UpdateDelayedMetricChecksVariable: Updates the variable DelayedMetricChecks with the value DelayedMetricChecks from the CheckDelayedMetric step.
  - BranchOnMetricDelayedAndDatapointsMeetThreshold: Branches on the MetricDelayed and DatapointsMeetThreshold values from the CheckDelayedMetric step. The default step is GenerateReport.
  - CheckActionDelivered: Checks whether the alarm's enabled action(s) were delivered.
  - UpdateActionDeliveredChecksVariable: Updates the variable ActionDeliveredChecks with the output ActionDeliveredChecks from the CheckActionDelivered step.
  - GenerateReport: Compiles the output of the previous steps and outputs a report.
- After the execution completes, review the Outputs section for the detailed results of the execution:
  - GenerateReport.Report: A troubleshooting report for the provided CloudWatch metric alarm. Example:
    ------------------------------------------------------------------------------------------
    | AWS CloudWatch Alarm Troubleshooting Results |
    ------------------------------------------------------------------------------------------
    | Alarm Name - Demo-Alarm |
    | Timestamp - 2025-03-04T06:31:00Z |
    ------------------------------------------------------------------------------------------
    | ✅ No Issue(s) Found |
    ------------------------------------------------------------------------------------------

    ==========================================================================================
    1. Validating SSM Document input parameters:
    ==========================================================================================
    ✅ [PASSED]: Found a metric alarm with name Demo-Alarm

    ==========================================================================================
    2. Checking alarm's data state:
    ==========================================================================================
    ✅ [PASSED]: The alarm is not in INSUFFICIENT_DATA state, alarm's state is: ALARM

    ==========================================================================================
    3. Checking if the alarm experienced metric mismatches:
    ==========================================================================================
    ✅ [PASSED]: Metric matches with the configured metric for Alarm.

    ==========================================================================================
    4. Checking if the alarm's metric(s) experienced missing datapoint(s):
    ==========================================================================================
    ✅ [PASSED]: Metric has datapoints

    ==========================================================================================
    5. Retrieving alarm's history for timestamp 2025-03-04T06:31:00Z:
    ==========================================================================================
    ✅ [PASSED]: Found most recent alarm history item for the provided timestamp: '2025-03-04T06:31:00Z'

    ==========================================================================================
    6. Checking if the alarm experienced metric delays or the alarm's datapoint(s) did not meet the configured threshold:
    ==========================================================================================
    ✅ [PASSED]: CloudWatch alarm did not experience any delayed metric

    ==========================================================================================
    7. Checking if the alarm has actions enabled and if action(s) were delivered:
    ==========================================================================================
    ✅ [PASSED]: Successfully executed action arn:aws:sns:us-east-1:12345678910:Demo_Alarms_Topic

    ------------------------------------------------------------------------------------------
    ✅ All the checks have passed for CloudWatch alarm, Demo-Alarm, the alarm's configuration is correct.
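The input parameters described in the instructions above can also be validated locally and passed to the runbook programmatically. The sketch below is illustrative only. The helper names `build_parameters` and `start_runbook` are hypothetical. It assumes that the doubled backslashes in the documented Allowed Pattern values are JSON escaping, so each is written as a single backslash in the Python patterns. `start_runbook` additionally assumes that the third-party boto3 package is installed and that AWS credentials are configured.

```python
import re
from typing import Dict, List, Optional

DOCUMENT_NAME = "AWSSupport-TroubleshootCloudWatchAlarm"

# Assumption: "\\" in the documented patterns is an escaped single "\".
ALARM_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9.:;,\-_&() ]{1,255}$")
TIMESTAMP_PATTERN = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$"
)


def build_parameters(
    alarm_name: str,
    trigger_timestamp: str,
    assume_role_arn: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Validate the inputs and return an Automation parameter mapping.

    Each parameter value is wrapped in a list of strings.
    """
    if not ALARM_NAME_PATTERN.fullmatch(alarm_name):
        raise ValueError(f"Invalid CloudWatchMetricAlarmName: {alarm_name!r}")
    if not TIMESTAMP_PATTERN.fullmatch(trigger_timestamp):
        raise ValueError(f"Invalid AlarmTriggerTimestamp: {trigger_timestamp!r}")

    parameters = {
        "CloudWatchMetricAlarmName": [alarm_name],
        "AlarmTriggerTimestamp": [trigger_timestamp],
    }
    # AutomationAssumeRole is optional, so it is only sent when provided.
    if assume_role_arn:
        parameters["AutomationAssumeRole"] = [assume_role_arn]
    return parameters


def start_runbook(parameters: Dict[str, List[str]]) -> str:
    """Start the runbook and return the automation execution ID."""
    import boto3  # third-party; imported lazily so validation works without it

    ssm = boto3.client("ssm")
    response = ssm.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]


if __name__ == "__main__":
    # Values taken from the sample report above; nothing is sent to AWS here.
    print(build_parameters("Demo-Alarm", "2025-03-04T06:31:00Z"))
```

Keeping validation separate from the API call means malformed input is rejected before any request is made.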
References
Systems Manager Automation