Amazon CloudWatch alarms for cluster metrics
AWS ParallelCluster configures Amazon CloudWatch alarms to monitor the health and resource utilization of the head node.
Alarms are named cluster-name-HeadNode-metric, where cluster-name is the name of your cluster and metric
identifies the metric being monitored.
Access the alarms in the CloudWatch console by choosing Alarms in the navigation pane.
A composite alarm named cluster-name-HeadNode enters the ALARM state when any of the individual head node alarms triggers.
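The naming scheme and the composite alarm behavior described above can be sketched in Python. The helper names head_node_alarm_name and composite_state are hypothetical and are not part of AWS ParallelCluster or any AWS SDK. The sketch also ignores the INSUFFICIENT_DATA state, so treat it as an illustration only.

```python
def head_node_alarm_name(cluster_name: str, metric: str = "") -> str:
    """Build a head node alarm name (hypothetical helper).

    With a metric suffix such as "Disk" or "Cpu", return the individual
    alarm name. With no suffix, return the composite alarm name.
    """
    base = f"{cluster_name}-HeadNode"
    return f"{base}-{metric}" if metric else base


def composite_state(child_states: list[str]) -> str:
    """Simplified model: ALARM if any individual alarm is in ALARM."""
    return "ALARM" if "ALARM" in child_states else "OK"


print(head_node_alarm_name("my-cluster", "Disk"))  # my-cluster-HeadNode-Disk
print(head_node_alarm_name("my-cluster"))          # my-cluster-HeadNode
print(composite_state(["OK", "ALARM", "OK"]))      # ALARM
```

A name built this way can be pasted into the search box on the Alarms page of the CloudWatch console to locate the alarm for a specific cluster.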
Disk and memory alarms
Starting with AWS ParallelCluster version 3.6.0, the following CloudWatch alarms are created:
- cluster-name-HeadNode-Disk: Monitors the root volume disk_used_percent metric. Enters the ALARM state when disk usage is greater than 90% for 1 data point within a 1-minute period.
- cluster-name-HeadNode-Mem: Monitors the mem_used_percent metric. Enters the ALARM state when memory usage is greater than 90% for 1 data point within a 1-minute period.
For more information, see Metrics collected by the CloudWatch agent in the Amazon CloudWatch User Guide.
Health check and CPU alarms
Starting with AWS ParallelCluster version 3.8.0, the following CloudWatch alarms are created:
- cluster-name-HeadNode-Health: Monitors the Amazon EC2 StatusCheckFailed metric. Enters the ALARM state when the value is greater than 0 for 1 data point within a 1-minute period.
- cluster-name-HeadNode-Cpu: Monitors the Amazon EC2 CPUUtilization metric. Enters the ALARM state when CPU utilization is greater than 90% for 1 data point within a 1-minute period.
Cluster management daemon heartbeat alarm
Starting with AWS ParallelCluster version 3.15.0, when Amazon CloudWatch logging is enabled and the Slurm scheduler is used, the following alarm is created:
- cluster-name-HeadNode-ClustermgtdHeartbeat: Monitors the ClustermgtdHeartbeat metric in the ParallelCluster namespace. Enters the ALARM state when fewer than 1 heartbeat is received for 10 consecutive data points with a 1-minute period (10 minutes in total). Missing data is treated as breaching.
Note
All alarms recover symmetrically: the same number of data points and the same evaluation period that trigger
the alarm also govern recovery. For example, alarms with 1 data point recover after 1 good data point within
the same observation period. Similarly, the ClustermgtdHeartbeat alarm requires 10 consecutive good data
points (10 minutes) to return to OK.
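The symmetric recovery described in this note can be modeled with a short Python sketch. It is a simplified model of the behavior stated here, not CloudWatch's actual evaluation algorithm. The function name simulate_alarm and its parameters are hypothetical. Because missing data is treated as breaching for the heartbeat alarm, a caller would pass True for any minute without a heartbeat.

```python
from collections import deque


def simulate_alarm(breaching: list[bool], datapoints: int) -> list[str]:
    """Return the alarm state after each 1-minute data point.

    Simplified model: the alarm moves to ALARM after `datapoints`
    consecutive breaching points and back to OK after `datapoints`
    consecutive good points. Otherwise it keeps its current state.
    """
    state = "OK"
    window = deque(maxlen=datapoints)  # most recent data points only
    states = []
    for is_breaching in breaching:
        window.append(is_breaching)
        if len(window) == datapoints:
            if all(window):
                state = "ALARM"
            elif not any(window):
                state = "OK"
        states.append(state)
    return states


# 1-data-point alarm (for example Disk or Cpu): one bad point, one good point.
print(simulate_alarm([True, False], datapoints=1))  # ['ALARM', 'OK']

# Heartbeat alarm: 10 breaching minutes, then 10 good minutes.
heartbeat = simulate_alarm([True] * 10 + [False] * 10, datapoints=10)
print(heartbeat[9], heartbeat[18], heartbeat[19])  # ALARM ALARM OK
```

In this model the heartbeat alarm stays in ALARM through the first nine good minutes and returns to OK only on the tenth, which matches the 10-minute recovery described above.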
Note
AWS ParallelCluster doesn't configure alarm actions. For information about how to set up alarm actions, such as sending notifications, see Alarm actions. For more information about Amazon CloudWatch alarms, see Using Amazon CloudWatch alarms in the Amazon CloudWatch User Guide.
For AWS ParallelCluster version 3.8.0 and later, disable alarms by setting Monitoring / Alarms / Enabled
to false in your cluster configuration.
For AWS ParallelCluster versions before 3.8.0, disable alarms by setting Monitoring / Dashboards / CloudWatch / Enabled
to false in your cluster configuration. This setting also disables the Amazon CloudWatch dashboard. See Amazon CloudWatch dashboard for additional details.
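The version-dependent settings above can be summarized in a small Python sketch that takes a cluster configuration loaded into a nested dictionary and reports whether head node alarms are enabled. The function name alarms_enabled is hypothetical. Treating a missing key as enabled is an assumption of this sketch, as is the reading that no alarms exist before version 3.6.0, the first version this page lists.

```python
def alarms_enabled(config: dict, version: tuple[int, int, int]) -> bool:
    """Report whether head node alarms are enabled (hypothetical helper).

    `config` mirrors the nesting of the cluster configuration file.
    Missing keys are assumed to default to True in this sketch.
    """
    if version < (3, 6, 0):
        # Per this page, the first alarms are created starting with 3.6.0.
        return False
    monitoring = config.get("Monitoring", {})
    if version >= (3, 8, 0):
        # Version 3.8.0 and later: Monitoring / Alarms / Enabled
        return monitoring.get("Alarms", {}).get("Enabled", True)
    # Before 3.8.0: Monitoring / Dashboards / CloudWatch / Enabled
    dashboards = monitoring.get("Dashboards", {})
    return dashboards.get("CloudWatch", {}).get("Enabled", True)


new_style = {"Monitoring": {"Alarms": {"Enabled": False}}}
old_style = {"Monitoring": {"Dashboards": {"CloudWatch": {"Enabled": False}}}}

print(alarms_enabled(new_style, (3, 8, 0)))  # False
print(alarms_enabled(old_style, (3, 7, 2)))  # False
print(alarms_enabled({}, (3, 15, 0)))        # True
```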