Reliability and performance monitoring for AWS CloudHSM - AWS Prescriptive Guidance

You can use Amazon CloudWatch Logs to monitor your AWS CloudHSM cluster in near real time. Using CloudWatch metrics, you can configure CloudWatch alarms to alert you if a metric exceeds its defined threshold. For more information, see Working with Amazon CloudWatch Logs and AWS CloudHSM Audit Logs and Getting CloudWatch metrics for AWS CloudHSM in the AWS CloudHSM documentation.

This section describes how to configure alarms for two metrics, HsmUnhealthy and HsmTemperature, which can help you monitor the reliability status of AWS CloudHSM clusters and hardware security modules (HSMs).

Unhealthy HSM instance (Recommended)

The HsmUnhealthy metric indicates that the HSM instance is not performing properly. The baseline value for this metric is zero. If the metric is greater than zero, one or more HSMs in the cluster are not working as expected. AWS CloudHSM automatically replaces unhealthy instances for you. However, any request sent to the HSM after it starts behaving unexpectedly and before it is marked as unhealthy will fail.

Creating an alarm on this metric helps you validate that the unhealthy HSM instance has been successfully replaced. It also provides insights about application-reported errors that might be the result of the unhealthy HSM.

If you receive an alarm for this metric, monitor the application to make sure that it can handle failures for a short duration, and validate that it is still working as expected after the HSM is replaced.

The following table shows the configuration values for this alarm. For instructions about how to set up this alarm, see Create a CloudWatch alarm based on a static threshold in the CloudWatch documentation.

| Property | Value |
| --- | --- |
| Metric | HsmUnhealthy |
| Namespace | AWS/CloudHSM |
| Dimension | HSM ID and cluster ID |
| Statistic | Maximum |
| Threshold type | Static |
| Whenever duration is | Greater/Equal |
| Than | 1 |
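
The alarm settings in the table above can also be expressed in code. The following is a minimal sketch, not a definitive implementation, that builds the arguments for the CloudWatch `put_metric_alarm` API call using boto3. Several values are assumptions that do not come from this guidance: the dimension names `ClusterId` and `HsmId` (confirm the exact names in your account by listing the metrics in the AWS/CloudHSM namespace), the alarm naming scheme, the 60-second period, and the single evaluation period.

```python
from typing import Any, Dict


def hsm_unhealthy_alarm_params(cluster_id: str, hsm_id: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that mirror the HsmUnhealthy table."""
    return {
        # Hypothetical naming scheme; use whatever convention your team prefers.
        "AlarmName": f"cloudhsm-{hsm_id}-unhealthy",
        "Namespace": "AWS/CloudHSM",
        "MetricName": "HsmUnhealthy",
        # Assumed dimension names; verify them against the metrics in your account.
        "Dimensions": [
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "HsmId", "Value": hsm_id},
        ],
        "Statistic": "Maximum",
        # Static threshold: alarm when the value is greater than or equal to 1.
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "Period": 60,            # assumption: evaluate one-minute data points
        "EvaluationPeriods": 1,  # assumption: alarm on the first breaching period
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

To create the alarm, you would call `create_alarm(hsm_unhealthy_alarm_params("<cluster ID>", "<HSM ID>"))` with your own identifiers. Keeping the parameter builder separate from the API call makes it possible to review or test the alarm definition without touching an AWS account.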

Note

You cannot make an HSM unhealthy in order to test the alarm or the application performance. However, you can simulate an HSM failure by blocking and unblocking the traffic between the application and the HSM for a short amount of time. To block this traffic, you can modify your security groups or network access control lists (network ACLs).

HSM temperature

The HsmTemperature metric reports the junction temperature of the hardware processor. The HSM becomes unhealthy if the temperature reaches 110 degrees Celsius. An alarm for this metric can help you anticipate whether an HSM will become unhealthy.

The following table shows the configuration values for this alarm. For instructions about how to set up this alarm, see Create a CloudWatch alarm based on a static threshold in the CloudWatch documentation.

| Property | Value |
| --- | --- |
| Metric | HsmTemperature |
| Namespace | AWS/CloudHSM |
| Dimension | HSM ID and cluster ID |
| Statistic | Maximum |
| Threshold type | Static |
| Whenever duration is | Greater/Equal |
| Than | 90 |
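
As a rough illustration of the temperature alarm, the sketch below builds `put_metric_alarm` arguments for HsmTemperature with the threshold of 90 from the table, and rejects any threshold at or above the 110-degree point at which the HSM becomes unhealthy, because such an alarm would give no advance warning. The dimension names `ClusterId` and `HsmId`, the alarm name, the period, and the evaluation period count are assumptions rather than values from this guidance, so verify them before use.

```python
from typing import Any, Dict

ALARM_THRESHOLD_C = 90.0   # alarm threshold from the table
UNHEALTHY_LIMIT_C = 110.0  # temperature at which the HSM becomes unhealthy


def hsm_temperature_alarm_params(
    cluster_id: str, hsm_id: str, threshold_c: float = ALARM_THRESHOLD_C
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that mirror the HsmTemperature table."""
    if threshold_c >= UNHEALTHY_LIMIT_C:
        # An alarm at or above the limit cannot warn before the HSM is unhealthy.
        raise ValueError("threshold must be below the unhealthy limit")
    return {
        "AlarmName": f"cloudhsm-{hsm_id}-temperature",  # hypothetical name
        "Namespace": "AWS/CloudHSM",
        "MetricName": "HsmTemperature",
        # Assumed dimension names; verify them against the metrics in your account.
        "Dimensions": [
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "HsmId", "Value": hsm_id},
        ],
        "Statistic": "Maximum",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": threshold_c,
        "Period": 60,            # assumption: evaluate one-minute data points
        "EvaluationPeriods": 1,  # assumption: alarm on the first breaching period
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

With the default threshold, the alarm leaves a 20-degree margin between the first alert and the point at which the HSM is marked unhealthy.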