Reliability and performance monitoring for AWS CloudHSM
You can use Amazon CloudWatch Logs to monitor your AWS CloudHSM cluster in near real time. Using CloudWatch metrics, you can configure CloudWatch alarms to alert you if any of these metrics exceed their defined thresholds. For more information, see Working with Amazon CloudWatch Logs and AWS CloudHSM Audit Logs and Getting CloudWatch metrics for AWS CloudHSM in the AWS CloudHSM documentation.
This section describes how to configure alarms for the following metrics, which can help you monitor the reliability status of AWS CloudHSM clusters and hardware security modules (HSMs):
Unhealthy HSM instance (Recommended)
The HsmUnhealthy metric indicates that the HSM instance is not performing properly. The baseline value for this metric is zero. If the metric is greater than zero, one or more HSMs in the cluster are not working as expected. AWS CloudHSM automatically replaces unhealthy instances for you. However, all requests that were sent to the HSM after it started behaving unexpectedly and before it was marked as unhealthy will fail.
Creating an alarm on this metric helps you validate that the unhealthy HSM instance has been successfully replaced. It also provides insights about application-reported errors that might be the result of the unhealthy HSM.
If you receive an alarm for this metric, monitor the application to make sure that it can handle failures for a short duration, and validate that it is still working as expected after the HSM is replaced.
The following table shows the configuration values for this alarm. For instructions about how to set up this alarm, see Create a CloudWatch alarm based on a static threshold in the CloudWatch documentation.
| Property | Value |
|---|---|
| Metric | HsmUnhealthy |
| Namespace | AWS/CloudHSM |
| Dimension | |
| Statistic | |
| Threshold type | Static |
| Whenever duration is | |
| Than | |
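As an alternative to the console, the alarm can be created programmatically. The following is a minimal sketch, not a definitive configuration. It assumes the AWS SDK for Python (boto3) is installed and credentials are configured. The `ClusterId` dimension, the `Maximum` statistic, the 60-second period, the missing-data handling, and the alarm name are assumptions for illustration. Check the dimensions that your cluster actually publishes before relying on them. Only the metric name and the "greater than zero" condition come from the description above.

```python
def build_unhealthy_hsm_alarm(cluster_id, alarm_actions=()):
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"cloudhsm-{cluster_id}-hsm-unhealthy",  # hypothetical name
        "AlarmDescription": "One or more HSMs in the cluster are unhealthy",
        "Namespace": "AWS/CloudHSM",
        "MetricName": "HsmUnhealthy",
        # Assumption: the metric is published with a ClusterId dimension.
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "Statistic": "Maximum",      # assumption
        "Period": 60,                # assumption, in seconds
        "EvaluationPeriods": 1,      # assumption
        "Threshold": 0.0,            # baseline is zero
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption
        "AlarmActions": list(alarm_actions),  # for example, an SNS topic ARN
    }


def create_alarm(params, region_name=None):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so building the parameters needs no SDK

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    cloudwatch.put_metric_alarm(**params)
```

Calling `create_alarm(build_unhealthy_hsm_alarm("cluster-example"))` with your own cluster ID would create the alarm. Keeping the parameter construction separate from the API call makes the configuration easy to review before it is applied.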
Note
You cannot make an HSM unhealthy in order to test the alarm or the application performance. However, you can simulate an HSM failure by blocking and unblocking the traffic between the application and the HSM for a short amount of time. To block this traffic, you can modify your security groups or network access control lists (network ACLs).
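The block-and-unblock approach described in this note could be scripted so that traffic is always restored, even if the test fails partway through. The following sketch is illustrative only. It assumes a boto3 EC2 client is passed in as `ec2`, that the cluster security group has an inbound rule for TCP ports 2223-2225 from a client security group, and that both group IDs are known. The helper name and parameters are hypothetical. Removing a security group rule might not interrupt connections that are already established, in which case a network ACL change is the alternative that the note mentions.

```python
import contextlib


def hsm_ingress_rule(client_sg_id):
    """Describe the inbound rule that lets the application reach the HSMs.

    Assumption: client traffic uses TCP ports 2223-2225 and is allowed
    by referencing the client's security group.
    """
    return [{
        "IpProtocol": "tcp",
        "FromPort": 2223,
        "ToPort": 2225,
        "UserIdGroupPairs": [{"GroupId": client_sg_id}],
    }]


@contextlib.contextmanager
def hsm_traffic_blocked(ec2, cluster_sg_id, client_sg_id):
    """Temporarily revoke the inbound rule, then restore it on exit."""
    permissions = hsm_ingress_rule(client_sg_id)
    ec2.revoke_security_group_ingress(
        GroupId=cluster_sg_id, IpPermissions=permissions
    )
    try:
        yield
    finally:
        # Always restore access, even if the test body raises an error.
        ec2.authorize_security_group_ingress(
            GroupId=cluster_sg_id, IpPermissions=permissions
        )
```

A test might wrap a short wait in `with hsm_traffic_blocked(ec2, cluster_sg_id, client_sg_id):` while you observe how the application handles failed requests. Run this kind of test only in a non-production environment.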
HSM temperature
The HsmTemperature metric reports the junction temperature of the hardware processor. The HSM becomes unhealthy if the temperature reaches 110 degrees Celsius. An alarm for this metric can help you anticipate whether an HSM will become unhealthy.
The following table shows the configuration values for this alarm. For instructions about how to set up this alarm, see Create a CloudWatch alarm based on a static threshold in the CloudWatch documentation.
| Property | Value |
|---|---|
| Metric | HsmTemperature |
| Namespace | AWS/CloudHSM |
| Dimension | |
| Statistic | |
| Threshold type | Static |
| Whenever duration is | |
| Than | |
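A temperature alarm is useful only if it fires before the HSM reaches the 110-degree limit, so the warning threshold must sit below that value. The following sketch builds the alarm parameters and rejects thresholds that would be too late to act on. The dimension names (`ClusterId`, `HsmId`), the statistic, the period, the number of evaluation periods, and the alarm name are assumptions for illustration. The warning threshold is left for you to choose.

```python
HSM_UNHEALTHY_TEMP_C = 110.0  # temperature at which the HSM becomes unhealthy


def build_hsm_temperature_alarm(cluster_id, hsm_id, warn_threshold_c,
                                alarm_actions=()):
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    if not 0 < warn_threshold_c < HSM_UNHEALTHY_TEMP_C:
        raise ValueError(
            "warn_threshold_c must be above 0 and below "
            f"{HSM_UNHEALTHY_TEMP_C} to give advance warning"
        )
    return {
        "AlarmName": f"cloudhsm-{hsm_id}-temperature",  # hypothetical name
        "Namespace": "AWS/CloudHSM",
        "MetricName": "HsmTemperature",
        # Assumption: the metric is published per HSM with these dimensions.
        "Dimensions": [
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "HsmId", "Value": hsm_id},
        ],
        "Statistic": "Maximum",      # assumption
        "Period": 60,                # assumption, in seconds
        "EvaluationPeriods": 3,      # assumption: ignore brief spikes
        "Threshold": float(warn_threshold_c),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(alarm_actions),
    }
```

The returned dictionary can be passed to the boto3 CloudWatch client as `put_metric_alarm(**params)`. For example, `build_hsm_temperature_alarm("cluster-example", "hsm-example", 100)` uses a hypothetical warning threshold of 100 degrees Celsius.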