Recommended CloudWatch alarms for Amazon OpenSearch Service
CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want AWS to email you if your cluster health status is red for longer than one minute. This section includes some recommended alarms for Amazon OpenSearch Service and how to respond to them.
You can automatically deploy these alarms using AWS CloudFormation. For a sample stack, see the related GitHub repository.
Note
If you deploy the CloudFormation stack, the KMSKeyError and KMSKeyInaccessible alarms will exist in an Insufficient Data state because these metrics only appear if a domain encounters a problem with its encryption key.
For more information about configuring alarms, see Creating Amazon CloudWatch Alarms in the Amazon CloudWatch User Guide.
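As a rough illustration of how a recommended alarm maps onto CloudWatch alarm settings, the sketch below builds the parameters for the ClusterStatus.red alarm (maximum is >= 1 for 1 minute, 1 consecutive time). It assumes the `AWS/ES` namespace and the `DomainName` and `ClientId` dimensions that OpenSearch Service metrics are published under. The domain name, account ID, and SNS topic ARN are placeholders, and the function name is hypothetical.

```python
def build_cluster_red_alarm(domain_name, account_id, sns_topic_arn):
    """Return keyword arguments shaped for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{domain_name}-ClusterStatus.red",
        "Namespace": "AWS/ES",  # namespace used by OpenSearch Service metrics
        "MetricName": "ClusterStatus.red",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},  # AWS account ID
        ],
        "Statistic": "Maximum",      # "maximum is ..."
        "Period": 60,                # "... for 1 minute" (seconds)
        "EvaluationPeriods": 1,      # "... 1 consecutive time"
        "Threshold": 1.0,            # ">= 1"
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # for example, a topic that emails you
    }


# Placeholder values for illustration only.
params = build_cluster_red_alarm(
    "my-domain",
    "123456789012",
    "arn:aws:sns:us-east-1:123456789012:my-alerts-topic",
)
print(params["AlarmName"])
```

If boto3 is installed and credentials are configured, these parameters could be passed along as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The other rows in the following table should translate the same way by changing the metric name, statistic, period, evaluation periods, and threshold.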
| Alarm | Issue | 
|---|---|
| ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time | At least one primary shard and its replicas are not allocated to a node. See Red cluster status. | 
| ClusterStatus.yellow maximum is >= 1 for 1 minute, 5 consecutive times | At least one replica shard is not allocated to a node. See Yellow cluster status. | 
| FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time | A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. The threshold is in MiB. Rather than a fixed 20480, we recommend setting it to 25% of the storage space for each node. | 
| ClusterIndexWritesBlocked is >= 1 for 5 minutes, 1 consecutive time | Your cluster is blocking write requests. See ClusterBlockException. | 
| Nodes minimum is < x for 1 day, 1 consecutive time | x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. See Failed cluster nodes. | 
| AutomatedSnapshotFailure maximum is >= 1 for 1 minute, 1 consecutive time | An automated snapshot failed. This failure is often the result of a red cluster health status. See Red cluster status. For a summary of all automated snapshots and some information about failures, try one of the following requests:  | 
| CPUUtilization or WarmCPUUtilization maximum is >= 80% for 15 minutes, 3 consecutive times | 100% CPU utilization might occur sometimes, but sustained high usage is problematic. Consider using larger instance types or adding instances. | 
| JVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | The cluster could encounter out of memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances. | 
| OldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | |
| MasterCPUUtilization maximum is >= 50% for 15 minutes, 3 consecutive times | Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower CPU usage than data nodes. | 
| MasterJVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | |
| MasterOldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | |
| KMSKeyError is >= 1 for 1 minute, 1 consecutive time | The AWS KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. For more information, see Encryption of data at rest for Amazon OpenSearch Service. | 
| KMSKeyInaccessible is >= 1 for 1 minute, 1 consecutive time | The AWS KMS encryption key that is used to encrypt data at rest in your domain has been deleted, or its grants to OpenSearch Service have been revoked. You can't recover domains that are in this state. However, if you have a manual snapshot, you can use it to migrate to a new domain. To learn more, see Encryption of data at rest for Amazon OpenSearch Service. | 
| shards.active is >= 30000 for 1 minute, 1 consecutive time | The total number of active primary and replica shards has reached 30,000 or more. You might be rotating your indexes too frequently. Consider using ISM to remove indexes once they reach a specific age. | 
| 5xx alarms >= 10% of OpenSearchRequests | One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture. | 
| MasterReachableFromNode maximum is < 1 for 5 minutes, 1 consecutive time | This alarm indicates that the master node stopped or is unreachable. These failures are usually the result of a network connectivity issue or an AWS dependency problem. | 
| ThreadpoolWriteQueue average is >= 100 for 1 minute, 1 consecutive time | The cluster is experiencing high indexing concurrency. Review and control indexing requests, or increase cluster resources. | 
| ThreadpoolSearchQueue average is >= 500 for 1 minute, 1 consecutive time | The cluster is experiencing high search concurrency. Consider scaling your cluster. You can also increase the search queue size, but increasing it excessively can cause out of memory errors. | 
| ThreadpoolSearchQueue maximum is >= 5000 for 1 minute, 1 consecutive time | |
| Increase in ThreadpoolSearchRejected SUM is >= 1 {math expression DIFF()} for 1 minute, 1 consecutive time | These alarms notify you of domain issues that might impact performance and stability. | 
| Increase in ThreadpoolWriteRejected SUM is >= 1 {math expression DIFF()} for 1 minute, 1 consecutive time | |
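Two rows in the preceding table involve simple arithmetic: the FreeStorageSpace threshold (25% of each node's storage, expressed in MiB) and the Java heap size (half of an instance's RAM, up to 32 GiB). The helper functions below are a minimal sketch of those calculations. Their names are hypothetical, and the node sizes are example inputs rather than recommendations.

```python
MIB_PER_GIB = 1024


def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """FreeStorageSpace alarm threshold in MiB for one node's storage size."""
    return int(node_storage_gib * MIB_PER_GIB * fraction)


def jvm_heap_gib(instance_ram_gib):
    """Heap size implied by the table: half of RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, 32.0)


print(free_storage_threshold_mib(80))   # 20480, the table's default threshold
print(free_storage_threshold_mib(512))  # 131072
print(jvm_heap_gib(16))                 # 8.0
print(jvm_heap_gib(64))                 # 32.0
print(jvm_heap_gib(128))                # 32.0, no additional heap beyond 64 GiB of RAM
```

The default of 20480 MiB corresponds to 25% of an 80 GiB node, so larger volumes call for a proportionally larger threshold. The heap cap is why the table suggests scaling horizontally once instances reach 64 GiB of RAM.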
Note
If you just want to view metrics, see Monitoring OpenSearch cluster metrics with Amazon CloudWatch.
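The 5xx and Threadpool rejection alarms in the preceding table compare a computed value rather than a single metric, which in CloudWatch means a metric math alarm. The sketch below builds PutMetricAlarm parameters for both cases using the `Metrics` list form. The function names are hypothetical, the `AWS/ES` namespace and dimensions are assumed as for other OpenSearch Service metrics, and the 1-minute period and single evaluation period for the 5xx alarm are assumptions because the table does not state them.

```python
def metric_query(query_id, metric_name, domain_name, account_id, period=60):
    """One input metric (Sum statistic) for a metric math alarm."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ES",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DomainName", "Value": domain_name},
                    {"Name": "ClientId", "Value": account_id},
                ],
            },
            "Period": period,
            "Stat": "Sum",
        },
        "ReturnData": False,  # only the expression result drives the alarm
    }


def build_5xx_rate_alarm(domain_name, account_id):
    """Alarm when 5xx responses are >= 10% of OpenSearchRequests."""
    return {
        "AlarmName": f"{domain_name}-5xx-rate",
        "Metrics": [
            metric_query("m1", "5xx", domain_name, account_id),
            metric_query("m2", "OpenSearchRequests", domain_name, account_id),
            {"Id": "e1", "Expression": "(m1 / m2) * 100", "ReturnData": True},
        ],
        "EvaluationPeriods": 1,  # assumption: not specified in the table
        "Threshold": 10.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def build_rejected_increase_alarm(domain_name, account_id, metric_name):
    """Alarm when the rejected count rises by >= 1 between datapoints."""
    return {
        "AlarmName": f"{domain_name}-{metric_name}-increase",
        "Metrics": [
            metric_query("m1", metric_name, domain_name, account_id),
            {"Id": "e1", "Expression": "DIFF(m1)", "ReturnData": True},
        ],
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


rate_alarm = build_5xx_rate_alarm("my-domain", "123456789012")
search_alarm = build_rejected_increase_alarm(
    "my-domain", "123456789012", "ThreadpoolSearchRejected"
)
print(rate_alarm["Metrics"][-1]["Expression"])
print(search_alarm["Metrics"][-1]["Expression"])
```

`DIFF` returns the difference between each datapoint and the previous one, which fits the rejection metrics if they report running totals rather than per-period counts. Calling the function again with `ThreadpoolWriteRejected` gives the parameters for the write alarm.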
Other alarms you might consider
Consider configuring the following alarms depending on which OpenSearch Service features you regularly use.
| Alarm | Issue | 
|---|---|
| WarmFreeStorageSpace is >= 10% | You have reached 10% of your total free warm storage. WarmFreeStorageSpace measures the sum of your free warm storage space in MiB. UltraWarm uses Amazon S3 rather than attached disks. | 
| HotToWarmMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times | A high number of indexes are concurrently moving from hot to UltraWarm storage. Consider scaling your cluster. | 
| HotToWarmMigrationSuccessLatency is >= 1 day, 1 consecutive time | Configure this alarm so that you're notified if the | 
| WarmJVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | The cluster could encounter out of memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances. | 
| WarmOldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | |
| WarmToColdMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times | A high number of indexes are concurrently moving from UltraWarm to cold storage. Consider scaling your cluster. | 
| HotToWarmMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time | Migrations might fail during snapshots, shard relocations, or force merges. Failures during snapshots or shard relocation are typically due to node failures or S3 connectivity issues. Lack of disk space is usually the underlying cause of force merge failures. | 
| WarmToColdMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time | Migrations usually fail when attempts to migrate index metadata to cold storage fail. Failures can also happen when the warm index cluster state is being removed. | 
| WarmToColdMigrationLatency is >= 1 day, 1 consecutive time | Configure this alarm so that you're notified if the | 
| AlertingDegraded is >= 1 for 1 minute, 1 consecutive time | Either the alerting index is red, or one or more nodes are not on schedule. | 
| ADPluginUnhealthy is >= 1 for 1 minute, 1 consecutive time | The anomaly detection plugin isn't functioning properly, either because of high failure rates or because one of the indexes being used is red. | 
| AsynchronousSearchFailureRate is >= 1 for 1 minute, 1 consecutive time | At least one asynchronous search failed in the last minute, which likely means the coordinator node failed. The lifecycle of an asynchronous search request is managed solely on the coordinator node, so if the coordinator goes down, the request fails. | 
| AsynchronousSearchStoreHealth is >= 1 for 1 minute, 1 consecutive time | The health of the asynchronous search response store in the persisted index is red. You might be storing large asynchronous responses, which can destabilize a cluster. Try to limit your asynchronous search responses to 10 MB or less. | 
| SQLUnhealthy is >= 1 for 1 minute, 3 consecutive times | The SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Troubleshoot the requests that your clients are making to the plugin. | 
| LTRStatus.red is >= 1 for 1 minute, 1 consecutive time | At least one of the indexes needed to run the Learning to Rank plugin has missing primary shards and isn't functional. |
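Several alarms in these tables fire only after a condition holds for a number of consecutive periods, such as WarmJVMMemoryPressure at >= 95% for 1 minute, 3 consecutive times. The function below is a simplified model of that rule, assuming one datapoint per period and no missing data. It is an illustration of the idea, not a description of how CloudWatch evaluates alarms internally, and the sample values are invented.

```python
def would_alarm(datapoints, threshold, consecutive_periods):
    """True if the most recent `consecutive_periods` datapoints are all >= threshold."""
    if consecutive_periods < 1:
        raise ValueError("consecutive_periods must be at least 1")
    if len(datapoints) < consecutive_periods:
        return False  # not enough data yet to evaluate
    recent = datapoints[-consecutive_periods:]
    return all(value >= threshold for value in recent)


# Invented per-minute maximums of WarmJVMMemoryPressure, oldest first.
sustained = [91.0, 96.2, 97.5, 95.0]
brief_spike = [96.2, 94.0, 97.5]

print(would_alarm(sustained, 95, 3))    # True: the last three minutes all breach
print(would_alarm(brief_spike, 95, 3))  # False: the middle minute dipped below 95
```

Requiring several consecutive breaches keeps short spikes from triggering an alarm, at the cost of a slower notification when the problem is real.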