Baseline alerts in monitoring and incident management for Amazon EKS in AMS Accelerate

After verifying the alerts, AMS enables the following alerts for Amazon EKS and then provides monitoring and incident management for your selected Amazon EKS clusters. The response time Service Level Agreements (SLAs) and Service Level Objectives (SLOs) depend on your selected account Service Tier (Plus, Premium). For more information, see Incident reports and service requests in AMS Accelerate.

Alerts and actions

The following table lists the Amazon EKS alerts and respective actions that AMS takes:

Alert	Thresholds	Action
Container OOM killed	The total number of container restarts within the last 10 minutes is at least 1 and a Kubernetes container in a pod has been terminated with the reason “OOMKilled” within the last 10 minutes.	AMS investigates whether the OOM kill was caused by reaching the container limit or by memory limit overcommit, and then advises you on corrective actions.
Pod Job Failed	A Kubernetes job fails to complete. Failure is indicated by the presence of at least one failed job status.	AMS investigates why the Kubernetes job or corresponding cron job is failing, and then advises you on corrective actions.
StatefulSet Down	The number of replicas ready to serve traffic doesn't match the current number of existing replicas per StatefulSet for at least 1 minute.	AMS determines why pods aren't ready by reviewing error messages in pod events and error log snippets in pod logs, and then advises you on corrective actions.
HPA Scaling Ability	The Horizontal Pod Autoscaler (HPA) can't scale because the status condition “AbleToScale” is false for at least 2 minutes.	AMS determines which Kubernetes Horizontal Pod Autoscaler (HPA) is unable to scale pods for its target workload resource, such as a Deployment or StatefulSet.
HPA Metric Availability	The Horizontal Pod Autoscaler (HPA) can't collect metrics because the status condition “ScalingActive” is false for at least 2 minutes.	AMS determines why the HPA can't collect metrics, for example because of metrics server configuration issues or RBAC authorization issues.
Pod Not Ready	A Kubernetes pod remains in a non-running state (such as Pending, Unknown, or Failed) for longer than 15 minutes.	AMS investigates the affected pods, reviews pod logs for related errors and events, and then advises you on corrective actions.
Pod Crash Looping	A pod container restarts at least once every 15 minutes for a 1-hour period.	AMS investigates the reasons for the pod not starting, such as insufficient resources, a file locked by another container, database locked by another container, service dependencies failing, DNS issues for external services, and misconfigurations.
DaemonSet Mis-scheduled	At least one Kubernetes DaemonSet pod is mis-scheduled over a 10-minute period.	AMS determines why DaemonSet pods are scheduled on a node where they aren't supposed to run. This might happen when the wrong nodeSelector, taints, or affinities were applied to the DaemonSet pods, or when nodes (node pools) were tainted and existing pods weren't scheduled for eviction.
Kubernetes API Errors	The Kubernetes API server error rate exceeds 3% over a 2-minute period.	AMS analyzes control plane logs to determine the volume and types of errors that are causing this alert, and identifies any resource contention issues for master node or etcd Auto Scaling groups. If the API server doesn't recover, AMS engages the Amazon EKS service team.
Kubernetes API Latency	The 99th percentile latency of requests to the Kubernetes API server exceeds 1 second over a 2-minute period.	AMS analyzes control plane logs to determine the volume and types of errors that are causing latency, and identifies any resource contention issues for master node or etcd Auto Scaling groups. If the API server doesn't recover, AMS engages the Amazon EKS service team.
Kubernetes Client Cert Expiring	The client certificate used to authenticate to the Kubernetes API server expires in less than 24 hours.	AMS sends this notification to inform you that your cluster certificate will expire within 24 hours.
Node Not Ready	The Node “Ready” condition status is false for at least 10 minutes.	AMS investigates the node conditions and events, such as network issues, that prevent kubelet access to the API server.
Node High CPU	The CPU load exceeds 80% over a 5-minute period.	AMS determines whether one or more pods are consuming an unusually high amount of CPU. Then, AMS verifies with you that your requests, limits, and pod activity are as expected.
Node OOM Kill Detected	The node reports at least one host OOM kill in a 4-minute window.	AMS determines whether the OOM kill was caused by reaching the container limit or by node overcommit. If the application activity is normal, AMS advises you on requests and limits for overcommits and on revising pod limits.
Node Conntrack Limit	The ratio of the current number of connection tracking entries to the maximum limit exceeds 80% over a 5-minute period.	AMS advises you on the recommended conntrack value per core. Kubernetes nodes set the conntrack max value proportional to the total memory capacity of the node. High load applications, especially on smaller nodes, can easily exceed the conntrack max value, resulting in connection resets and timeouts.
Node Clock Not in Sync	The minimum synchronization status over a 2-minute period is 0, and the maximum error in seconds is 16 or higher.	AMS determines whether Network Time Protocol (NTP) is installed and functioning properly.
Pod High CPU	The CPU usage of a container, calculated as a 3-minute rate, exceeds 80% for at least 2 minutes.	AMS investigates pod logs to determine the pod tasks that consume a high amount of CPU.
Pod High Memory	The memory usage of a container exceeds 80% of its specified memory limit over a 2-minute period.	AMS investigates pod logs to determine the pod tasks that consume a high amount of memory.
CoreDNS Down	CoreDNS has disappeared from Prometheus target discovery for more than 15 minutes.	This is a critical alert that indicates that domain name resolution for internal or external cluster services has stopped. AMS checks the status of the CoreDNS pods, verifies CoreDNS configuration, verifies DNS endpoints that point to CoreDNS pods, verifies CoreDNS limits, and with your approval, enables CoreDNS debug logging.
CoreDNS Errors	CoreDNS returns SERVFAIL errors for more than 3% of DNS requests over a 10-minute period.	This alert might signal an issue with an application or a misconfiguration. AMS checks the status of the CoreDNS pods, verifies CoreDNS configuration, verifies DNS endpoints that point to CoreDNS pods, verifies CoreDNS limits, and with your approval, enables CoreDNS debug logging.
CoreDNS Latency	The 99th percentile of DNS request durations exceeds 4 seconds for 10 minutes.	This alert signals that CoreDNS might be overloaded. AMS checks the status of CoreDNS pods, verifies CoreDNS configuration, verifies DNS endpoints that point to the CoreDNS pods, verifies CoreDNS limits, and with your approval, enables CoreDNS debug logging.
CoreDNS Forwarding Latency	The 99th percentile of the response time for CoreDNS forward requests to kube-dns exceeds 4 seconds over a 10-minute period.	When CoreDNS isn't the authoritative server or doesn't have a cache entry for a domain name, CoreDNS forwards the DNS request to an upstream DNS server. This alert signals that CoreDNS might be overloaded or there might be an issue with an upstream DNS server. AMS checks the status of CoreDNS pods, verifies CoreDNS configuration, verifies DNS endpoints that point to CoreDNS pods, verifies CoreDNS limits, and with your approval, enables CoreDNS debug logging.
CoreDNS Forwarding Error	More than 3% of DNS queries are failing over a 5-minute period.	When CoreDNS isn't the authoritative server or doesn't have a cache entry for a domain name, CoreDNS forwards the DNS request to an upstream DNS server. This alert signals a possible misconfiguration or an issue with an upstream DNS server. AMS checks the status of CoreDNS pods, verifies CoreDNS configuration, verifies DNS endpoints that point to CoreDNS pods, verifies CoreDNS limits, and with your approval, enables CoreDNS debug logging.
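
Many thresholds in the table combine a limit with a duration, such as the Node Conntrack Limit alert (ratio above 80% over a 5-minute period). The following Python sketch illustrates one possible reading of such a rule. It is not how AMS implements its alerts: the names Sample, conntrack_ratio, and sustained_breach are hypothetical, and the sketch assumes that "over an N-minute period" means every observation in the trailing window is above the limit, which the table does not specify.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One metric observation (hypothetical type for this sketch)."""
    timestamp: float  # seconds
    value: float


def conntrack_ratio(entries: int, limit: int) -> float:
    """Ratio of current connection tracking entries to the maximum limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return entries / limit


def sustained_breach(samples: Sequence[Sample], threshold: float,
                     window_seconds: float, now: float) -> bool:
    """Return True if the metric stayed above threshold for the whole window.

    Assumption: a breach requires (a) at least one sample at or before the
    start of the window, so the full window was observed, and (b) every
    sample inside the window to be strictly above the threshold.
    """
    start = now - window_seconds
    if not samples or min(s.timestamp for s in samples) > start:
        return False  # not enough history to cover the window
    in_window = [s for s in samples if start <= s.timestamp <= now]
    return bool(in_window) and all(s.value > threshold for s in in_window)


# Usage: one sample per minute for 6 minutes, 85% of a hypothetical limit.
history = [Sample(t, conntrack_ratio(111_411, 131_072)) for t in range(0, 361, 60)]
alert_fires = sustained_breach(history, threshold=0.80, window_seconds=300, now=360)
```

Requiring a sample at or before the window start keeps a single high reading from being treated as a sustained breach. A rule based on the window average would behave differently around short dips.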
