# Amazon EKS and Kubernetes Container Insights with enhanced observability metrics


The following tables list the metrics and dimensions that Container Insights with enhanced observability collects for Amazon EKS and Kubernetes. These metrics are in the `ContainerInsights` namespace. For more information, see [Metrics](cloudwatch_concepts.md#Metric).

If you do not see any Container Insights with enhanced observability metrics in your console, be sure that you have completed the setup of Container Insights with enhanced observability. Metrics do not appear before Container Insights with enhanced observability has been set up completely. For more information, see [Setting up Container Insights](deploy-container-insights.md).

If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. See the **Metric Name** column of the table to see which metrics are not collected for Windows.

With the earlier version of Container Insights which delivers aggregated metrics at Cluster and Service level, the metrics are charged as custom metrics. With Container Insights with enhanced observability for Amazon EKS, Container Insights metrics are charged per observation instead of being charged per metric stored or log ingested. For more information about CloudWatch pricing, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/). 

**Note**  
On Windows, network metrics such as `pod_network_rx_bytes` and `pod_network_tx_bytes` are not collected for host process containers.  
On RedHat OpenShift on AWS (ROSA) clusters , diskio metrics such as `node_diskio_io_serviced_total` and `node_diskio_io_service_bytes_total` are not collected.


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `cluster_failed_node_count`  |  `ClusterName`  |  The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any *node conditions*. For more information, see [Conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition) in the Kubernetes documentation.  | 
|  `cluster_node_count`  |  `ClusterName`  |  The total number of worker nodes in the cluster.  | 
|  `namespace_number_of_running_pods`  |  `Namespace` `ClusterName` `ClusterName`  |  The number of pods running per namespace in the resource that is specified by the dimensions that you're using.  | 
|  `node_cpu_limit`  |  `ClusterName`  `ClusterName`, `InstanceId`, `NodeName`   |  The maximum number of CPU units that can be assigned to a single node in this cluster.  | 
|  `node_cpu_reserved_capacity`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker. Formula: `node_cpu_request / node_cpu_limit`  `node_cpu_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `node_cpu_usage_total`  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  The number of CPU units being used on the nodes in the cluster.  | 
|  `node_cpu_utilization`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The total percentage of CPU units being used on the nodes in the cluster. Formula: `node_cpu_usage_total / node_cpu_limit`  | 
|  `node_filesystem_utilization`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The total percentage of file system capacity being used on nodes in the cluster. Formula: `node_filesystem_usage / node_filesystem_capacity`  `node_filesystem_usage` and `node_filesystem_capacity` are not reported directly as metrics, but are fields in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `node_memory_limit`  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster.  | 
|  `node_filesystem_inodes`  It is not available on Windows.  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The total number of inodes (used and unused) on a node.  | 
|  `node_filesystem_inodes_free` It is not available on Windows.  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The number of unused inodes on a node.  | 
|  `node_gpu_limit` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  The total number of GPU(s) available on the node.  | 
|  `node_gpu_usage_total` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  The number of GPU(s) being used by the running pods on the node.  | 
|  `node_gpu_reserved_capacity` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  | 
|  `node_memory_reserved_capacity`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The percentage of memory currently being used on the nodes in the cluster. Formula: `node_memory_request / node_memory_limit`  `node_memory_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `node_memory_utilization`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The percentage of memory currently being used by the node or nodes. It is the percentage of node memory usage divided by the node memory limitation. Formula: `node_memory_working_set / node_memory_limit`.   | 
|  `node_memory_working_set`  |  `ClusterName`  `ClusterName`, `InstanceId`, `NodeName`   |  The amount of memory, in bytes, being used in the working set of the nodes in the cluster.  | 
|  `node_network_total_bytes`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The total number of bytes per second transmitted and received over the network per node in a cluster. Formula: `node_network_rx_bytes + node_network_tx_bytes`  `node_network_rx_bytes` and `node_network_tx_bytes` are not reported directly as metrics, but are fields in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `node_number_of_running_containers`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The number of running containers per node in a cluster.  | 
|  `node_number_of_running_pods`  |  `NodeName`, `ClusterName`, `InstanceId` `ClusterName`  |  The number of running pods per node in a cluster.  | 
|  `node_status_allocatable_pods`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The number of pods that can be assigned to a node based on its allocatable resources, which is defined as the remainder of a node's capacity after accounting for system daemons reservations and hard eviction thresholds.  | 
|  `node_status_capacity_pods`  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The number of pods that can be assigned to a node based on its capacity.  | 
|  `node_status_condition_ready`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  Indicates whether the node status condition `Ready` is true for Amazon EC2 nodes.  | 
|  `node_status_condition_memory_pressure`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  Indicates whether the node status condition `MemoryPressure` is true.  | 
|  `node_status_condition_pid_pressure`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  Indicates whether the node status condition `PIDPressure` is true.  | 
|  `node_status_condition_disk_pressure`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  Indicates whether the node status condition `OutOfDisk` is true.  | 
|  `node_status_condition_unknown`   |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  Indicates whether any of the node status conditions are Unknown.  | 
|  `node_interface_network_rx_dropped`  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The number of packets which were received and subsequently dropped by a network interface on the node.  | 
|  `node_interface_network_tx_dropped`  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The number of packets which were due to be transmitted but were dropped by a network interface on the node.  | 
|  `node_diskio_io_service_bytes_total`  It is not available on Windows or on ROSA clusters.  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The total number of bytes transferred by all I/O operations on the node.  | 
|  `node_diskio_io_serviced_total` It is not available on Windows or on ROSA clusters.  |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`   |  The total number of I/O operations on the node.  | 
|  `pod_cpu_reserved_capacity`  |  `PodName`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  `ClusterName`, `Namespace`, `Service`   |  The CPU capacity that is reserved per pod in a cluster. Formula: `pod_cpu_request / node_cpu_limit`  `pod_cpu_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_cpu_utilization`  |  `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The percentage of CPU units being used by pods. Formula: `pod_cpu_usage_total / node_cpu_limit`  | 
|  `pod_cpu_utilization_over_pod_limit`  |  `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The percentage of CPU units being used by pods relative to the pod limit. Formula: `pod_cpu_usage_total / pod_cpu_limit`  | 
|  `pod_memory_reserved_capacity`  |  `PodName`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  `ClusterName`, `Namespace`, `Service`   |  The percentage of memory that is reserved for pods. Formula: `pod_memory_request / node_memory_limit`  `pod_memory_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_memory_utilization`  |  `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The percentage of memory currently being used by the pod or pods. Formula: `pod_memory_working_set / node_memory_limit`  | 
|  `pod_memory_utilization_over_pod_limit`  |  `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: `pod_memory_working_set / pod_memory_limit`  | 
|  `pod_network_rx_bytes`  |  `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The number of bytes per second being received over the network by the pod. Formula: `sum(pod_interface_network_rx_bytes)`  `pod_interface_network_rx_bytes` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_network_tx_bytes`  |  `PodName`, `Namespace`, `ClusterName` `Namespace,` `ClusterName` `Service`, `Namespace`, `ClusterName` `ClusterName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`   |  The number of bytes per second being transmitted over the network by the pod. Formula: `sum(pod_interface_network_tx_bytes)`  `pod_interface_network_tx_bytes` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_cpu_request`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The CPU requests for the pod. Formula: `sum(container_cpu_request)`  `pod_cpu_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_memory_request`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The memory requests for the pod. Formula: `sum(container_memory_request)`  `pod_memory_request` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_cpu_limit`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear.  Formula: `sum(container_cpu_limit)`  `pod_cpu_limit` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_memory_limit`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.  Formula: `sum(container_memory_limit)`  `pod_cpu_limit` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `pod_status_failed`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system.   | 
|  `pod_status_ready`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that all containers in the pod are ready, having reached the condition of `ContainerReady`.   | 
|  `pod_status_running`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that all containers in the pod are running.   | 
|  `pod_status_scheduled`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that the pod has been scheduled to a node.   | 
|  `pod_status_unknown`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that status of the pod can't be obtained.   | 
|  `pod_status_pending`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet.   | 
|  `pod_status_succeeded`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Indicates that all containers in the pod have successfully terminated and will not be restarted.   | 
|  `pod_number_of_containers`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers defined in the pod specification.   | 
|  `pod_number_of_running_containers`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are currently in the `Running` state.   | 
|  `pod_container_status_terminated`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are in the `Terminated` state.   | 
|  `pod_container_status_running`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are in the `Running` state.   | 
|  `pod_container_status_waiting`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are in the `Waiting` state.   | 
|  `pod_container_status_waiting_reason_crash_loop_back_off`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are pending because of a `CrashLoopBackOff` error, where a container repeatedly fails to start.  | 
|  `pod_container_status_waiting_reason_create_container_config_error`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are pending with the reason `CreateContainerConfigError`. This is because of an error while creating the container configuration.  | 
|  `pod_container_status_waiting_reason_create_container_error`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are pending with the reason `CreateContainerError` because of an error while creating the container.  | 
|  `pod_container_status_waiting_reason_image_pull_error`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are pending because of `ErrImagePull`, `ImagePullBackOff`, or `InvalidImageName`. These situations are because of an error while pulling the container image.  | 
|  `pod_container_status_waiting_reason_start_error`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  Reports the number of containers in the pod which are pending with the reason being `StartError` because of an error while starting the container.  | 
|  `pod_container_status_terminated_reason_oom_killed`   |  `ContainerName`, `FullPodName`, `PodName`, `Namespace`, `ClusterName` `ContainerName`, `PodName`, `Namespace`, `ClusterName` `ClusterName`  |  Indicates a pod was terminated for exceeding the memory limit. This metric is only displayed when this issue occurs.  | 
|  `pod_interface_network_rx_dropped`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The number of packets which were received and subsequently dropped a network interface for the pod.   | 
|  `pod_interface_network_tx_dropped`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName` `Namespace`, `ClusterName`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  |  The number of packets which were due to be transmitted but were dropped for the pod.   | 
| `pod_memory_working_set` |  `ClusterName` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  | The memory in bytes that is currently being used by a pod. | 
| `pod_cpu_usage_total` |  `ClusterName` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName`  | The number of CPU units used by a pod. | 
|  `container_cpu_utilization`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName`, `ContainerName` `PodName`, `Namespace`, `ClusterName`, `ContainerName`, `FullPodName`  |  The percentage of CPU units being used by the container. Formula: `container_cpu_usage_total / node_cpu_limit`  `container_cpu_utilization` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `container_cpu_utilization_over_container_limit`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName`, `ContainerName` `PodName`, `Namespace`, `ClusterName`, `ContainerName`, `FullPodName`  |  The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear. Formula: `container_cpu_usage_total / container_cpu_limit`  `container_cpu_utilization_over_container_limit` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `container_memory_utilization`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName`, `ContainerName` `PodName`, `Namespace`, `ClusterName`, `ContainerName`, `FullPodName`  |  The percentage of memory units being used by the container. Formula: `container_memory_working_set / node_memory_limit`  `container_memory_utilization` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `container_memory_utilization_over_container_limit`    |  `ClusterName` `PodName`, `Namespace`, `ClusterName`, `ContainerName` `PodName`, `Namespace`, `ClusterName`, `ContainerName`, `FullPodName`  |  The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear. Formula: `container_memory_working_set / container_memory_limit`  `container_memory_utilization_over_container_limit` is not reported directly as a metric, but is a field in performance log events. For more information, see [Relevant fields in performance log events for Amazon EKS and Kubernetes](Container-Insights-reference-performance-entries-EKS.md).    | 
|  `container_memory_failures_total`  It is not available on Windows.  |  `ClusterName` `PodName`, `Namespace`, `ClusterName`, `ContainerName` `PodName`, `Namespace`, `ClusterName`, `ContainerName`, `FullPodName`  |  The number of memory allocation failures experienced by the container.  | 
|  `pod_number_of_container_restarts`  |  PodName, `Namespace`, `ClusterName`  |  The total number of container restarts in a pod.  | 
|  `service_number_of_running_pods`  |  Service, `Namespace`, `ClusterName` `ClusterName`  |  The number of pods running the service or services in the cluster.  | 
|  `replicas_desired`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName`  |  The number of pods desired for a workload as defined in the workload specification.  | 
|  `replicas_ready`   |  `ClusterName` `PodName`, `Namespace`, `ClusterName`  |  The number of pods for a workload that have reached the ready status.  | 
|  `status_replicas_available`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName`  |  The number of pods for a workload which are available. A pod is available when it has been ready for the `minReadySeconds` defined in the workload specification.  | 
|  `status_replicas_unavailable`  |  `ClusterName` `PodName`, `Namespace`, `ClusterName`  |  The number of pods for a workload which are unavailable. A pod is available when it has been ready for the `minReadySeconds` defined in the workload specification. Pods are unavailable if they have not met this criterion.  | 
|  `apiserver_storage_objects`  |  `ClusterName` `ClusterName`, `resource`  |  The number of objects stored in etcd at the time of the last check.  | 
|  `apiserver_storage_db_total_size_in_bytes`  |  `ClusterName` `ClusterName`, `endpoint`  |  Total size of the storage database file physically allocated in bytes. This metric is experimental and might change in future releases of Kubernetes. Unit: Bytes Meaningful statistics: Sum, Average, Minimum, Maximum  | 
|  `apiserver_request_total`  |  `ClusterName` `ClusterName`, `code`, `verb`  |  The total number of API requests to the Kubernetes API server.  | 
|  `apiserver_request_duration_seconds`  |  `ClusterName` `ClusterName`, `verb`  |  Responce latency for API requests to the Kubernetes API server.  | 
|  `apiserver_admission_controller_admission_duration_seconds`  |  `ClusterName` `ClusterName`, `operation`  |  Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server.  | 
|  `rest_client_request_duration_seconds`   |  `ClusterName` `ClusterName`, `operation`  |  Reponse latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `rest_client_requests_total`   |  `ClusterName` `ClusterName`, `code`, `method`  |  The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `etcd_request_duration_seconds`   |  `ClusterName` `ClusterName`, `operation`  |  Response latency of API calls to Etcd. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `apiserver_storage_size_bytes`   |  `ClusterName` `ClusterName`, `endpoint`  |  Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `apiserver_longrunning_requests`  |  `ClusterName` `ClusterName`, `resource`  |  The number of active long-running requests to the Kubernetes API server.  | 
|  `apiserver_current_inflight_requests`  |  `ClusterName` `ClusterName`, `request_kind`  |  The number of requests that are being processed by Kubernetes API server.  | 
|  `apiserver_admission_webhook_admission_duration_seconds`  |  `ClusterName` `ClusterName`, `name`  |  Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and do something with them.  | 
|  `apiserver_admission_step_admission_duration_seconds`   |  `ClusterName` `ClusterName`, `operation`  |  Admission sub-step latency in seconds.  | 
|  `apiserver_requested_deprecated_apis`   |  `ClusterName` `ClusterName`, `group`  |  Number of requests to deprecated APIs on the Kubernetes API server.  | 
|  `apiserver_request_total_5xx`  |  `ClusterName` `ClusterName`, `code`, `verb`  |  Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code.  | 
|  `apiserver_storage_list_duration_seconds`   |  `ClusterName` `ClusterName`, `resource`  |  Response latency of listing objects from Etc. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `apiserver_flowcontrol_request_concurrency_limit`   |  `ClusterName` `ClusterName`, `priority_level`  |  The number of threads used by the currently executing requests in the API Priority and Fairness subsystem.  | 
|  `apiserver_flowcontrol_rejected_requests_total`   |  `ClusterName` `ClusterName`, `reason`  |  Number of requests rejected by API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes.  | 
|  `apiserver_current_inqueue_requests`   |  `ClusterName` `ClusterName`, `request_kind`  |  The number queued requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.  | 

## NVIDIA GPU metrics


Beginning with version `1.300034.0` of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version `v1.3.0-eksbuild.1` or later. For more information, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md). These NVIDIA GPU metrics that are collected are listed in the table in this section. 

For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:
+ You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version `v1.3.0-eksbuild.1` or later.
+ [The NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin) must be installed in the cluster.
+ [The NVIDIA container toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.

You can opt out of collecting NVIDIA GPU metrics by setting the `accelerated_compute_metrics` option in the beginn CloudWatch agent configuration file to `false`. For more information and an example opt-out configuration, see [(Optional) Additional configuration](install-CloudWatch-Observability-EKS-addon.md#install-CloudWatch-Observability-EKS-addon-configuration).


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `container_gpu_memory_total` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The total frame buffer size, in bytes, on the GPU(s) allocated to the container.  | 
|  `container_gpu_memory_used` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The bytes of frame buffer used on the GPU(s) allocated to the container.  | 
|  `container_gpu_memory_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The percentage of frame buffer used of the GPU(s) allocated to the container.  | 
|  `container_gpu_power_draw` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The power usage in watts of the GPU(s) allocated to the container.  | 
|  `container_gpu_temperature` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The temperature in degrees celsius of the GPU(s) allocated to the container.  | 
|  `container_gpu_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The percentage utilization of the GPU(s) allocated to the container.  | 
|  `container_gpu_tensor_core_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The percentage utilization of the tensor cores on the GPU(s) allocated to the container.  | 
|  `node_gpu_memory_total` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The total frame buffer size, in bytes, on the GPU(s) allocated to the node.  | 
|  `node_gpu_memory_used` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The bytes of frame buffer used on the GPU(s) allocated to the node.  | 
|  `node_gpu_memory_utilization` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The percentage of frame buffer used on the GPU(s) allocated to the node.  | 
|  `node_gpu_power_draw` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The power usage in watts of the GPU(s) allocated to the node.  | 
|  `node_gpu_temperature` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The temperature in degrees celsius of the GPU(s) allocated to the node.  | 
|  `node_gpu_utilization` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The percentage utilization of the GPU(s) allocated to the node.  | 
|  `node_gpu_tensor_core_utilization` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `GpuDevice`  |  The percentage utilization of the tensor cores on the GPU(s) allocated to the node.  | 
|  `pod_gpu_memory_total` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`. `GpuDevice`  |  The total frame buffer size, in bytes, on the GPU(s) allocated to the pod.  | 
|  `pod_gpu_memory_used` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`. `GpuDevice`  |  The bytes of frame buffer used on the GPU(s) allocated to the pod.  | 
|  `pod_gpu_memory_utilization` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`. `GpuDevice`  |  The percentage of frame buffer used of the GPU(s) allocated to the pod.  | 
|  `pod_gpu_power_draw` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`. `GpuDevice`  |  The power usage in watts of the GPU(s) allocated to the pod.  | 
|  `pod_gpu_temperature` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`. `GpuDevice`  |  The temperature in degrees Celsius of the GPU(s) allocated to the pod.  | 
|  `pod_gpu_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The percentage utilization of the GPU(s) allocated to the pod.  | 
|  `pod_gpu_tensor_core_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `GpuDevice`  |  The percentage utilization of the tensor cores on the GPU(s) allocated to the pod.  | 

### Detailed GPU monitoring


Beginning with version `1.300062.0` of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS supports detailed GPU monitoring with sub-minute collection intervals. This addresses monitoring gaps for short-duration machine learning inference workloads that may be completely missed by standard collection intervals. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version `v4.7.0-eksbuild.1` or later. For more information, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md).

By default, GPU metrics are collected and ingested at 60-second intervals. With detailed monitoring enabled, the CloudWatch agent collect GPU metrics at sub-minute intervals (minimum 1 second), but metrics are still ingested to CloudWatch at 1-minute intervals. However, you can query statistical aggregations (such as minimum, maximum, and percentiles like p90) of the sub-minute datapoints within each 1-minute period, providing accurate GPU utilization data and better resource optimization.

#### Configuration


To enable detailed GPU monitoring, update your CloudWatch agent configuration to include the `accelerated_compute_gpu_metrics_collection_interval` parameter in the `kubernetes` section, as in the following example.

```
{  
    "logs": {  
        "metrics_collected": {  
            "kubernetes": {  
                "cluster_name": "MyCluster",  
                "enhanced_container_insights": true,  
                "accelerated_compute_metrics": true,  
                "accelerated_compute_gpu_metrics_collection_interval": 1  
            }  
        }  
    }  
}
```

The `accelerated_compute_gpu_metrics_collection_interval` parameter accepts values in seconds, with a minimum value of 1 second. Setting it to `1` enables 1-second collection intervals. If this parameter is not specified, the default 60-second interval is used.

For complete configuration instructions, see [Setting up the CloudWatch agent to collect cluster metrics](Container-Insights-setup-metrics.md).

## AWS Neuron metrics for AWS Trainium and AWS Inferentia


Beginning with version `1.300036.0` of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects accelerated computing metrics from AWS Trainium and AWS Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version `v1.5.0-eksbuild.1` or later. For more information about the add-on, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md). For more information about AWS Trainium, see [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/). For more information about AWS Inferentia, see [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/).

For Container Insights to collect AWS Neuron metrics, you must meet the following prerequisites:
+ You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version `v1.5.0-eksbuild.1` or later.
+ The [Neuron driver](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22) must be installed on the nodes of the cluster.
+ The [Neuron device plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html) must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.

The metrics that are collected are listed in the table in this section. The metrics are collected for AWS Trainium, AWS Inferentia, and AWS Inferentia2.

The CloudWatch agent collects these metrics from the [Neuron monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and does the necessary Kubernetes resource correlation to deliver metrics at the pod and container levels


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `container_neuroncore_utilization` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  NeuronCore utilization, during the captured period of the NeuronCore allocated to the container. Unit: Percent  | 
|  `container_neuroncore_memory_usage_constants` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for constants during training by the NeuronCore that is allocated to the container (or weights during inference). Unit: Bytes  | 
|  `container_neuroncore_memory_usage_model_code` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container. Unit: Bytes  | 
|  `container_neuroncore_memory_usage_model_shared_scratchpad` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the scratchpad shared of the models by the NeuronCore that is allocated to the container. This memory region is reserved for the models. Unit: Bytes  | 
|  `container_neuroncore_memory_usage_runtime_memory` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container. Unit: Bytes  | 
|  `container_neuroncore_memory_usage_tensors` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for tensors by the NeuronCore allocated to the container. Unit: Bytes  | 
|  `container_neuroncore_memory_usage_total` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`, `NeuronCore`  |  The total amount of memory used by the NeuronCore allocated to the container. Unit: Bytes  | 
|  `container_neurondevice_hw_ecc_events_total` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NeuronDevice`  |  The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count  | 
|  `pod_neuroncore_utilization` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod. Unit: Percent  | 
|  `pod_neuroncore_memory_usage_constants` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for constants during training by the NeuronCore that is allocated to the pod (or weights during inference). Unit: Bytes  | 
|  `pod_neuroncore_memory_usage_model_code` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod. Unit: Bytes  | 
|  `pod_neuroncore_memory_usage_model_shared_scratchpad` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the scratchpad shared of the models by the NeuronCore that is allocated to the pod. This memory region is reserved for the models. Unit: Bytes  | 
|  `pod_neuroncore_memory_usage_runtime_memory` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod. Unit: Bytes  | 
|  `pod_neuroncore_memory_usage_tensors` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for tensors by the NeuronCore allocated to the pod. Unit: Bytes  | 
|  `pod_neuroncore_memory_usage_total` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`, `NeuronCore`  |  The total amount of memory used by the NeuronCore allocated to the pod. Unit: Bytes  | 
|  `pod_neurondevice_hw_ecc_events_total` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NeuronDevice`  |  The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to a pod. Unit: Bytes  | 
|  `node_neuroncore_utilization` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The NeuronCore utilization during the captured period of the NeuronCore allocated to the node. Unit: Percent  | 
|  `node_neuroncore_memory_usage_constants` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for constants during training by the NeuronCore that is allocated to the node (or weights during inference). Unit: Bytes  | 
|  `node_neuroncore_memory_usage_model_code` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for models' executable code by the NeuronCore that is allocated to the node. Unit: Bytes  | 
|  `node_neuroncore_memory_usage_model_shared_scratchpad` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the scratchpad shared of the models by the NeuronCore that is allocated to the node. This is a memory region reserved for the models. Unit: Bytes  | 
|  `node_neuroncore_memory_usage_runtime_memory` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node. Unit: Bytes  | 
|  `node_neuroncore_memory_usage_tensors` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The amount of device memory used for tensors by the NeuronCore that is allocated to the node. Unit: Bytes  | 
|  `node_neuroncore_memory_usage_total` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceType`, `InstanceId`, `NodeName`, `NeuronDevice`, `NeuronCore`  |  The total amount of memory used by the NeuronCore that is allocated to the node. Unit: Bytes  | 
|  `node_neuron_execution_errors_total` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName`  |  The total number of execution errors on the node. This is calculated by the CloudWatch agent by aggregating the errors of the following types: `generic`, `numerical`, `transient`, `model`, `runtime`, and `hardware` Unit: Count  | 
|  `node_neurondevice_runtime_memory_used_bytes` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName`  |  The total Neuron device memory usage in bytes on the node. Unit: Bytes  | 
| `node_neuron_execution_latency` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName`  |  In seconds, the latency for an execution on the node as measured by the Neuron runtime. Unit: Seconds  | 
| `node_neurondevice_hw_ecc_events_total` |  `ClusterName` `ClusterName`, `UltraServer` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `NodeName`, `NeuronDevice`  |  The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count  | 

## AWS Elastic Fabric Adapter (EFA) metrics


Beginning with version `1.300037.0` of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects AWS Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version `v1.5.2-eksbuild.1` or later. For more information about the add-on, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md). For more information about AWS Elastic Fabric Adapter, see [Elastic Fabric Adapter](https://aws.amazon.com/hpc/efa/).

For Container Insights to collect AWS Elastic Fabric adapter metrics, you must meet the following prerequisites:
+ You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version `v1.5.2-eksbuild.1` or later.
+ The EFA device plugin must be installed on the cluster. For more information, see [aws-efa-k8s-device-plugin](https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin) on GitHub.

The metrics that are collected are listed in the following table. 


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `container_efa_rx_bytes` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of bytes per second received by the EFA device(s) allocated to the container. Unit: Bytes/Second  | 
|  `container_efa_tx_bytes` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of bytes per second transmitted by the EFA device(s) allocated to the container. Unit: Bytes/Second  | 
|  `container_efa_rx_dropped` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of packets that were received and then dropped by the EFA device(s) allocated to the container. Unit: Count/Second  | 
|  `container_efa_rdma_read_bytes` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second  | 
|  `container_efa_rdma_write_bytes` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second  | 
|  `container_efa_rdma_write_recv_bytes` |  `ClusterName` `ClusterName`, `Namespace`, `PodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `ContainerName`, `NetworkInterfaceId`   |  The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container. Unit: Bytes/Second  | 
|  `pod_efa_rx_bytes` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of bytes per second received by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `pod_efa_tx_bytes` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of bytes per second transmitted by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `pod_efa_rx_dropped` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of packets that were received and then dropped by the EFA device(s) allocated to the pod. Unit: Count/Second  | 
|  `pod_efa_rdma_read_bytes` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `pod_efa_rdma_write_bytes` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `pod_efa_rdma_write_recv_bytes` |  `ClusterName` `ClusterName`, `Namespace` `ClusterName`, `Namespace`, `Service` `ClusterName`, `Namespace`, `PodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName` `ClusterName`, `Namespace`, `PodName`, `FullPodName`, `NetworkInterfaceId`  |  The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `node_efa_rx_bytes` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of bytes per second received by the EFA device(s) allocated to the node. Unit: Bytes/Second  | 
|  `node_efa_tx_bytes` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of bytes per second transmitted by the EFA device(s) allocated to the node. Unit: Bytes/Second  | 
|  `node_efa_rx_dropped` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of packets that were received and then dropped by the EFA device(s) allocated to the node. Unit: Count/Second  | 
|  `node_efa_rdma_read_bytes` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second  | 
|  `node_efa_rdma_write_bytes` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second  | 
|  `node_efa_rdma_write_recv_bytes` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName` `ClusterName`, `InstanceId`, `InstanceType`, `NodeName`, `NetworkInterfaceId`  |  The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node. Unit: Bytes/Second  | 

## Amazon SageMaker AI HyperPod metrics


Beginning with version `v2.0.1-eksbuild.1` of the CloudWatch Observability EKS add-on, Container Insights with enhanced observability for Amazon EKS automatically collects Amazon SageMaker AI HyperPod metrics from Amazon EKS clusters. For more information about the add-on, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md). For more information about Amazon SageMaker AI HyperPod, see [Amazon SageMaker AI HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html).

The metrics that are collected are listed in the following table. 


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `hyperpod_node_health_status_unschedulable` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  Indicates if a node is labeled as `Unschedulable` by Amazon SageMaker AI HyperPod. This means that the node is running deep health checks and is not available for running workloads. Unit: Count  | 
|  `hyperpod_node_health_status_schedulable` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  Indicates if a node is labeled as `Schedulable` by Amazon SageMaker AI HyperPod. This means that the node has passed basic health checks or deep health checks and is available for running workloads. Unit: Count  | 
|  `hyperpod_node_health_status_unschedulable_pending_replacement` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  Indicates if a node is labeled as `UnschedulablePendingReplacement` by HyperPod. This means that the node has failed deep health checks or health monitoring agent checks and requires a replacement. If automatic node recovery is enabled, the node will be automatically replaced by Amazon SageMaker AI HyperPod. Unit: Count  | 
|  `hyperpod_node_health_status_unschedulable_pending_reboot` |  `ClusterName` `ClusterName`, `InstanceId`, `NodeName`  |  Indicates if a node is labeled as `UnschedulablePendingReboot` by Amazon SageMaker AI HyperPod. This means that the node is running deep health checks and requires a reboot. If automatic node recovery is enabled, the node will be automatically rebooted by Amazon SageMaker AI HyperPod. Unit: Count  | 

## Amazon EBS NVMe driver metrics


Beginning with version ` 1.300056.0` of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS automatically collects Amazon EBS NVMe driver metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability Amazon EKS add-on version `4.1.0` or later. For more information about the add-on, see [Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart](install-CloudWatch-Observability-EKS-addon.md). For more information about Amazon EBS, see [Amazon EBS detailed performance statistics](https://docs.aws.amazon.com/ebs/latest/userguide/nvme-detailed-performance-stats.html).

For Container Insights to collect Amazon EBS NVMe driver metrics, you must meet the following prerequisites:
+ You must be using Container Insights with enhanced observability for Amazon EKS, with the CloudWatch Observability Amazon EKS add-on version `4.1.0` or later.
+ The EBS CSI driver `1.42.0` add-on or Helm chart must be installed on the cluster with metrics enabled.
  + To enable the metrics when you are using Amazon EBS CSI driver add-on, use the following option when you create or update the add-on. `--configuration-values '{ "node": { "enableMetrics": true } }'`
  + To enable the metrics if you are using Helm chart, use the following option when you create or update the add-on. `--set node.enableMetrics=true`

The metrics that are collected are listed in the following table. 


| Metric name | Dimensions | Description | 
| --- | --- | --- | 
|  `node_diskio_ebs_total_read_ops` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total number of completed read operations. | 
|  `node_diskio_ebs_total_write_ops` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total number of completed write operations. | 
|  `node_diskio_ebs_total_read_bytes` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total number of read bytes transferred. | 
|  `node_diskio_ebs_total_write_bytes` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total number of write bytes transferred. | 
|  `node_diskio_ebs_total_read_time` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time spent, in microseconds, by all completed read operations. | 
|  `node_diskio_ebs_total_write_time` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time spent, in microseconds, by all completed write operations. | 
|  `node_diskio_ebs_volume_performance_exceeded_iops` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time, in microseconds, that IOPS demand exceeded the volume's provisioned IOPS performance. | 
|  `node_diskio_ebs_volume_performance_exceeded_tp` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time, in microseconds, that throughput demand exceeded the volume's provisioned throughput performance. | 
|  `node_diskio_ebs_ec2_instance_performance_exceeded_iops` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time, in microseconds, that the EBS volume exceeded the attached Amazon EC2 instance's maximum IOPS performance. | 
|  `node_diskio_ebs_ec2_instance_performance_exceeded_tp` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The total time, in microseconds, that the EBS volume exceeded the attached Amazon EC2 instance's maximum throughput performance. | 
|  `node_diskio_ebs_volume_queue_length` |  `ClusterName` `ClusterName`, `NodeName`, `InstanceId` `ClusterName`, `NodeName`, `InstanceId` `VolumeId`  | The number of read and write operations waiting to be completed. |