SageMaker HyperPod cluster metrics

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes metrics in nine categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories offer additional metrics that you can enable for more granular cluster information, and where each category appears in the Amazon Managed Grafana workspace.

| Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? |
|---|---|---|---|
| Training metrics | Yes | Yes | Training |
| Inference metrics | Yes | No | Inference |
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
| Cluster metrics | Yes | Yes | Cluster |
| Instance metrics | Yes | Yes | Cluster |
| Accelerated compute metrics | Yes | Yes | Task, Cluster |
| Network metrics | No | Yes | Cluster |
| File system metrics | Yes | No | File system |
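
Task governance and scaling metrics are not surfaced on a prebuilt Grafana dashboard, so you query your Amazon Managed Service for Prometheus workspace directly to chart or alert on them. The following is a minimal sketch of such a query in Python, assuming SigV4-signed access to the workspace's Prometheus-compatible query API; the Region, workspace ID, and the choice of kueue_pending_workloads (a Kueue task governance metric listed later on this page) are placeholders to adapt to your setup.

```python
# Minimal sketch: run a PromQL instant query against an Amazon Managed
# Service for Prometheus workspace with a SigV4-signed request.
from urllib.parse import urlencode

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

REGION = "us-west-2"                 # placeholder: your workspace Region
WORKSPACE_ID = "ws-EXAMPLE11-2222-3333-4444-555555EXAMPLE"  # placeholder

def query_amp(expr: str) -> list:
    """Run a PromQL instant query and return the result series."""
    url = (
        f"https://aps-workspaces.{REGION}.amazonaws.com"
        f"/workspaces/{WORKSPACE_ID}/api/v1/query?{urlencode({'query': expr})}"
    )
    # Sign the request with credentials from the default chain; "aps" is
    # the service name Amazon Managed Service for Prometheus signs under.
    credentials = boto3.Session().get_credentials().get_frozen_credentials()
    request = AWSRequest(method="GET", url=url)
    SigV4Auth(credentials, "aps", REGION).add_auth(request)
    response = requests.get(url, headers=dict(request.headers), timeout=30)
    response.raise_for_status()
    return response.json()["data"]["result"]

if __name__ == "__main__":
    # Pending workloads per cluster queue (task governance category).
    for series in query_amp("kueue_pending_workloads"):
        print(series["metric"], series["value"])
```

The query_amp helper accepts any PromQL expression; the later examples on this page reuse it.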

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

Training metrics

Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| Kubeflow metrics | See https://github.com/kubeflow/trainer. | Yes | Kubeflow |
| Kubernetes pod metrics | See https://github.com/kubernetes/kube-state-metrics. | Yes | Kubernetes |
| training_uptime_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator |
| training_manual_recovery_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator |
| training_manual_downtime_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator |
| training_auto_recovery_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator |
| training_auto_recovery_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator |
| training_fault_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator |
| training_fault_type_count | Distribution of faults by type | No | SageMaker HyperPod training operator |
| training_fault_recovery_time_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator |
| training_time_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator |
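
The training operator metrics above are disabled by default. Once enabled, you can read them with the query_amp helper sketched earlier, for example to check the reported uptime percentage and the fault distribution. This is an illustrative sketch; the fault_type label name is an assumption, so inspect the series in your workspace for the labels actually attached.

```python
# Assumes the query_amp helper from the earlier sketch and that the
# SageMaker HyperPod training operator metrics have been enabled.

# Uptime percentage reported by the training operator.
for series in query_amp("training_uptime_percentage"):
    print(series["metric"], series["value"])

# Fault distribution by type, summed across jobs. The fault_type label
# name is an assumption; check your workspace for the labels actually
# attached to training_fault_type_count.
for series in query_amp("sum by (fault_type) (training_fault_type_count)"):
    print(series["metric"], series["value"])
```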

Inference metrics

Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.

Metric name or type Description Enabled by default? Metric source
model_invocations_total Total number of invocation requests to the model Yes SageMaker HyperPod inference operator
model_errors_total Total number of errors during model invocation Yes SageMaker HyperPod inference operator
model_concurrent_requests Active concurrent model requests Yes SageMaker HyperPod inference operator
model_latency_milliseconds Model invocation latency in milliseconds Yes SageMaker HyperPod inference operator
model_ttfb_milliseconds Model time to first byte latency in milliseconds Yes SageMaker HyperPod inference operator
TGI These metrics can be used to monitor the performance of TGI, auto-scale deployment and to help identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md. Yes Model container
LMI These metrics can be used to monitor the performance of LMI, and to help identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md. Yes Model container
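
Because model_invocations_total and model_errors_total are counters, a rate-based ratio of the two gives an error rate that is useful for dashboards or alerts. A short sketch, again reusing the hypothetical query_amp helper from the earlier example:

```python
# Assumes the query_amp helper from the earlier sketch.
# Fraction of model invocations that errored over the last 5 minutes.
error_ratio = (
    "sum(rate(model_errors_total[5m]))"
    " / sum(rate(model_invocations_total[5m]))"
)
for series in query_amp(error_ratio):
    print(series["metric"], series["value"])
```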

Task governance metrics

Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| Kueue | See https://kueue.sigs.k8s.io/docs/reference/metrics/. | No | Kueue |

Scaling metrics

Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| KEDA operator metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#operator. | No | Kubernetes Event-driven Autoscaler (KEDA) |
| KEDA admission webhooks metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks. | No | Kubernetes Event-driven Autoscaler (KEDA) |
| KEDA metrics server metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server. | No | Kubernetes Event-driven Autoscaler (KEDA) |

Cluster metrics

Use these metrics to monitor overall cluster health and resource allocation.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| Cluster health | Kubernetes API server metrics. See https://kubernetes.io/docs/reference/instrumentation/metrics/. | Yes | Kubernetes |
| Kubestate | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources. | Limited | Kubernetes |
| Kubestate advanced | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources. | No | Kubernetes |

Instance metrics

Use these metrics to monitor individual instance performance and health.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| Node metrics | See https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default. | Yes | Kubernetes |
| Container metrics | Container metrics exposed by cAdvisor. See https://github.com/google/cadvisor. | Yes | Kubernetes |

Accelerated compute metrics

Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| NVIDIA GPU | DCGM metrics. See https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv. | Limited | NVIDIA Data Center GPU Manager (DCGM) |
| NVIDIA GPU (advanced) | DCGM metrics that are commented out in https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv. | No | NVIDIA Data Center GPU Manager (DCGM) |
| AWS Trainium | Neuron metrics. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters. | No | AWS Neuron Monitor |
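
For example, DCGM_FI_DEV_GPU_UTIL is one of the default metrics in the DCGM exporter CSV linked above. Averaging it by the exporter's Hostname and gpu labels yields per-device utilization across the cluster; a sketch, once more reusing the hypothetical query_amp helper:

```python
# Assumes the query_amp helper from the earlier sketch. Hostname and
# gpu are default labels attached by the DCGM exporter.
for series in query_amp("avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL)"):
    print(series["metric"], series["value"])
```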

Network metrics

Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| EFA | See https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md. | No | Elastic Fabric Adapter |

File system metrics

Use these metrics to monitor the performance of the file systems attached to your cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
|---|---|---|---|
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch. See Monitoring with Amazon CloudWatch. | Yes | Amazon FSx for Lustre |