SageMaker HyperPod cluster metrics
Amazon SageMaker HyperPod (SageMaker HyperPod) publishes various metrics across 9 distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories have additional metrics that can be enabled for more granular cluster information, and where they appear in the Amazon Managed Grafana workspace.
Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? |
---|---|---|---|
Training metrics | Yes | Yes | Training |
Inference metrics | Yes | No | Inference |
Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
Cluster metrics | Yes | Yes | Cluster |
Instance metrics | Yes | Yes | Cluster |
Accelerated compute metrics | Yes | Yes | Task, Cluster |
Network metrics | No | Yes | Cluster |
File system metrics | Yes | No | File system |
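For the categories with no prebuilt Grafana dashboard (task governance and scaling), you can chart metrics by querying the Prometheus-compatible HTTP API of your Amazon Managed Service for Prometheus workspace directly. The following is a minimal sketch of an instant query from Python, signed with SigV4; the workspace ID, Region, and the example Kueue metric name are placeholders to replace with your own values, and the `botocore` signing calls assume standard AWS credentials are available.

```python
import json
import urllib.parse
import urllib.request

# Placeholder values -- replace with your own workspace ID and Region.
WORKSPACE_ID = "ws-example-1234"
REGION = "us-west-2"


def build_query_url(workspace_id: str, region: str, promql: str) -> str:
    """Build the instant-query URL for an Amazon Managed Service for
    Prometheus workspace."""
    base = (f"https://aps-workspaces.{region}.amazonaws.com"
            f"/workspaces/{workspace_id}/api/v1/query")
    return base + "?" + urllib.parse.urlencode({"query": promql})


def query_workspace(promql: str) -> dict:
    """Send a SigV4-signed instant query and return the decoded JSON."""
    # botocore is imported here so the URL helper stays dependency-free.
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest
    from botocore.session import Session

    url = build_query_url(WORKSPACE_ID, REGION, promql)
    request = AWSRequest(method="GET", url=url)
    # "aps" is the SigV4 signing name for Amazon Managed Service for Prometheus.
    SigV4Auth(Session().get_credentials(), "aps", REGION).add_auth(request)
    http_request = urllib.request.Request(url, headers=dict(request.headers.items()))
    with urllib.request.urlopen(http_request) as response:
        return json.load(response)


# Example usage (task governance category; metric name assumed from the
# Kueue metrics reference):
#   result = query_workspace("kueue_admitted_active_workloads")
#   print(json.dumps(result, indent=2))
```

The same pattern works for any metric in the tables below; only the PromQL string changes.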
The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.
Training metrics
Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
Kubeflow metrics | See https://github.com/kubeflow/trainer | Yes | Kubeflow |
Kubernetes pod metrics | See https://github.com/kubernetes/kube-state-metrics | Yes | Kubernetes |
training_uptime_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator |
training_manual_recovery_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator |
training_manual_downtime_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator |
training_auto_recovery_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator |
training_auto_recovery_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator |
training_fault_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator |
training_fault_type_count | Distribution of faults by type | No | SageMaker HyperPod training operator |
training_fault_recovery_time_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator |
training_time_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator |
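The training operator counters above can be combined into a simple availability figure. The sketch below derives one from training_time_ms, training_manual_downtime_ms, and training_auto_recovery_downtime; treating the observation window as training time plus all recorded downtime is an assumption of this sketch, not a definition from the operator.

```python
def training_availability(training_time_ms: float,
                          manual_downtime_ms: float,
                          auto_downtime_ms: float) -> float:
    """Fraction of the observed window spent in actual training.

    The window is assumed to be training time plus all recorded
    downtime (manual interventions plus auto-recovery overhead).
    """
    window_ms = training_time_ms + manual_downtime_ms + auto_downtime_ms
    if window_ms == 0:
        return 0.0
    return training_time_ms / window_ms


# 9 hours of training with 30 minutes of manual downtime and
# 30 minutes of auto-recovery overhead -> 0.9
print(training_availability(9 * 3_600_000, 1_800_000, 1_800_000))
```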
Inference metrics
Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
model_invocations_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator |
model_errors_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator |
model_concurrent_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator |
model_latency_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator |
model_ttfb_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator |
TGI | Use these metrics to monitor TGI performance, auto-scale deployments, and identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md | Yes | Model container |
LMI | Use these metrics to monitor LMI performance and identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md | Yes | Model container |
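The model_errors_total and model_invocations_total counters above combine into a standard PromQL error-rate expression. The helper below is a hypothetical convenience for assembling that query string; the metric names come from the table, and the `sum(rate(...))` shape is the conventional way to turn monotonic counters into a rate.

```python
def error_rate_query(window: str = "5m") -> str:
    """PromQL for the fraction of model invocations that errored over
    the trailing window, using the inference operator counters above."""
    return (f"sum(rate(model_errors_total[{window}])) / "
            f"sum(rate(model_invocations_total[{window}]))")


print(error_rate_query())
# sum(rate(model_errors_total[5m])) / sum(rate(model_invocations_total[5m]))
```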
Task governance metrics
Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
Kueue | See https://kueue.sigs.k8s.io/docs/reference/metrics/ | No | Kueue |
Scaling metrics
Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
KEDA operator metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#operator | No | Kubernetes Event-driven Autoscaler (KEDA) |
KEDA webhook metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks | No | Kubernetes Event-driven Autoscaler (KEDA) |
KEDA metrics server metrics | See https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server | No | Kubernetes Event-driven Autoscaler (KEDA) |
Cluster metrics
Use these metrics to monitor overall cluster health and resource allocation.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
Cluster health | Kubernetes API server metrics. See https://kubernetes.io/docs/reference/instrumentation/metrics/ | Yes | Kubernetes |
Kube-state metrics | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources | Limited | Kubernetes |
Kube-state metrics (advanced) | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources | No | Kubernetes |
Instance metrics
Use these metrics to monitor individual instance performance and health.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
Node metrics | See https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default | Yes | Kubernetes |
Container metrics | Container metrics exposed by cAdvisor. See https://github.com/google/cadvisor | Yes | Kubernetes |
Accelerated compute metrics
Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
NVIDIA GPU | DCGM metrics. See https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv | Limited | NVIDIA Data Center GPU Manager (DCGM) |
NVIDIA GPU (advanced) | DCGM metrics that are commented out in the following CSV file: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv | No | NVIDIA Data Center GPU Manager (DCGM) |
AWS Trainium | Neuron metrics. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters | No | AWS Neuron Monitor |
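A common use of the DCGM metrics in this category is charting average GPU utilization per instance. The sketch below builds that PromQL expression; `DCGM_FI_DEV_GPU_UTIL` is a standard DCGM exporter field, but the grouping label (`instance` here) is an assumption that may differ in your scrape configuration.

```python
def gpu_utilization_query(group_label: str = "instance") -> str:
    """PromQL for average GPU utilization, grouped by the given label.

    DCGM_FI_DEV_GPU_UTIL is a standard DCGM exporter field; the default
    grouping label is an assumption for this sketch.
    """
    return f"avg by ({group_label}) (DCGM_FI_DEV_GPU_UTIL)"


print(gpu_utilization_query())
# avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
```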
Network metrics
Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
EFA | See https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md | No | Elastic Fabric Adapter |
File system metrics
Use these metrics to monitor the performance of the Amazon FSx for Lustre file systems attached to your cluster.
Metric name or type | Description | Enabled by default? | Metric source |
---|---|---|---|
File system | Amazon FSx for Lustre metrics from Amazon CloudWatch | Yes | Amazon FSx for Lustre |