Observability

Tip

Explore best practices through Amazon EKS workshops.

Monitoring and Observability

GPU Metrics Explained

The GPU Utilization metric shows whether the GPU ran any work during the sample window. This metric captures the percentage of time the GPU executed at least one instruction, but it does not reveal how efficiently the GPU used its hardware. A GPU contains multiple Streaming Multiprocessors (SMs), which are the parallel processing units that execute instructions. A 100% utilization reading can mean the GPU ran heavy parallel workloads across all its SMs, or it can mean a single small instruction activated the GPU over the sample period. To understand actual utilization, you need to examine GPU metrics at multiple levels of the hardware architecture. Each Streaming Multiprocessor is built from different core types, and each layer exposes different performance characteristics. Top-level metrics (GPU Utilization, Memory Utilization, GPU Power, and GPU Temperature, visible through nvidia-smi) show whether the device is active. Deeper metrics (SM utilization, SM Activity, and tensor core usage) reveal how efficiently the GPU uses its resources.
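
For a quick, node-local view of these top-level signals, the following sketch reads them through the pynvml bindings (the same data nvidia-smi displays). It assumes the nvidia-ml-py package and an NVIDIA driver are present on the node; SM activity, occupancy, and tensor core usage require DCGM-based tooling such as the DCGM-Exporter discussed later.

import pynvml

# Snapshot the device-level metrics that nvidia-smi reports.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the node

util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent of the sample window with any work
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts

print(f"GPU util: {util.gpu}%  memory util: {util.memory}%  temp: {temp_c} C  power: {power_w:.0f} W")
pynvml.nvmlShutdown()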

Target high GPU power usage

Underutilized GPUs waste compute capacity and increase costs because workloads fail to engage all GPU components simultaneously. For AI/ML workloads on Amazon EKS, track GPU power usage as a proxy to identify actual GPU activity. GPU Utilization reports the percentage of time the GPU executes any kernel, but it does not reveal whether the Streaming Multiprocessors, memory controllers, and tensor cores are all active at the same time. Power usage exposes this gap because fully engaged hardware draws significantly more power than hardware running lightweight kernels or sitting idle between tasks. Compare power draw against the GPU’s thermal design power (TDP) to spot underutilization, then investigate whether your workload is bottlenecked by CPU preprocessing, network I/O, or inefficient batch sizes.
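
As a minimal sketch of this check, the snippet below compares instantaneous power draw against the enforced power limit (a close proxy for TDP unless the GPU is power-capped) using pynvml. The 50% threshold is illustrative and should be tuned for your workload.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # near TDP unless a lower cap is set

ratio = power_w / limit_w
print(f"Power draw {power_w:.0f} W of {limit_w:.0f} W limit ({ratio:.0%})")
if ratio < 0.5:  # illustrative threshold
    print("GPU draws well below its limit; check CPU preprocessing, I/O, or batch sizes")
pynvml.nvmlShutdown()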

Set up CloudWatch Container Insights on Amazon EKS to identify pods, nodes, or workloads with low GPU power consumption. This tool integrates directly with Amazon EKS and allows you to monitor GPU power consumption and adjust pod scheduling or instance types when power usage falls below your target levels. If you need advanced visualization or custom dashboards, use NVIDIA’s DCGM-Exporter with Prometheus and Grafana for Kubernetes-native monitoring. Both approaches surface key NVIDIA metrics like nvidia_smi_power_draw (GPU power consumption) and nvidia_smi_temperature_gpu (GPU temperature). For a list of metrics, see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-NVIDIA-GPU.html. Look for patterns such as consistently low power usage during specific hours or for particular jobs. These trends help you identify where to consolidate workloads or adjust resource allocation.
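
If you want to pull these metrics programmatically rather than browse dashboards, a minimal boto3 sketch is shown below. The CWAgent namespace and InstanceId dimension reflect common CloudWatch agent defaults and may differ in your configuration; the instance ID is a placeholder.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Average GPU power draw over the last hour, in five-minute buckets.
response = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",  # default namespace for CloudWatch agent GPU metrics; adjust if customized
    MetricName="nvidia_smi_power_draw",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Average']:.1f} W")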

Static resource limits in Kubernetes (such as CPU, memory, and GPU counts) often lead to over-provisioning or underutilization, especially for dynamic AI/ML workloads like inference where demand fluctuates. Analyze your utilization trends and consolidate workloads onto fewer GPUs. Ensure each GPU reaches full utilization before you allocate additional ones. This approach reduces waste and lowers costs. For detailed guidance on optimizing scheduling and sharing strategies, see the EKS Compute and Autoscaling best practices.

Observability and Metrics

Using Monitoring and Observability Tools for your AI/ML Workloads

Modern AI/ML services require coordination across infrastructure, modeling, and application logic. Platform engineers manage the infrastructure and observability stack. They collect, store, and visualize metrics. AI/ML engineers define model-specific metrics and monitor performance under varying load and data distribution. Application developers consume APIs, route requests, and track service-level metrics and user interactions. Without unified observability practices, these teams work in silos and miss critical signals about system health and performance. Establishing shared visibility across environments ensures all stakeholders can detect issues early and maintain reliable service.

Optimizing Amazon EKS clusters for AI/ML workloads presents unique monitoring challenges, especially around GPU memory management. Without proper monitoring, organizations face out-of-memory (OOM) errors, resource inefficiencies, and unnecessary costs. Effective monitoring ensures better performance, resilience, and lower costs for EKS customers. Use a holistic approach that combines three monitoring layers. First, monitor granular GPU metrics using NVIDIA DCGM Exporter to track GPU power usage, GPU temperature, SM activity, SM occupancy, and XID errors. Second, monitor inference serving frameworks like Ray and vLLM to gain distributed workload insights through their native metrics. Third, collect application-level insights to track custom metrics specific to your workload. This layered approach gives you visibility from hardware utilization through application performance.

Tools and frameworks

Several tools and frameworks provide native, out-of-the-box metrics for monitoring AI/ML workloads. These built-in metrics eliminate the need for custom instrumentation and reduce setup time. The metrics focus on performance aspects such as latency, throughput, and token generation, which are critical for inference serving and benchmarking. Using native metrics allows you to start monitoring immediately without building custom collection pipelines.

  • vLLM: A high-throughput serving engine for large language models (LLMs) that provides native metrics such as request latency and memory usage (a scrape sketch follows this list).

  • Ray: A distributed computing framework that emits metrics for scalable AI workloads, including task execution times and resource utilization.

  • Hugging Face Text Generation Inference (TGI): A toolkit for deploying and serving LLMs, with built-in metrics for inference performance.

  • NVIDIA genai-perf: A command-line tool for benchmarking generative AI models, measuring throughput, latency, and LLM-specific metrics, such as requests completed in specific time intervals.
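
To illustrate how little instrumentation these native metrics require, the sketch below scrapes the Prometheus-format endpoint that a vLLM server exposes at /metrics. The host, port, and metric-name prefixes are assumptions based on vLLM defaults and may vary by version; in a cluster you would normally let Prometheus or the CloudWatch agent scrape this endpoint instead.

import requests

# Fetch vLLM's native metrics (Prometheus text format) from its serving endpoint.
resp = requests.get("http://localhost:8000/metrics", timeout=5)  # adjust for your Service or port-forward
resp.raise_for_status()

# Print a few latency- and token-related series without a full Prometheus parser.
interesting_prefixes = ("vllm:time_to_first_token", "vllm:e2e_request_latency", "vllm:generation_tokens")
for line in resp.text.splitlines():
    if line.startswith(interesting_prefixes):
        print(line)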

Observability methods

We recommend implementing any additional observability mechanisms in one of the following ways.

CloudWatch Container Insights

If your organization prefers AWS-native tools with minimal setup, we recommend CloudWatch Container Insights. It integrates with the NVIDIA DCGM Exporter to collect GPU metrics and offers pre-built dashboards for quick insights. Enabled by installing the CloudWatch Observability add-on on your cluster, Container Insights deploys and manages the lifecycle of the NVIDIA DCGM Exporter, which collects GPU metrics from NVIDIA’s drivers and exposes them to CloudWatch.
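
A minimal sketch for enabling the add-on with boto3 is shown below; you can do the same through the console, eksctl, or the AWS CLI. The cluster name is a placeholder, and the add-on additionally requires IAM permissions for the CloudWatch agent (for example, the CloudWatchAgentServerPolicy on the node role), which are not shown here.

import boto3

eks = boto3.client("eks")

# Install the CloudWatch Observability add-on, which deploys the agent and the DCGM exporter.
response = eks.create_addon(
    clusterName="my-ml-cluster",  # placeholder cluster name
    addonName="amazon-cloudwatch-observability",
)
print(response["addon"]["addonName"], response["addon"]["status"])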

After you install Container Insights, CloudWatch automatically detects NVIDIA GPUs in your environment and collects critical health and performance metrics. These metrics appear on curated out-of-the-box dashboards. You can also integrate Ray and vLLM with CloudWatch using the Unified CloudWatch Agent to send their native metrics. This unified approach simplifies observability in EKS environments and lets teams focus on performance tuning and cost optimization instead of building monitoring infrastructure.

For a complete list of available metrics, see Amazon EKS and Kubernetes Container Insights metrics. For step-by-step guidance on implementing GPU monitoring, refer to Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights. For practical examples of optimizing inference latency, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference.

Managed Prometheus and Grafana

If your organization needs customized dashboards and advanced visualization capabilities, deploy Prometheus with the NVIDIA DCGM-Exporter and Grafana for Kubernetes-native monitoring. Prometheus scrapes and stores GPU metrics from the DCGM-Exporter, while Grafana provides flexible visualization and alerting capabilities. This approach gives you more control over dashboard design and metric retention compared to CloudWatch Container Insights.
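
As a small illustration of how this stack is consumed, the sketch below queries the Prometheus HTTP API for a DCGM-Exporter power metric. The endpoint is a placeholder for an in-cluster Prometheus server; Amazon Managed Service for Prometheus endpoints additionally require SigV4 authentication, which is omitted here, and the DCGM metric and label names may vary with exporter configuration.

import requests

PROMETHEUS_URL = "http://prometheus-server.monitoring.svc:80"  # placeholder in-cluster endpoint

# Average power draw per GPU, as reported by the DCGM-Exporter.
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "avg by (Hostname, gpu) (DCGM_FI_DEV_POWER_USAGE)"},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]
    print(labels.get("Hostname"), "gpu", labels.get("gpu"), f"{float(value):.0f} W")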

You can extend this monitoring stack by integrating open source frameworks like Ray and vLLM to export their native metrics to Prometheus. You can also connect Grafana to an AWS X-Ray data source to visualize distributed traces and identify performance bottlenecks across your inference pipeline. This combination provides end-to-end visibility from GPU-level metrics through application-level request flows.

For step-by-step guidance on deploying this monitoring stack, refer to Monitoring GPU workloads on Amazon EKS using AWS managed open-source services.

Consider Monitoring Core Training & Fine-Tuning Metrics

Monitor core training metrics to track the health and performance of your Amazon EKS cluster and the machine learning workloads running on it. Training workloads have different monitoring requirements than inference workloads because they run for extended periods, consume resources differently, and require visibility into model convergence and data pipeline efficiency. The metrics below help you identify bottlenecks, optimize resource allocation, and ensure training jobs complete successfully. For step-by-step guidance on implementing this monitoring approach, refer to Introduction to observing machine learning workloads on Amazon EKS.

Resource Usage Metrics

Monitor resource usage metrics to validate that your resources are being properly consumed. These metrics help you identify bottlenecks and root cause performance issues.

  • CPU, Memory, Network, GPU Power and GPU Temperature - Monitor these metrics to ensure allocated resources meet workload demands and identify optimization opportunities. Track metrics like gpu_memory_usage_bytes to identify memory consumption patterns and detect peak usage. Calculate percentiles such as the 95th percentile (P95) to understand the highest memory demands during training. This analysis helps you optimize models and infrastructure to avoid OOM errors and reduce costs.

  • SM Occupancy, SM Activity, FPxx Activity - Monitor these metrics to understand how the underlying GPU resources are being used. As a rule of thumb, target an SM Activity of at least 0.8 (80%).

  • Node and Pod Resource Utilization - Track resource usage at the node and pod level to identify resource contention and potential bottlenecks. Monitor whether nodes approach capacity limits, which can delay pod scheduling and slow training jobs.

  • Resource Utilization Compared to Requests and Limits — Compare actual resource usage against configured requests and limits to determine whether your cluster can handle current workloads and accommodate future ones. This comparison reveals whether you need to adjust resource allocations to avoid OOM errors or resource waste.

  • Internal Metrics from ML Frameworks - Capture internal training and convergence metrics from ML frameworks such as TensorFlow and PyTorch. These metrics include loss curves, learning rate, batch processing time, and training step duration. Visualize these metrics using TensorBoard or similar tools to track model convergence and identify training inefficiencies.

Model Performance Metrics

Monitor model performance metrics to validate that your training process produces models that meet accuracy and business requirements. These metrics help you determine when to stop training, compare model versions, and identify performance degradation.

  • Accuracy, Precision, Recall, and F1-score — Track these metrics to understand how well your model performs on validation data. Calculate the F1-score on a validation set after each training epoch to assess whether the model is improving and when it reaches acceptable performance levels. A minimal sketch for computing these metrics follows this list.

  • Business-Specific Metrics and KPIs — Define and track metrics that directly measure the business value of your AI/ML initiatives. For a recommendation system, track metrics like click-through rate or conversion rate to ensure the model drives the intended business outcomes.

  • Performance over time — Compare performance metrics across model versions and training runs to identify trends and detect degradation. Track whether newer model versions maintain or improve performance compared to baseline models. This historical comparison helps you decide whether to deploy new models or investigate training issues.
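
A minimal sketch for computing these validation metrics after each epoch is shown below. It assumes scikit-learn is available and that y_true and y_pred are label arrays from your validation pass; the weighted average is an assumption suited to imbalanced classes.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def log_validation_metrics(y_true, y_pred, epoch):
    # Weighted averaging handles class imbalance; use "macro" or "binary" if that fits your task better.
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }
    print(f"epoch={epoch} " + " ".join(f"{name}={value:.3f}" for name, value in metrics.items()))
    return metrics

# Example usage after each validation pass:
# log_validation_metrics(val_labels, val_predictions, epoch)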

Data Quality and Drift Metrics

Monitor data quality and drift metrics to ensure your training data remains consistent and representative. Data drift can cause model performance to degrade over time, while data quality issues can prevent models from converging or produce unreliable results.

  • Statistical Properties of Input Data — Track statistical properties such as mean, standard deviation, and distribution of input features over time to detect data drift or anomalies. Monitor whether feature distributions shift significantly from your baseline training data. For example, if the mean of a critical feature changes by more than two standard deviations, investigate whether your data pipeline has changed or whether the underlying data source has shifted.

  • Data Drift Detection and Alerts — Implement automated mechanisms to detect and alert on data quality issues before they impact training. Use statistical tests such as the Kolmogorov-Smirnov test or chi-squared test to compare current data distributions with your original training data. Set up alerts when tests detect significant drift so you can retrain models with updated data or investigate data pipeline issues.
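
The sketch below applies the two-sample Kolmogorov-Smirnov test to one numeric feature, comparing a current batch against a baseline sample from the original training data. The significance level and the alerting hook are assumptions you should adapt to your pipeline.

import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(baseline: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    # Two-sample KS test: a small p-value means the two distributions likely differ.
    statistic, p_value = ks_2samp(baseline, current)
    drifted = p_value < alpha
    print(f"KS statistic={statistic:.3f} p-value={p_value:.4f} drift={'yes' if drifted else 'no'}")
    return drifted

# Example usage: compare one feature column from the latest batch against the stored baseline.
# if detect_feature_drift(baseline_feature_values, current_feature_values):
#     send_alert_or_trigger_retraining()  # hypothetical hook into your alerting pipeline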

Latency and Throughput Metrics

Monitor latency and throughput metrics to identify bottlenecks in your training pipeline and optimize resource utilization. These metrics help you understand where time is spent during training and where to focus optimization efforts.

  • End-to-End Latency of ML Training Pipelines — Measure the total time for data to flow through your entire training pipeline, from data ingestion to model update. Track this metric across training runs to identify whether pipeline changes improve or degrade performance. High latency often indicates bottlenecks in data loading, preprocessing, or network communication between nodes.

  • Training Throughput and Processing Rate — Track the volume of data your training pipeline processes per unit of time to ensure efficient resource utilization. Monitor metrics such as samples processed per second or batches completed per minute. Low throughput relative to your hardware capacity suggests inefficiencies in data loading, preprocessing, or model computation that waste GPU cycles.

  • Checkpoint Save and Restore Latency – Monitor the time required to save model checkpoints to storage (S3, EFS, FSx) and restore them to GPU or CPU memory when resuming jobs or recovering from failures. Slow checkpoint operations extend job recovery time and increase costs. Track checkpoint size, save duration, restore duration, and failure count to identify optimization opportunities such as compression or faster storage tiers.

  • Data Loading and Preprocessing Time - Measure the time spent loading data from storage and applying preprocessing transformations. Compare this time against model computation time to determine whether your training is data-bound or compute-bound. If data loading consumes more than 20% of total training time, consider optimizing your data pipeline with caching, prefetching, or faster storage.
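
The sketch below splits each training epoch’s wall-clock time into data-loading time and computation time and derives samples-per-second throughput. It assumes a PyTorch-style iterable dataloader, a train_step callable, and batches that support len(), all of which are placeholders for your own training loop.

import time

def timed_training_epoch(dataloader, train_step):
    # Attribute epoch wall-clock time to the data pipeline versus model computation.
    data_time = 0.0
    compute_time = 0.0
    samples = 0
    end = time.perf_counter()
    for batch in dataloader:
        data_time += time.perf_counter() - end       # time spent waiting on loading/preprocessing
        start = time.perf_counter()
        train_step(batch)                            # forward/backward/optimizer step
        compute_time += time.perf_counter() - start
        samples += len(batch)
        end = time.perf_counter()
    total = data_time + compute_time
    print(f"throughput: {samples / total:.1f} samples/s, "
          f"data loading: {data_time / total:.0%} of epoch time")
    return data_time, compute_time

If data loading stays above roughly 20% of epoch time across runs, the data pipeline, not the model, is the first place to optimize.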

Error Rates and Failures

Monitor error rates and failures throughout your training pipeline to maintain reliability and prevent wasted compute resources. Undetected errors can cause training jobs to fail silently, produce invalid models, or waste hours of GPU time before you notice problems.

  • Pipeline Error Monitoring — Track errors across all stages of your ML pipeline, including data preprocessing, model training, and checkpoint operations. Log error types, frequencies, and affected components to quickly identify issues. Common errors include data format mismatches, out-of-memory failures during preprocessing, and checkpoint save failures due to storage limits. Set up alerts when error rates exceed baseline thresholds so you can investigate before errors cascade.

  • Recurring Error Analysis — Identify and investigate patterns in recurring errors to prevent future failures and improve pipeline reliability. Analyze logs to find whether specific data samples, batch sizes, or training configurations consistently cause failures. For example, if certain input data types trigger preprocessing errors, add validation checks earlier in the pipeline or update your data cleaning logic. Track the mean time between failures (MTBF) to measure whether your pipeline reliability improves over time.

Kubernetes and EKS Specific Metrics

Monitor Kubernetes and EKS metrics to ensure your cluster infrastructure remains healthy and can support your training workloads. These metrics help you detect infrastructure issues before they cause training job failures or performance degradation.

  • Kubernetes Cluster State Metrics — Monitor the health and status of Kubernetes objects including pods, nodes, deployments, and services. Track pod status to identify pods stuck in pending, failed, or crash loop states. Monitor node conditions to detect issues like disk pressure, memory pressure, or network unavailability. Use kubectl or monitoring tools to check these metrics continuously and set up alerts when pods fail to start or nodes become unschedulable. A sketch using the Kubernetes Python client follows this list.

  • Training Pipeline Execution Metrics — Track successful and failed pipeline runs, job durations, step completion times, and orchestration errors. Monitor whether training jobs complete within expected time windows and whether failure rates increase over time. Track metrics such as job success rate, average job duration, and time to failure. These metrics help you identify whether infrastructure issues, configuration problems, or data quality issues cause training failures.

  • AWS Service Metrics — Track metrics for AWS services that support your EKS infrastructure and training workloads. Monitor S3 metrics such as request latency, error rates, and throughput to ensure data loading performance remains consistent. Track EBS volume metrics including IOPS, throughput, and queue length to detect storage bottlenecks. Monitor VPC flow logs and network metrics to identify connectivity issues between nodes or to external services.

  • Kubernetes Control Plane Metrics — Monitor the API server, scheduler, controller manager, and etcd database to detect performance issues or failures that affect cluster operations. Track API server request latency, request rate, and error rate to ensure the control plane responds quickly to scheduling requests. Monitor etcd database size, commit duration, and leader changes to detect stability issues. High API server latency or frequent etcd leader changes can delay pod scheduling and extend training job startup times.
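
As a minimal sketch of the cluster state checks described above, the snippet below uses the official Kubernetes Python client to flag pods that are not in a healthy phase and nodes reporting pressure conditions. It assumes a reachable kubeconfig (or in-cluster credentials) with read access.

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

# Pods stuck outside Running/Succeeded often indicate scheduling, image, or crash-loop issues.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(f"pod {pod.metadata.namespace}/{pod.metadata.name} is {pod.status.phase}")

# Node pressure conditions can delay pod scheduling and slow training jobs.
for node in v1.list_node().items:
    for condition in node.status.conditions or []:
        if condition.type in ("MemoryPressure", "DiskPressure", "PIDPressure") and condition.status == "True":
            print(f"node {node.metadata.name} reports {condition.type}")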

Application and Instance Logs

Collect and analyze application and instance logs to diagnose issues that metrics alone cannot explain. Logs provide detailed context about errors, state changes, and system events that help you understand why training jobs fail or perform poorly. Correlating logs with metrics allows you to pinpoint root causes faster.

  • Application Logs - Collect application logs from your training jobs, data pipelines, and ML frameworks to identify bottlenecks and diagnose failures. These logs capture detailed information about job execution, including data loading errors, model initialization failures, checkpoint save errors, and framework-specific warnings. Correlate log timestamps with metric spikes to understand what caused performance degradation or failures. For example, if GPU utilization drops suddenly, check application logs for errors indicating data pipeline stalls or preprocessing failures. Use centralized logging tools like CloudWatch Logs or Fluent Bit to aggregate logs from all pods and make them searchable.

  • Instance Logs - Collect instance-level logs such as system journal logs and dmesg output to detect hardware issues and kernel-level problems. These logs reveal issues like GPU driver errors, memory allocation failures, disk I/O errors, and network interface problems that may not appear in application logs. Correlate instance logs with application logs and metrics to determine whether training failures stem from hardware problems or application issues. For example, if a training job fails with an out-of-memory error, check dmesg logs for kernel OOM killer messages that indicate whether the system ran out of memory or whether the application exceeded its container limits. Set up alerts for critical hardware errors such as GPU XID errors or disk failures so you can replace failing instances before they cause widespread training disruptions.

The following sections show how to collect the metrics described above using two AWS-recommended approaches: CloudWatch Container Insights and Amazon Managed Prometheus with Amazon Managed Grafana. Choose CloudWatch Container Insights if you prefer AWS-native tools with minimal setup and pre-built dashboards. Choose Amazon Managed Prometheus with Amazon Managed Grafana if you need customized dashboards, advanced visualization capabilities, or want to integrate with existing Prometheus-based monitoring infrastructure. For a complete list of available Container Insights metrics, see Amazon EKS and Kubernetes Container Insights metrics.

Consider Monitoring Real-time Online Inference Metrics

In real-time systems, low latency is critical for providing timely responses to users or other dependent systems. High latency can degrade user experience or violate performance requirements. Components that influence inference latency include model loading time, pre-processing time, actual prediction time, post-processing time, and network transmission time. We recommend monitoring inference latency to ensure low-latency responses that meet service-level agreements (SLAs), and developing custom metrics for the items below. When you test, do so under expected load, include network latency, account for concurrent requests, and use varying batch sizes. A sketch that measures TTFT and output tokens per second from a streaming response follows this list.

  • Time to First Token (TTFT) — Amount of time from when a user submits a request until they receive the beginning of a response (the first word, token, or chunk). For example, in chatbots, you’d check how long it takes to generate the first piece of output (token) after the user asks a question.

  • End-to-End Latency — The total time from when a request is received to when the response is sent back, including inference and any pre- and post-processing.

  • Output Tokens Per Second (TPS) — Indicates how quickly your model generates new tokens after it starts responding. For example, in chatbots, track the generation speed of your language model against a baseline prompt.

  • Error Rate — Tracks failed requests, which can indicate performance issues. For example, monitor failed requests for large documents or certain characters.

  • Throughput — Measure the number of requests or operations the system can handle per unit of time. For example, track requests per second to handle peak loads.
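
The sketch below measures TTFT, end-to-end latency, and output tokens per second from any streaming response. The stream argument is a placeholder for whatever iterable of tokens or chunks your inference client returns.

import time

def measure_streaming_latency(stream):
    # `stream` is any iterable yielding output tokens/chunks as they are generated.
    start = time.perf_counter()
    first_token_time = None
    tokens = 0
    for _ in stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # first token received
        tokens += 1
    end = time.perf_counter()

    ttft = (first_token_time or end) - start
    end_to_end = end - start
    generation_time = end - (first_token_time or end)
    tokens_per_second = tokens / generation_time if generation_time > 0 else 0.0
    print(f"TTFT={ttft:.3f}s  end-to-end={end_to_end:.3f}s  output TPS={tokens_per_second:.1f}")
    return ttft, end_to_end, tokens_per_second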

K/V (key/value) caching is a powerful optimization technique for inference latency, and it is particularly relevant for transformer-based models. The K/V cache stores the key and value tensors from previous transformer layer computations, which reduces redundant computation during autoregressive inference, particularly in large language models (LLMs). Track cache efficiency metrics (specifically for K/V or session caches):

  • Cache hit/miss ratio — For inference setups leveraging caching (K/V or embedding caches), measure how often cache is helping. Low hit rates may indicate suboptimal cache config or workload changes, both of which can increase latency.
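
Serving engines such as vLLM publish their own cache metrics, but for a custom session or embedding cache you can record hits and misses yourself. The sketch below does this with Prometheus counters; the metric names and the dictionary-backed cache are assumptions for illustration.

from prometheus_client import Counter, start_http_server

start_http_server(8080)

cache_hits = Counter('kv_cache_hits_total', 'Cache hits for the K/V or session cache')
cache_misses = Counter('kv_cache_misses_total', 'Cache misses for the K/V or session cache')

def cached_lookup(cache: dict, key):
    # Record a hit or miss on every lookup so the ratio can be derived in PromQL.
    if key in cache:
        cache_hits.inc()
        return cache[key]
    cache_misses.inc()
    return None

# Hit ratio in Grafana/PromQL:
#   sum(rate(kv_cache_hits_total[5m]))
#     / (sum(rate(kv_cache_hits_total[5m])) + sum(rate(kv_cache_misses_total[5m])))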

The following topics demonstrate how to gather data for several of the metrics mentioned above using the two AWS-recommended approaches: AWS-native CloudWatch Container Insights and Amazon Managed Prometheus with Amazon Managed Grafana (built on open-source Prometheus and Grafana). Choose one of these solutions based on your overall observability needs. See Amazon EKS and Kubernetes Container Insights metrics for the complete list of Container Insights metrics.

Tracking GPU Memory Usage

As discussed in the Consider Monitoring Core Training & Fine-Tuning Metrics topic, GPU memory usage is essential to prevent out-of-memory (OOM) errors and ensure efficient resource utilization. The following examples show how to instrument your training application to expose a custom histogram metric, gpu_memory_usage_bytes, and calculate the P95 memory usage to identify peak consumption. Be sure to test with a sample training job (e.g., fine-tuning a transformer model) in a staging environment.

AWS-Native CloudWatch Container Insights Example

This sample demonstrates how to instrument your training application to expose gpu_memory_usage_bytes as a histogram using the AWS-native approach. Your AI/ML container must be configured to emit structured logs in the CloudWatch Embedded Metrics Format (EMF). Use aws_embedded_metrics in your training application to send structured EMF logs to CloudWatch Logs, which parses them and publishes the metrics.

from aws_embedded_metrics import metric_scope
import torch
import numpy as np

memory_usage = []

@metric_scope
def log_gpu_memory(metrics):
    # Record current GPU memory usage
    mem = torch.cuda.memory_allocated()
    memory_usage.append(mem)

    # Log as histogram metric
    metrics.set_namespace("MLTraining/GPUMemory")
    metrics.put_metric("gpu_memory_usage_bytes", mem, "Bytes", "Histogram")

    # Calculate and log P95 if we have enough data points
    if len(memory_usage) >= 10:
        p95 = np.percentile(memory_usage, 95)
        metrics.put_metric("gpu_memory_p95_bytes", p95, "Bytes")
        print(f"Current memory: {mem} bytes, P95: {p95} bytes")

# Example usage in training loop
for epoch in range(20):
    # Your model training code would go here
    log_gpu_memory()

Prometheus and Grafana Example

This sample demonstrates how to instrument your training application to expose gpu_memory_usage_bytes as a histogram using the Prometheus client library in Python.

from prometheus_client import Histogram, start_http_server
import torch

start_http_server(8080)

memory_usage = Histogram(
    'gpu_memory_usage_bytes',
    'GPU memory usage during training',
    ['gpu_index'],
    buckets=[1e9, 2e9, 4e9, 8e9, 16e9, 32e9]
)

# Function to get GPU memory usage
def get_gpu_memory_usage():
    if torch.cuda.is_available():
        # Get the current GPU device
        device = torch.cuda.current_device()
        # Get memory usage in bytes
        memory_allocated = torch.cuda.memory_allocated(device)
        memory_reserved = torch.cuda.memory_reserved(device)
        # Total memory usage (allocated + reserved)
        total_memory = memory_allocated + memory_reserved
        return device, total_memory
    else:
        return None, 0

# Record GPU memory usage in the histogram
gpu_index, memory_used = get_gpu_memory_usage()
if gpu_index is not None:
    memory_usage.labels(gpu_index=str(gpu_index)).observe(memory_used)

Track Inference Request Duration for Real-Time Online Inference

As discussed in the Consider Monitoring Real-time Online Inference Metrics topic, low latency is critical for providing timely responses to users or other dependent systems. The following examples show how to track a custom histogram metric, inference_request_duration_seconds, exposed by your inference application. Calculate the 95th percentile (P95) latency to focus on worst-case scenarios, test with synthetic inference requests (e.g., via Locust) in a staging environment, and set alert thresholds (e.g., >500ms) to detect SLA violations.

AWS-Native CloudWatch Container Insights Example

This sample demonstrates how to create a custom histogram metric in your inference application for inference_request_duration_seconds using AWS CloudWatch Embedded Metric Format.

import time
from aws_embedded_metrics import metric_scope, MetricsLogger

@metric_scope
def log_inference_duration(metrics: MetricsLogger, duration: float):
    metrics.set_namespace("ML/Inference")
    metrics.put_metric("inference_request_duration_seconds", duration, "Seconds", "Histogram")
    metrics.set_property("Buckets", [0.1, 0.5, 1, 2, 5])

@metric_scope
def process_inference_request(metrics: MetricsLogger):
    start_time = time.time()
    # Your inference processing code here
    # For example:
    # result = model.predict(input_data)
    duration = time.time() - start_time
    log_inference_duration(metrics, duration)
    print(f"Inference request processed in {duration} seconds")

# Example usage
process_inference_request()

Prometheus and Grafana Example

This sample demonstrates how to create a custom histogram metric in your inference application for inference_request_duration_seconds using the Prometheus client library in Python:

from prometheus_client import Histogram, start_http_server
import time

start_http_server(8080)

request_duration = Histogram(
    'inference_request_duration_seconds',
    'Inference request latency',
    buckets=[0.1, 0.5, 1, 2, 5]
)

start_time = time.time()
# Process inference request
request_duration.observe(time.time() - start_time)

In Grafana, use the query histogram_quantile(0.95, sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, pod)) to visualize P95 latency trends. To learn more, see Prometheus Histogram Documentation and Prometheus Client Documentation.

Track Token Throughput for Real-Time Online Inference

As discussed in the Consider Monitoring Core Training & Fine-Tuning Metrics topic, we recommend monitoring token processing time to gauge model performance and optimize scaling decisions. The following examples show how to track a custom histogram metric, token_processing_duration_seconds, exposed by your inference application. Calculate the 95th percentile (P95) duration to analyze processing efficiency, test with simulated request loads (e.g., 100 to 1000 requests/second) in a non-production cluster, and adjust KEDA triggers to optimize scaling.

AWS-Native CloudWatch Container Insights Example

This sample demonstrates how to create a custom histogram metric in your inference application for token_processing_duration_seconds using AWS CloudWatch Embedded Metric Format. It uses dimensions (set_dimension) with a custom get_duration_bucket function to categorize durations into buckets (e.g., "<=0.01", ">1").

import time
from aws_embedded_metrics import metric_scope, MetricsLogger

@metric_scope
def log_token_processing(metrics: MetricsLogger, duration: float, token_count: int):
    metrics.set_namespace("ML/TokenProcessing")
    metrics.put_metric("token_processing_duration_seconds", duration, "Seconds")
    metrics.set_dimension("ProcessingBucket", get_duration_bucket(duration))
    metrics.set_property("TokenCount", token_count)

def get_duration_bucket(duration):
    buckets = [0.01, 0.05, 0.1, 0.5, 1]
    for bucket in buckets:
        if duration <= bucket:
            return f"<={bucket}"
    return f">{buckets[-1]}"

@metric_scope
def process_tokens(input_text: str, model, tokenizer, metrics: MetricsLogger):
    tokens = tokenizer.encode(input_text)
    token_count = len(tokens)
    start_time = time.time()
    # Process tokens (replace with your actual processing logic)
    output = model(tokens)
    duration = time.time() - start_time
    log_token_processing(metrics, duration, token_count)
    print(f"Processed {token_count} tokens in {duration} seconds")
    return output

Prometheus and Grafana Example

This sample demonstrates how to create a custom histogram metric in your inference application for token_processing_duration_seconds using the Prometheus client library in Python.

from prometheus_client import Histogram, start_http_server
import time

start_http_server(8080)

token_duration = Histogram(
    'token_processing_duration_seconds',
    'Token processing time per request',
    buckets=[0.01, 0.05, 0.1, 0.5, 1]
)

start_time = time.time()
# Process tokens
token_duration.observe(time.time() - start_time)

In Grafana, use the query histogram_quantile(0.95, sum(rate(token_processing_duration_seconds_bucket[5m])) by (le, pod)) to visualize P95 processing time trends. To learn more, see Prometheus Histogram Documentation and Prometheus Client Documentation.

Measure Checkpoint Restore Latency

As discussed in the Consider Monitoring Core Training & Fine-Tuning Metrics topic, checkpoint latency is a critical metric during multiple phases of the model lifecycle. The following examples show how to track a custom histogram metric, checkpoint_restore_duration_seconds, exposed by your application. Calculate the 95th percentile (P95) duration to monitor restore performance, test with Spot interruptions in a non-production cluster, and set alert thresholds (e.g., restores taking longer than 30 seconds) to detect delays.

AWS-Native CloudWatch Container Insights Example

This sample demonstrates how to instrument your batch application to expose checkpoint_restore_duration_seconds as a histogram using AWS CloudWatch Embedded Metric Format:

import time
import torch
from aws_embedded_metrics import metric_scope, MetricsLogger

@metric_scope
def log_checkpoint_restore(metrics: MetricsLogger, duration: float):
    metrics.set_namespace("ML/ModelOperations")
    metrics.put_metric("checkpoint_restore_duration_seconds", duration, "Seconds", "Histogram")
    metrics.set_property("Buckets", [1, 5, 10, 30, 60])
    metrics.set_property("CheckpointSource", "s3://my-bucket/checkpoint.pt")

@metric_scope
def load_checkpoint(model, checkpoint_path: str, metrics: MetricsLogger):
    start_time = time.time()
    # Load model checkpoint
    model.load_state_dict(torch.load(checkpoint_path))
    duration = time.time() - start_time
    log_checkpoint_restore(metrics, duration)
    print(f"Checkpoint restored in {duration} seconds")

Prometheus and Grafana Example

This sample demonstrates how to instrument your batch application to expose checkpoint_restore_duration_seconds as a histogram using the Prometheus client library in Python:

from prometheus_client import Histogram, start_http_server
import torch

start_http_server(8080)

restore_duration = Histogram(
    'checkpoint_restore_duration_seconds',
    'Time to restore checkpoint',
    buckets=[1, 5, 10, 30, 60]
)

# Time the restore; `model` and the checkpoint path come from your training job.
with restore_duration.time():
    model.load_state_dict(torch.load("s3://my-bucket/checkpoint.pt"))

In Grafana, use the query histogram_quantile(0.95, sum(rate(checkpoint_restore_duration_seconds_bucket[5m])) by (le)) to visualize P95 restore latency trends. To learn more, see Prometheus Histogram Documentation and Prometheus Client Documentation.