Slurm metrics in AWS PCS

AWS PCS supports Slurm's metrics feature, which exposes real-time cluster data through HTTP endpoints compatible with Prometheus and other monitoring systems. For details, including performance impact and security considerations, see the Metrics Guide in the Slurm documentation.

Prerequisites

Before enabling Slurm metrics, ensure you have:

  • Cluster version: Slurm version 25.11 or later.

  • Security group: Inbound rules allowing TCP traffic (the metrics endpoint speaks HTTP) on port 6817 from the sources that will query the metrics.
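
As an illustration of the security group prerequisite, the following is a minimal sketch that builds an inbound rule for port 6817 and applies it with the AWS SDK for Python (boto3). The function names, the rule description text, and the choice to reject a "0.0.0.0/0" source are assumptions made for this example, not part of AWS PCS; the security group ID and CIDR range are placeholders you would supply.

```python
import ipaddress


def metrics_ingress_rule(source_cidr: str, port: int = 6817) -> dict:
    """Build an EC2 IpPermissions entry allowing TCP traffic to the metrics port.

    Hypothetical helper for illustration only.
    """
    network = ipaddress.ip_network(source_cidr)  # raises ValueError if malformed
    if network.prefixlen == 0:
        # The metrics endpoint is unauthenticated, so refuse a wide-open source.
        raise ValueError("refusing to open the metrics port to all addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [
            {"CidrIp": source_cidr, "Description": "Slurm metrics scraper"}
        ],
    }


def allow_metrics_access(group_id: str, source_cidr: str) -> None:
    """Add the inbound rule to an existing security group (needs boto3 and credentials)."""
    import boto3  # imported lazily so the rule builder works without the SDK

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[metrics_ingress_rule(source_cidr)],
    )
```

For example, allow_metrics_access("sg-0123456789abcdef0", "10.0.0.0/16") would allow hosts in that range to reach port 6817; both values here are placeholders.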

Enable the metrics endpoint

Set the following cluster-level custom Slurm settings:

  • MetricsType – Must specify a supported metrics plugin, such as metrics/openmetrics.

  • CommunicationParameters – Must include enable_http.

    Important

    Enabling enable_http exposes an unauthenticated HTTP endpoint. Anyone with network access to port 6817 can read cluster, job, and node metrics. Use security group rules to restrict access to trusted sources only.

  • PrivateData – Must not be set.

For additional information on custom Slurm settings, see Configuring custom Slurm settings in AWS PCS.
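
The three requirements above can be checked before you apply the settings. The following is a rough sketch, assuming the custom settings are held as a list of name/value pairs with the keys parameterName and parameterValue; the function name and the exact checks are illustrative assumptions, and the sketch does not verify that the named plugin is actually supported.

```python
def validate_metrics_settings(settings: list) -> list:
    """Return a list of problems; an empty list means all three requirements hold.

    `settings` is assumed to look like:
    [{"parameterName": "MetricsType", "parameterValue": "metrics/openmetrics"}, ...]
    """
    values = {s["parameterName"]: s["parameterValue"] for s in settings}
    problems = []

    # MetricsType must be set to a metrics plugin (only presence is checked here).
    if not values.get("MetricsType", "").strip():
        problems.append("MetricsType is not set (for example, metrics/openmetrics)")

    # CommunicationParameters is a comma-separated list that must include enable_http.
    comm = [p.strip() for p in values.get("CommunicationParameters", "").split(",")]
    if "enable_http" not in comm:
        problems.append("CommunicationParameters does not include enable_http")

    # PrivateData must not be set at all.
    if "PrivateData" in values:
        problems.append("PrivateData is set; it must not be set")

    return problems


if __name__ == "__main__":
    example = [
        {"parameterName": "MetricsType", "parameterValue": "metrics/openmetrics"},
        {"parameterName": "CommunicationParameters", "parameterValue": "enable_http"},
    ]
    for problem in validate_metrics_settings(example):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets you report every misconfigured setting in a single pass.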

Use the metrics endpoint

Query the metrics endpoint from a host with network access to the controller:

curl http://controller-ip:6817/metrics
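
The same query can be made from a script. The following is a minimal sketch, assuming the endpoint returns the Prometheus/OpenMetrics text format; the function names are hypothetical, the parser handles only simple "name value" and "name{labels} value" sample lines, and it is not a full implementation of the format.

```python
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse text-format metrics into {series: value}, skipping comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# HELP" / "# TYPE" / "# EOF" comments
        if "}" in line:
            # The labels end at the last "}"; the value follows it.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        samples[series] = float(rest[0])  # ignore any trailing timestamp
    return samples


def fetch_metrics(controller_ip: str, port: int = 6817, timeout: float = 5.0) -> dict:
    """Fetch and parse /metrics from the controller (plain, unauthenticated HTTP)."""
    url = f"http://{controller_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling fetch_metrics with your controller's IP address returns a dictionary keyed by series name. The metric names you see depend on the Slurm version and the configured plugin.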

For the list of available metrics and for scraping configuration, see the Metrics Guide in the Slurm documentation.