Set up a Grafana monitoring dashboard for AWS ParallelCluster
Dario La Porta and William Lu, Amazon Web Services
Summary
AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers. Although AWS ParallelCluster is integrated with Amazon CloudWatch for logging and metrics, it doesn't provide a monitoring dashboard for the workload.
The Grafana dashboard for AWS ParallelCluster in this pattern does the following:
- Supports AWS ParallelCluster v3 
- Uses the latest version of open source packages, including Prometheus, Grafana, Prometheus Slurm Exporter, and NVIDIA DCGM-Exporter 
- Reports the number of CPU cores and GPUs that the Slurm jobs use 
- Adds a job monitoring dashboard 
- Enhances the GPU node monitoring dashboard for nodes with 4 or 8 graphics processing units (GPUs) 
This version of the enhanced solution has been implemented and verified in an AWS customer's HPC production environment.
Prerequisites and limitations
Prerequisites
- AWS ParallelCluster CLI, installed and configured. 
- A supported network configuration for AWS ParallelCluster. This pattern uses the two-subnet configuration described in AWS ParallelCluster using two subnets, which requires a public subnet, a private subnet, an internet gateway, and a NAT gateway. 
- All AWS ParallelCluster cluster nodes must have internet access. This is required so that the installation scripts can download the open source software and Docker images. 
- A key pair in Amazon Elastic Compute Cloud (Amazon EC2). Resources that have this key pair have Secure Shell (SSH) access to the head node. 
Limitations
- This pattern is designed to support Ubuntu 20.04 LTS. If you're using a different version of Ubuntu or if you use Amazon Linux or CentOS, then you need to modify the scripts provided with this solution. These modifications are not included in this pattern. 
Product versions
- Ubuntu 20.04 LTS 
- ParallelCluster 3.X 
Billing and cost considerations
- The solution deployed in this pattern is not covered by the free tier. Charges apply for Amazon EC2, Amazon FSx for Lustre, the NAT gateway in Amazon VPC, and Amazon Route 53. 
Architecture
Target architecture
The following diagram shows how a user can access the monitoring dashboard for AWS ParallelCluster on the head node. The head node runs NICE DCV, Prometheus, Grafana, Prometheus Slurm Exporter, Prometheus Node Exporter, and NGINX Open Source. The compute nodes run Prometheus Node Exporter, and they also run NVIDIA DCGM-Exporter if the node contains GPUs. The head node retrieves information from the compute nodes and displays that data in the Grafana dashboard.
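To make the data flow concrete: each exporter publishes its metrics as plain text over HTTP, and Prometheus on the head node scrapes that text at a regular interval. The following is a minimal sketch of reading that text format, not the parser that Prometheus uses. The sample values and the host name are made up for illustration, and the function name `parse_metrics` is hypothetical.

```python
import re

# Illustrative scrape output. Real exporters emit many more lines, and the
# values and host name below are invented for this example.
SAMPLE_SCRAPE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="compute-1"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="compute-1"} 12
"""

# metric_name, optional {label="value",...} block, then the sample value.
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text.

    Simplified: comment lines are skipped, escape sequences inside label
    values are not decoded, and optional timestamps are ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

The key-value pairs in braces are the labels that Grafana uses to filter a panel, for example by GPU index or by host.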

In most cases, the head node is not heavily loaded because the job scheduler doesn't require a significant amount of CPU or memory. Users access the dashboard on the head node over HTTPS (SSL/TLS) on port 443.
All authorized viewers can anonymously view the monitoring dashboards. Only the Grafana administrator can modify dashboards. You configure a password for the Grafana administrator in the aws-parallelcluster-monitoring/docker-compose/docker-compose.head.yml file.
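If you don't already have a password in mind for the Grafana administrator, you can generate a random one. The following is a minimal sketch that uses the Python standard library. The function name `generate_password` and the default length are assumptions for illustration. How the value is written into the docker-compose file is described in the repository, not here.

```python
import secrets
import string


def generate_password(length=24):
    """Return a random alphanumeric password.

    Only letters and digits are used so that the value needs no quoting
    or escaping when it is pasted into a YAML file or a shell command.
    """
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

The `secrets` module draws from the operating system's cryptographically secure random source, which makes it a better fit for credentials than the `random` module.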
Tools
AWS services
- NICE DCV is a high-performance remote display protocol that helps you deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions. 
- AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports AWS Batch and Slurm open source job schedulers. 
- Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data. 
- Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you’ve defined. 
Other tools
- Docker is a set of platform as a service (PaaS) products that use virtualization at the operating-system level to deliver software in containers. 
- Grafana is open source software that helps you query, visualize, alert on, and explore metrics, logs, and traces. 
- NGINX Open Source is an open source web server and reverse proxy. 
- NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center graphics processing units (GPUs) in cluster environments. In this pattern, you use DCGM-Exporter, which exposes GPU metrics so that Prometheus can collect them. 
- Prometheus is an open source system-monitoring toolkit that collects and stores its metrics as time-series data with associated key-value pairs, which are called labels. In this pattern, you also use Prometheus Slurm Exporter to collect and export Slurm metrics, and you use Prometheus Node Exporter to export metrics from the compute nodes. 
- Ubuntu is an open source, Linux-based operating system that is designed for enterprise servers, desktops, cloud environments, and IoT. 
Code repository
The code for this pattern is available in the GitHub pcluster-monitoring-dashboard repository.
Epics
| Task | Description | Skills required | 
|---|---|---|
| Create an S3 bucket. | Create an Amazon S3 bucket. You use this bucket to store the configuration scripts. For instructions, see Creating a bucket in the Amazon S3 documentation. | General AWS | 
| Clone the repository. | Clone the GitHub pcluster-monitoring-dashboard repository. | DevOps engineer | 
| Create an admin password. | | Linux Shell scripting | 
| Copy the required files into the S3 bucket. | Copy the post_install.sh | General AWS | 
| Configure an additional security group for the head node. | | AWS administrator | 
| Configure an IAM policy for the head node. | Create an identity-based policy for the head node. This policy allows the node to retrieve metric data from Amazon CloudWatch. The GitHub repo contains an example policy. | AWS administrator | 
| Configure an IAM policy for the compute nodes. | Create an identity-based policy for the compute nodes. This policy allows the node to create the tags that contain the job ID and job owner. The GitHub repo contains an example policy. If you use the provided example file, replace the following values: | AWS administrator | 
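The two IAM policies in the preceding table can be sketched as JSON documents. The following example is an illustration of the permissions that the task descriptions name. It is not the example policy from the GitHub repository, which might grant different or additional actions. The function names and the `region` and `account_id` parameters are hypothetical, and the ARN assumes the standard `aws` partition.

```python
import json


def head_node_policy():
    """Policy that lets the head node read metric data from CloudWatch."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudwatch:GetMetricData", "cloudwatch:ListMetrics"],
                # These CloudWatch actions don't support resource-level scoping.
                "Resource": "*",
            }
        ],
    }


def compute_node_policy(region, account_id):
    """Policy that lets compute nodes tag EC2 instances (assumed scope)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:CreateTags"],
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
            }
        ],
    }


def to_json(policy):
    """Serialize a policy dict to the JSON text that IAM accepts."""
    return json.dumps(policy, indent=2)
```

Compare the output of `to_json` with the example files in the repository before you create the policies, and prefer the repository versions if they differ.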
| Task | Description | Skills required | 
|---|---|---|
| Modify the provided cluster template file. | Create the AWS ParallelCluster cluster. Use the provided cluster.yaml | AWS administrator | 
| Create the cluster. | In the AWS ParallelCluster CLI, enter the following command. This deploys the CloudFormation template and creates the cluster. For more information about this command, see pcluster create-cluster in the AWS ParallelCluster documentation. | AWS administrator | 
| Monitor the cluster creation. | Enter the following command to monitor the cluster creation. For more information about this command, see pcluster describe-cluster in the AWS ParallelCluster documentation. | AWS administrator | 
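The create and monitor steps in the preceding table can be scripted. The following is a minimal sketch that builds the `pcluster create-cluster` and `pcluster describe-cluster` command lines for AWS ParallelCluster v3 and reads the `clusterStatus` field from the JSON that `describe-cluster` prints. The helper names, the polling interval, and the cluster name and Region that you pass in are assumptions for illustration.

```python
import json
import subprocess
import time


def create_cluster_command(cluster_name, config_path, region):
    """Argument list for creating a cluster from a configuration file."""
    return [
        "pcluster", "create-cluster",
        "--cluster-name", cluster_name,
        "--cluster-configuration", config_path,
        "--region", region,
    ]


def describe_cluster_command(cluster_name, region):
    """Argument list for querying the current state of a cluster."""
    return [
        "pcluster", "describe-cluster",
        "--cluster-name", cluster_name,
        "--region", region,
    ]


def cluster_status(describe_output):
    """Extract clusterStatus (for example, CREATE_IN_PROGRESS) from the JSON output."""
    return json.loads(describe_output)["clusterStatus"]


def wait_for_cluster(cluster_name, region, poll_seconds=30):
    """Poll describe-cluster until the status no longer ends in _IN_PROGRESS."""
    while True:
        result = subprocess.run(
            describe_cluster_command(cluster_name, region),
            check=True, capture_output=True, text=True,
        )
        status = cluster_status(result.stdout)
        if not status.endswith("_IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)
```

A status of `CREATE_COMPLETE` means that the cluster is ready. If the function returns `CREATE_FAILED`, review the failure details in the `describe-cluster` output before you retry.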
| Task | Description | Skills required | 
|---|---|---|
| Access the Grafana portal. | | AWS administrator | 
| Task | Description | Skills required | 
|---|---|---|
| Delete the cluster. | Enter the following command to delete the cluster. For more information about this command, see pcluster delete-cluster in the AWS ParallelCluster documentation. | AWS administrator | 
| Delete the IAM policies. | Delete the policies that you created for the head node and compute node. For more information about deleting policies, see Deleting IAM policies in the IAM documentation. | AWS administrator | 
| Delete the security group and rule. | Delete the security group that you created for the head node. For more information, see Delete security group rules and Delete a security group in the Amazon VPC documentation. | AWS administrator | 
| Delete the S3 bucket. | Delete the S3 bucket that you created to store the configuration scripts. For more information, see Deleting a bucket in the Amazon S3 documentation. | General AWS | 
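The cleanup tasks in the preceding table run in a fixed order, because the cluster uses the policies, the security group, and the bucket. The following is a minimal sketch that lists one possible set of commands in that order, using `pcluster delete-cluster` and the AWS CLI. The function name and all parameter values are hypothetical. The sketch only builds the command lines and doesn't run them.

```python
def cleanup_commands(cluster_name, region, policy_arns, security_group_id, bucket_name):
    """Return cleanup command lines in the order of the table above.

    Cluster deletion is asynchronous. Wait until it finishes before you
    run the remaining commands, because the cluster still references the
    policies and the security group while it exists.
    """
    commands = [
        ["pcluster", "delete-cluster",
         "--cluster-name", cluster_name, "--region", region],
    ]
    for arn in policy_arns:
        commands.append(["aws", "iam", "delete-policy", "--policy-arn", arn])
    commands.append(
        ["aws", "ec2", "delete-security-group",
         "--group-id", security_group_id, "--region", region]
    )
    # --force empties the bucket before removing it.
    commands.append(["aws", "s3", "rb", f"s3://{bucket_name}", "--force"])
    return commands
```

Because `aws s3 rb --force` removes every object in the bucket, confirm that the bucket holds only the configuration scripts for this pattern before you run it.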
Troubleshooting
| Issue | Solution | 
|---|---|
| The head node is not accessible in the browser. | Check the security group and confirm that the inbound port 443 is open. | 
| Grafana doesn't open. | On the head node, check the container log for  | 
| Some metrics have no data. | On the head node, check the container logs of all containers. | 
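For the first issue in the preceding table, you can confirm from your workstation whether port 443 on the head node accepts connections before you inspect the security group. The following is a minimal sketch that uses a plain TCP connection. The function name `is_port_open` and the default timeout are assumptions for illustration.

```python
import socket


def is_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A result of `False` points to the security group, the network path, or a stopped NGINX container. A result of `True` means that the port is reachable, so the problem is more likely in a container on the head node.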
Related resources
AWS documentation
Other AWS resources
- Monitoring dashboard for AWS ParallelCluster (AWS blog post) 
Other resources