Operate - High Performance Computing Lens

Operate

HPCOPS06: How do you monitor your workloads to verify they are operating as expected?

In HPC environments, job-level performance is often one of the most important characteristics, not just for end-user experience but also for cost optimization. This performance can be affected by many factors, and updates or changes can cause unexpected performance impacts. Your environment is also likely to change frequently through small updates, such as to packages and drivers, and the impact of these may go unaccounted for if you only test performance for major upgrades. In such cases, it may be worthwhile to create a procedure that all changes must go through before being implemented, and to add a step to this procedure that tests whether the change impacts your runtime performance.
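A performance-test step in such a change procedure can be as simple as comparing a benchmark's post-change runtime against a recorded baseline with a tolerance threshold. The following sketch illustrates the idea; the function name, baseline values, and 5% tolerance are illustrative assumptions, not part of any AWS tool.

```python
# Hypothetical change-gate check: a change passes only if the benchmark
# runtime stays within a tolerance of the recorded baseline.

def change_passes_gate(baseline_seconds: float, candidate_seconds: float,
                       tolerance: float = 0.05) -> bool:
    """Return True if the candidate run is no more than `tolerance`
    (for example, 5%) slower than the recorded baseline."""
    return candidate_seconds <= baseline_seconds * (1.0 + tolerance)

# Example: a driver update makes the benchmark 8% slower, so the gate fails.
print(change_passes_gate(1000.0, 1080.0))  # False
print(change_passes_gate(1000.0, 1020.0))  # True
```

The threshold should reflect the normal run-to-run variability of your benchmark; a tolerance tighter than that variability will produce false alarms.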

HPCOPS06-BP01 Test and observe job-level performance for every change

Before moving users to a new cluster environment, you should run a set of representative HPC job benchmarks to confirm that your system is performing as expected. To verify that this performance is maintained, consider periodically rerunning these benchmarks or a subset so that any unexpected changes can be localized and investigated early.
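When rerunning a benchmark suite, comparing each benchmark's timing against its own baseline makes it easy to localize which workload regressed. A minimal sketch, assuming timings are kept as name-to-seconds mappings (the benchmark names and values below are hypothetical):

```python
def find_regressions(baseline: dict, current: dict,
                     tolerance: float = 0.05) -> list:
    """Return the names of benchmarks whose current runtime exceeds
    the baseline by more than `tolerance` (fractional slowdown)."""
    return sorted(
        name for name, secs in current.items()
        if name in baseline and secs > baseline[name] * (1.0 + tolerance)
    )

baseline = {"cfd_small": 620.0, "md_lj": 310.0, "fft_3d": 95.0}
current  = {"cfd_small": 640.0, "md_lj": 355.0, "fft_3d": 96.0}
print(find_regressions(baseline, current))  # ['md_lj']
```

A report like this narrows the investigation to the affected benchmark and the changes that landed since its last clean run.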

As your users' projects evolve, the requirements and usage of your HPC environment will also drift from the initial set of jobs on which your representative test cases were built. To keep the performance tests relevant, periodically update your test cases, or consider alternative methods such as monitoring the jobs that your users are running. For example, by monitoring job logs or percent usage by department or user, you can proactively detect anomalies. You can then investigate whether an anomaly was caused by a known change in usage patterns or by an unexplained performance regression. Set alerts and automated responses where appropriate.
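One simple way to flag usage anomalies is to compare each user's current activity against their own history, for example using a standard-deviation threshold. This is a sketch under assumed data shapes (per-user daily job counts), not a prescribed AWS mechanism:

```python
from statistics import mean, stdev

def flag_anomalies(history: dict, today: dict, k: float = 3.0) -> list:
    """Flag users whose job count today deviates from their historical
    mean by more than k standard deviations."""
    flagged = []
    for user, counts in history.items():
        mu, sigma = mean(counts), stdev(counts)
        if abs(today.get(user, 0) - mu) > k * sigma:
            flagged.append(user)
    return sorted(flagged)

history = {"alice": [40, 42, 38, 41, 39], "bob": [5, 7, 6, 5, 6]}
today = {"alice": 40, "bob": 60}  # bob suddenly submits 60 jobs
print(flag_anomalies(history, today))  # ['bob']
```

A flagged user might simply reflect a known project deadline, so the output is a starting point for investigation rather than an automatic verdict.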

Performance regressions can go undetected because they may not throw any errors, but they can result in longer-running jobs and increased cost per job. Consider adding operational mechanisms to track job metrics and building them into a cohesive dashboard. You can use these collected metrics to tune your environment based on real usage, such as rightsizing the throughput and capacity tiers of your file systems, or adding new compute options similar to hardware configurations that are currently oversubscribed.
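Cost per job is a useful dashboard figure because a silent regression shows up there even when no job fails. The aggregation can be sketched as follows; the record fields (`nodes`, `elapsed_s`) and the flat per-node-hour price are simplifying assumptions for illustration:

```python
def job_cost_summary(jobs: list, price_per_node_hour: float) -> dict:
    """Aggregate job records into dashboard-ready figures: total
    node-hours, mean runtime, and an estimated cost per job."""
    total_node_hours = sum(j["nodes"] * j["elapsed_s"] / 3600.0 for j in jobs)
    mean_runtime = sum(j["elapsed_s"] for j in jobs) / len(jobs)
    total_cost = total_node_hours * price_per_node_hour
    return {
        "jobs": len(jobs),
        "node_hours": round(total_node_hours, 2),
        "mean_runtime_s": round(mean_runtime, 1),
        "est_cost_per_job": round(total_cost / len(jobs), 2),
    }

jobs = [
    {"nodes": 4, "elapsed_s": 7200},  # 8 node-hours
    {"nodes": 2, "elapsed_s": 3600},  # 2 node-hours
]
print(job_cost_summary(jobs, price_per_node_hour=1.50))
```

Trending these figures over time, per queue or per instance type, is what turns raw job logs into the rightsizing signals described above.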

Implementation guidance

Log job-level statistics, track anomalies, and integrate your environment logging into a dashboard.

There are a number of options for tracking the operational performance of your HPC environment, which vary in the level of granularity they offer and the operational overhead required to run them. Most HPC schedulers have their own tools to track job-level metrics, and these can be the easiest place to start because they integrate natively with the scheduler.
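With Slurm, for example, job-level metrics can be exported with `sacct --parsable2`, which emits pipe-delimited records. A small parser sketch, assuming the format string `JobID,JobName,Elapsed,State` (the sample job data is fabricated for illustration):

```python
def parse_sacct(output: str) -> list:
    """Parse pipe-delimited `sacct --parsable2` output (assumed columns:
    JobID|JobName|Elapsed|State) into dicts, converting the Elapsed
    field ([DD-]HH:MM:SS) into seconds for easier aggregation."""
    def to_seconds(elapsed: str) -> int:
        days, _, clock = elapsed.rpartition("-")
        h, m, s = (int(x) for x in clock.split(":"))
        return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

    lines = output.strip().splitlines()
    header = lines[0].split("|")
    records = []
    for line in lines[1:]:
        fields = dict(zip(header, line.split("|")))
        fields["ElapsedSeconds"] = to_seconds(fields["Elapsed"])
        records.append(fields)
    return records

sample = """JobID|JobName|Elapsed|State
101|cfd_run|01:30:00|COMPLETED
102|md_run|1-02:00:00|COMPLETED"""
print(parse_sacct(sample))
```

Records in this shape can feed directly into the anomaly checks and cost summaries described earlier.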

If using AWS ParallelCluster with Slurm, leverage Slurm accounting with AWS ParallelCluster to log job-level statistics in an external database. You can then add a method to visualize these metrics so you can easily gain a view across your environment. The ParallelCluster Monitoring dashboard repository is an example of how you can construct a dashboard to track job data. If using AWS Batch, a similar tool is the AWS Batch Runtime Monitoring Dashboards Solution.

A higher-level alternative, or complement, to these methods is to tag cloud resources by project and use AWS cost allocation tags to drill down into cost reports and detect anomalies. This offers a similar effect with lower operational overhead but reduced granularity. For more information, see Organizing and tracking costs using AWS cost allocation tags. Many tools integrate with this mechanism natively to simplify automated tagging; see AWS ParallelCluster resources and tagging and Tag your AWS Batch resources.
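The drill-down itself amounts to grouping cost line items by tag value. The following sketch assumes a hypothetical in-memory record shape for cost report line items; the tag key `Project` and the sample figures are illustrative:

```python
from collections import defaultdict

def cost_by_tag(line_items: list, tag_key: str = "Project") -> dict:
    """Roll up cost line items by the value of a cost allocation tag;
    spend on untagged resources is grouped under '(untagged)'."""
    totals = defaultdict(float)
    for item in line_items:
        value = item.get("tags", {}).get(tag_key, "(untagged)")
        totals[value] += item["cost"]
    return dict(totals)

items = [
    {"cost": 120.0, "tags": {"Project": "aero"}},
    {"cost": 80.0,  "tags": {"Project": "genomics"}},
    {"cost": 15.0,  "tags": {}},
]
print(cost_by_tag(items))  # {'aero': 120.0, 'genomics': 80.0, '(untagged)': 15.0}
```

A persistently large `(untagged)` bucket is itself a useful signal that the automated tagging described above has gaps.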