Troubleshooting job submission failures due to MaxJobCount limit

Problem: Job submissions fail with the following error message:

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying

This error occurs even when the number of running and pending jobs appears to be well below the cluster's job limit.

Cause: The MaxJobCount limit applies to every job record that Slurm tracks, not just running or pending jobs. Completed jobs remain in the controller's memory for a period of time before being purged (5 minutes by default, which corresponds to Slurm's MinJobAge default of 300 seconds). During periods of high job throughput, the combined count of active and recently completed jobs can exceed the limit.

You can verify the total job count by running the following command on a cluster node:

scontrol show jobs | grep -c JobId

This shows the total number of jobs Slurm is tracking, including completed jobs awaiting purge.
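
To put that count next to the configured limit, the check above can be extended as in the following sketch. It assumes that `scontrol show config` on your cluster reports a `MaxJobCount` line. The helper names `count_tracked_jobs` and `read_max_job_count` are illustrative and are not part of Slurm or AWS PCS.

```shell
# Count job records in `scontrol show jobs` output read from stdin.
# grep -c exits non-zero when it finds no matches, so `|| true`
# keeps the function from failing on an empty queue.
count_tracked_jobs() {
  grep -c 'JobId=' || true
}

# Extract the MaxJobCount value from `scontrol show config` output
# read from stdin (lines look like "MaxJobCount = <number>").
read_max_job_count() {
  awk -F'=' '/^MaxJobCount/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Run the comparison only where the Slurm client tools are installed.
if command -v scontrol >/dev/null 2>&1; then
  tracked=$(scontrol show jobs | count_tracked_jobs)
  limit=$(scontrol show config | read_max_job_count)
  echo "Tracked jobs: ${tracked} of MaxJobCount ${limit}"
fi
```

If the tracked count sits close to the limit while few jobs are running or pending, recently completed jobs awaiting purge are the likely cause.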

Solution: Consider one of the following approaches:

Create a larger cluster – If your workload consistently requires more concurrent jobs, create a new cluster with a larger size. For more information about cluster sizes and their limits, see Cluster size in AWS PCS.
Reduce job submission rate – Adjust your job submission scripts to submit jobs at a slower rate, allowing completed jobs time to be purged from Slurm's tracking.
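
One way to slow submissions is to wait for the tracked job count to drop below a threshold before each `sbatch` call. The following is a minimal sketch rather than an AWS-provided script. `MAX_TRACKED` and `POLL_SECONDS` are hypothetical settings with placeholder defaults, and `MAX_TRACKED` should be set somewhat below your cluster's actual job limit.

```shell
# Hypothetical placeholders; tune both values for your cluster.
MAX_TRACKED="${MAX_TRACKED:-200}"    # submit only while fewer jobs than this are tracked
POLL_SECONDS="${POLL_SECONDS:-30}"   # pause between checks when at the threshold

# Number of jobs Slurm currently tracks, including completed jobs awaiting purge.
tracked_jobs() {
  scontrol show jobs | grep -c 'JobId=' || true
}

# Block until the tracked job count is below MAX_TRACKED.
wait_for_capacity() {
  while [ "$(tracked_jobs)" -ge "$MAX_TRACKED" ]; do
    sleep "$POLL_SECONDS"
  done
}

# Submit each job script given as an argument, one at a time,
# checking for capacity before every submission.
submit_throttled() {
  for script in "$@"; do
    wait_for_capacity
    sbatch "$script"
  done
}
```

After sourcing these functions, a call such as `submit_throttled job_*.sh` submits the matching scripts in order and pauses whenever the tracked count reaches the threshold.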
