Slurm REST API frequently asked questions in AWS PCS
This section answers frequently asked questions about the Slurm REST API in AWS PCS.
- What is the Slurm REST API?
-
The Slurm REST API is an HTTP interface that allows you to interact with the Slurm workload manager programmatically. You can use standard HTTP methods like GET, POST, and DELETE to submit jobs, monitor cluster status, and manage resources without requiring command-line access to the cluster.
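As a rough illustration, a request to the API could be built with Python's standard library as sketched below. The controller address 10.0.0.10, the version segment v0.0.41, the /ping path, and the Authorization: Bearer header are assumptions for illustration only. Check the AWS PCS and SchedMD documentation for the values that apply to your cluster.

```python
import urllib.request


def build_request(controller_ip, path, token, method="GET", port=6820):
    """Build (but do not send) an HTTP request for the Slurm REST API.

    The header format is an assumption; confirm it against the PCS docs.
    """
    url = f"http://{controller_ip}:{port}{path}"
    req = urllib.request.Request(url, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Accept", "application/json")
    return req


# Hypothetical address, API version, and token placeholder.
req = build_request("10.0.0.10", "/slurm/v0.0.41/ping", "<your-jwt>")
print(req.get_method(), req.full_url)

# To send the request from a host inside the VPC:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.read().decode())
```

The request is only constructed here, so the sketch can be run anywhere. Sending it requires network access to the controller.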
- Can I use tokens generated by scontrol token?
-
No, standard scontrol token output is not compatible with AWS PCS. The PCS Slurm REST API requires enriched JWT tokens containing specific identity claims that include username (sun), POSIX user ID (uid), and group IDs (gids). Standard Slurm tokens lack these required claims and will be rejected by the API.
- Can I access the API from outside my VPC?
-
No, the REST API endpoint is only accessible from within your VPC using the Slurm controller's private IP address. To enable external access, use AWS services such as Application Load Balancer with VPC Link, or API Gateway. Alternatively, establish VPC peering or VPN connections for secure connectivity.
- Why does the API use HTTP instead of HTTPS?
-
The Slurm REST API is intended to be an internal endpoint within your cluster's private network. For production deployments requiring encryption, you can implement SSL/TLS termination at a higher level in your architecture, such as through an API gateway, load balancer, or reverse proxy.
- How do I control access to the REST API?
-
Configure your cluster's security group rules to restrict access to port 6820 on the Slurm controller. Set inbound rules to allow connections only from trusted IP ranges or specific sources within your VPC, blocking unauthorized access to the API endpoint.
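An inbound rule of this kind could be described in Python as sketched below. The helper name build_ingress_permission and the example CIDR are hypothetical. The dictionary follows the IpPermissions shape that the EC2 AuthorizeSecurityGroupIngress operation accepts, and the sketch only builds and validates the rule without calling AWS.

```python
import ipaddress

SLURM_REST_PORT = 6820  # port named in this FAQ


def build_ingress_permission(cidr, description="Slurm REST API clients"):
    """Return one IpPermissions entry allowing TCP 6820 from an IPv4 CIDR."""
    # Raises ValueError for malformed CIDRs or CIDRs with host bits set.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("this sketch handles IPv4 ranges only")
    if network.prefixlen == 0:
        raise ValueError("refusing to open the API to 0.0.0.0/0")
    return {
        "IpProtocol": "tcp",
        "FromPort": SLURM_REST_PORT,
        "ToPort": SLURM_REST_PORT,
        "IpRanges": [{"CidrIp": str(network), "Description": description}],
    }


# Hypothetical trusted subnet inside the VPC.
rule = build_ingress_permission("10.0.1.0/24")
print(rule)

# With boto3 installed and credentials configured, the rule could be applied
# to a (hypothetical) security group ID like this:
# boto3.client("ec2").authorize_security_group_ingress(
#     GroupId="sg-...", IpPermissions=[rule])
```

Rejecting 0.0.0.0/0 is a design choice of this sketch, reflecting that the endpoint is meant for trusted sources within the VPC.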
- How do I rotate the JWT signing key?
-
Put your cluster in maintenance mode with no active instances, then initiate key rotation through AWS Secrets Manager. After rotation completes, re-enable the queues. All existing JWT tokens will become invalid and must be regenerated using the new signing key from Secrets Manager.
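Regenerating a token after rotation might look like the following minimal sketch, which signs a JWT using only the Python standard library. Several things here are assumptions rather than the documented PCS format: the HS256 algorithm, the flat placement of the sun, uid, and gids claims, the iat and exp claims, and the signing key being available as raw bytes after you retrieve it from Secrets Manager. Follow the PCS documentation for the authoritative token layout.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(signing_key: bytes, username, uid, gids, lifetime_s=3600, now=None):
    """Create an HS256-signed JWT carrying the identity claims named in this FAQ.

    The claim layout is an assumption; adjust it to match the PCS documentation.
    """
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iat": now,
        "exp": now + lifetime_s,
        "sun": username,     # Slurm username
        "uid": uid,          # POSIX user ID
        "gids": list(gids),  # POSIX group IDs
    }
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(signing_key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


# Placeholder key and identity; real values come from Secrets Manager and your cluster.
token = make_token(b"example-key-not-real", "alice", 1000, [1000, 1001])
print(token.count("."))  # a JWT has three dot-separated parts
```

Keeping the lifetime short limits how long a leaked token stays usable; the one-hour default here is arbitrary.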
- Do I need Slurm accounting enabled to use the REST API?
-
No, Slurm accounting is not required for basic REST API operations like job submission and monitoring. However, the entire /slurmdb endpoint requires accounting to be active.
- What third-party tools work with the AWS PCS REST API?
-
Many existing Slurm REST API clients should work with AWS PCS, including Slurm Exporter for Prometheus, SlurmWeb, and custom applications that follow the standard Slurm REST API format. However, tools that rely on scontrol token for authentication will need modification to work with AWS PCS JWT requirements.
- Are there any additional costs for using the REST API?
-
No, there are no additional charges for enabling or using the Slurm REST API feature. You only pay for the underlying cluster resources as usual.
- How can I troubleshoot the REST API?
-
-
Network connectivity issues
If you cannot reach the API endpoint, you'll see connection timeouts or "connection refused" errors when making HTTP requests to the cluster controller.
What to do: Verify your client is in the same VPC or has proper network routing, and confirm your security group allows HTTP traffic on port 6820 from your source IP or subnet.
-
Slurm REST authentication issues
If your JWT token is invalid, expired, or improperly signed, API requests will return "Protocol authentication error" in the errors field of the response.
Example error message:
{
  "errors": [
    {
      "description": "Batch job submission failed",
      "error_number": 1007,
      "error": "Protocol authentication error",
      "source": "slurm_submit_batch_job()"
    }
  ]
}
What to do: Check that your JWT token is properly formatted, not expired, and signed with the correct key from Secrets Manager. Verify that it includes the required claims and that you're using the correct authentication header format.
-
Job failing to run after submission
If your JWT token is valid but contains incorrect internal structure or content, jobs may enter a pending (PD) state with reason code JobHeldAdmin. Use scontrol show job <job-id> to inspect the job. You'll see JobState=PENDING, Reason=JobHeldAdmin, and SystemComment=slurm_cred_create failure, holding job.
What to do: The root cause may be incorrect values in the JWT. Verify that the token is properly structured and includes the required claims as per the PCS documentation.
-
Working directory permission issues
If the user identity specified in your JWT lacks write permissions to the job's working directory, the job will fail with permission errors, similar to using sbatch --chdir with an inaccessible directory.
What to do: Ensure the user specified in your JWT token has appropriate permissions for the job's working directory.
-
Still running into problems?
-
Check SchedMD's documentation on the REST API specification.
-
Check the Slurm controller logs for more detailed information on errors (see Scheduler logs in AWS PCS for more details).
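The network and authentication checks described in the troubleshooting steps can be partly automated. The sketch below tests whether port 6820 is reachable and inspects a token's claims without verifying its signature. The function names are hypothetical, and the check assumes that sun, uid, gids, and exp appear at the top level of the token payload. Adjust it if the PCS documentation specifies a different layout.

```python
import base64
import json
import socket
import time

# Claim names taken from this FAQ; their top-level placement is an assumption.
REQUIRED_CLAIMS = ("sun", "uid", "gids")


def can_connect(host, port=6820, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts and "connection refused"
        return False


def decode_claims(token):
    """Decode the JWT payload WITHOUT verifying the signature (inspection only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def token_problems(token, now=None):
    """List obvious problems: missing identity claims, missing or past expiry."""
    now = time.time() if now is None else now
    claims = decode_claims(token)
    problems = [f"missing claim: {name}" for name in REQUIRED_CLAIMS if name not in claims]
    if "exp" not in claims:
        problems.append("missing claim: exp")
    elif claims["exp"] <= now:
        problems.append("token expired")
    return problems


# Example usage from a host inside the VPC (address and token are placeholders):
# print(can_connect("10.0.0.10"))
# print(token_problems("<your-jwt>"))
```

An empty list from token_problems does not prove the token is valid, because the signature is not checked. It only rules out the most common structural mistakes.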