Reboot a compute node using Slurm in AWS PCS
Use Slurm's native reboot command to resolve performance issues, clear resource problems, or recover from degraded states without loss of EC2 instance capacity.
Prerequisites
- Slurm Admin privileges (root user access)
- Access to a login node in the AWS PCS cluster
Procedure
- Connect to a login node through the EC2 console.
  - In the EC2 console, choose Instances.
  - Select your login node instance.
  - Choose Connect.
- Identify the target compute node name using sinfo or scontrol show node.
    sinfo
    # or
    scontrol show node
- Execute the reboot command using one of these options:
  Warning
  Don't use nextstate=DOWN with the scontrol reboot command. This parameter marks the node as unhealthy and triggers instance replacement.
  - Basic reboot (waits for the node to become idle):
      scontrol reboot nodename
  - Reboot as soon as possible (drains the node, then reboots once its running jobs complete):
      scontrol reboot ASAP nodename
  - Reboot with reason:
      scontrol reboot ASAP reason="troubleshooting" nodename
  - Reboot with resume state:
      scontrol reboot ASAP nextstate=RESUME nodename
- Monitor reboot progress using scontrol show node.
    scontrol show node nodename
- Verify the node returns to service after reboot completion.
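The reboot and monitoring steps above could be wrapped in a few small shell helpers. This is a minimal sketch, not part of Slurm or AWS PCS: the function names node_state, reboot_pending, safe_reboot, and wait_for_node are hypothetical, and the sketch assumes that the State= field printed by scontrol show node contains the substring REBOOT while a reboot is requested or in progress, which can differ between Slurm versions.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not Slurm commands.

# Read `scontrol show node <name>` output on stdin and print the State= value.
node_state() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^State=/) { sub(/^State=/, "", $i); print $i; exit } }'
}

# Succeed while the state string still carries a reboot flag.
# Assumption: pending reboots show up as a REBOOT substring in State=.
reboot_pending() {
  case "$1" in
    *REBOOT*) return 0 ;;
    *) return 1 ;;
  esac
}

# Pass arguments through to `scontrol reboot`, but refuse nextstate=DOWN,
# which would mark the node unhealthy and trigger instance replacement.
safe_reboot() {
  for sr_arg in "$@"; do
    sr_lower=$(printf '%s' "$sr_arg" | tr '[:upper:]' '[:lower:]')
    if [ "$sr_lower" = "nextstate=down" ]; then
      echo "refusing to run scontrol reboot with nextstate=DOWN" >&2
      return 1
    fi
  done
  scontrol reboot "$@"
}

# Poll the node until the reboot flag clears, printing each observed state.
# $1 = node name, $2 = polling interval in seconds (default 30).
wait_for_node() {
  wn_node=$1
  wn_interval=${2:-30}
  while :; do
    wn_state=$(scontrol show node "$wn_node" | node_state)
    echo "$wn_node: $wn_state"
    reboot_pending "$wn_state" || break
    sleep "$wn_interval"
  done
}
```

After sourcing these helpers on the login node, a run might look like safe_reboot ASAP nextstate=RESUME nodename followed by wait_for_node nodename. The last state printed is the one to check when verifying that the node returned to service.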