Troubleshooting AWS PCS cluster version updates
This topic helps you identify and resolve common problems that can occur when updating the scheduler version on a cluster.
Compute nodes fail to connect after update
Common cause
After updating the cluster, newly launched compute nodes fail to register with the controller. The nodes start but never appear in sinfo output, and the compute node group may cycle through instances repeatedly. This occurs when the compute node group's AMI contains a scheduler version that falls outside the compatibility window of the new controller version.
For example, if you update a cluster from 24.11 to 25.11 but the compute node group still uses an AMI with Slurm 23.11, new instances cannot connect because 23.11 is outside the compatibility window of 25.11.
How to diagnose
You can confirm this issue by checking logs in two places:
Scheduler logs
If you have scheduler logging enabled, check the scheduler logs in CloudWatch Logs for errors similar to the following:
error: unpack_header: protocol_version 10240 not supported
error: slurm_unpack_received_msg: [ip-10-0-1-23] Incompatible versions of client and server code
These errors indicate that a compute node with an incompatible scheduler version is trying to connect to the controller. For information about setting up scheduler logging, see Scheduler logs in AWS PCS.
Compute node instance logs
Retrieve the instance console output or connect via Systems Manager and check the bootstrap log for errors similar to the following:
error: _fetch_child: failed to fetch remote configs: Incompatible versions of client and server
error: _establish_configuration: failed to load configs
error: slurmd initialization failed
For more information about retrieving instance logs, see Retrieve instance logs.
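As a rough illustration, the following sketch scans a saved log file for the version-mismatch messages quoted above. The function name `has_version_mismatch` and the log file name are hypothetical, and the patterns are taken only from the sample errors in this topic, so real logs may word them differently.

```shell
# Sketch only: has_version_mismatch is a hypothetical helper, not part of
# the AWS CLI or Slurm. It looks for the two message fragments shown in the
# sample errors above.
has_version_mismatch() {
    # $1: path to a saved log file (scheduler log export or bootstrap log)
    grep -E -q \
        'Incompatible versions of client and server|protocol_version [0-9]+ not supported' \
        "$1"
}

# Example usage, assuming the log was already saved to bootstrap.log:
#   if has_version_mismatch bootstrap.log; then
#       echo "Version mismatch found: check the compute node group AMI"
#   fi
```

One way to produce such a file is to redirect the output of `aws ec2 get-console-output --instance-id <instance-id> --output text` into it, where the instance ID is a placeholder for one of the failing compute nodes.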
Resolution
Update the compute node group to use an AMI that contains a scheduler version within the compatibility window of your new cluster version:
aws pcs update-compute-node-group \
    --cluster-identifier my-cluster \
    --compute-node-group-identifier my-cng \
    --ami-id ami-0123456789abcdef0
To determine which scheduler versions are compatible with your cluster, see Version compatibility.
For information about building custom AMIs with the correct scheduler version, see Amazon Machine Images (AMIs) for AWS PCS.
Update request rejected with ValidationException
Common cause
The UpdateCluster request returns immediately with a ValidationException error indicating the update is not supported. This occurs when:
- The target version is outside the compatibility window of the current version.
- The target version is designated as End of Life (EOL) and is no longer a valid update target.
- The target version is older than or equal to the current version (downgrades are not supported).
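The last condition can be checked locally before you submit the request. The following is a minimal sketch, assuming versions are written as YY.MM (for example, 24.11 or 25.05). The function name `is_newer_version` is hypothetical. The sketch does not check the compatibility window or EOL status, which depend on the supported version list.

```shell
# is_newer_version TARGET CURRENT
# Succeeds (exit status 0) only when TARGET is strictly newer than CURRENT.
# Assumes both arguments have the form YY.MM with no patch component.
is_newer_version() {
    t_major=${1%%.*}; t_minor=${1#*.}
    c_major=${2%%.*}; c_minor=${2#*.}
    if [ "$t_major" -ne "$c_major" ]; then
        # Different release years: the year decides.
        [ "$t_major" -gt "$c_major" ]
    else
        # Same release year: compare the month component.
        [ "$t_minor" -gt "$c_minor" ]
    fi
}

# Example usage:
#   is_newer_version 25.05 24.11 && echo "25.05 is newer than 24.11"
#   is_newer_version 24.11 24.11 || echo "same version, would be rejected"
```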
Resolution
If the target version is outside the compatibility window, perform the update in multiple steps. Each step must target a supported version within the compatibility window. For example, to go from 23.11 to 25.11, first update to 25.05, wait for the cluster to return to ACTIVE, then update to 25.11.
If the target version is EOL, choose a newer supported version instead. For information about supported versions, see Slurm versions in AWS PCS.
Cluster remains in UPDATING state
Common cause
The cluster stays in UPDATING state for longer than expected (more than 20 minutes). This can occur due to transient internal issues during the update process.
Resolution
AWS PCS automatically recovers clusters that are stuck in UPDATING state. If the cluster does not return to ACTIVE or UPDATE_FAILED within 30 minutes, contact AWS Support for assistance.
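If you want to watch for this condition from a script, the following sketch polls the cluster status until it settles or 30 minutes pass. It assumes the AWS CLI is configured and that `aws pcs get-cluster` returns the status at `cluster.status`. The function names, exit codes, and polling interval are illustrative choices, not part of AWS PCS.

```shell
# get_cluster_status CLUSTER
# Prints the current cluster status. Assumes the status is returned at
# cluster.status in the get-cluster response.
get_cluster_status() {
    aws pcs get-cluster --cluster-identifier "$1" \
        --query 'cluster.status' --output text
}

# wait_for_update CLUSTER [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 on ACTIVE, 1 on UPDATE_FAILED, and 2 if neither state is
# reached before the timeout expires.
wait_for_update() {
    cluster=$1
    timeout=${2:-1800}   # 30 minutes, matching the guidance above
    interval=${3:-30}    # seconds between status checks (arbitrary choice)
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        status=$(get_cluster_status "$cluster")
        case $status in
            ACTIVE) return 0 ;;
            UPDATE_FAILED) return 1 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 2
}

# Example usage:
#   wait_for_update my-cluster || echo "Update did not reach ACTIVE"
```

An exit status of 2 corresponds to the case described above, where contacting AWS Support is the next step.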
Cluster enters UPDATE_FAILED state after update
Common cause
The cluster transitions to UPDATE_FAILED state during the update. This can occur when transient service errors prevent the update from completing successfully.
Resolution
Retry the update by submitting the same UpdateCluster request again. Clusters in UPDATE_FAILED state accept new update requests. If the update continues to fail, contact AWS Support.
Compute nodes fail to start due to configuration parsing errors
Common cause
After updating the cluster, compute nodes fail to start and never appear in sinfo output. The compute node instance logs show errors similar to the following:
error: _parse_next_key: Parsing error at unrecognized key: HashPlugin
error: Invalid DebugFlag: AuditRPCs
fatal: Unable to process configuration file
This occurs when compute nodes running scheduler version 23.11 receive configuration from a newer cluster version. Scheduler versions after 23.11 introduced new configuration directives that 23.11, unlike other versions within the compatibility window, cannot parse.
This issue can also occur if your custom AMI uses an AWS PCS agent version older than 1.4.0. Older agent versions do not support automatic version fallback for the compute node daemon.
Resolution
Rebuild your custom AMI with the following requirements:
- Scheduler version 24.05 or later
- AWS PCS agent version 1.4.0 or later
Then update the compute node group to use the new AMI:
aws pcs update-compute-node-group \
    --cluster-identifier my-cluster \
    --compute-node-group-identifier my-cng \
    --ami-id ami-0123456789abcdef0
For information about building custom AMIs, see Amazon Machine Images (AMIs) for AWS PCS. For information about AWS PCS agent versions, see AWS PCS agent versions.
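To check the scheduler requirement on an instance launched from the rebuilt AMI, a sketch like the following may help. It assumes `slurmd -V` prints a line of the form `slurm 24.05.3`. The function name `meets_min_slurm` is hypothetical, and the sketch does not check the AWS PCS agent version.

```shell
# meets_min_slurm "VERSION_LINE"
# VERSION_LINE is the text printed by `slurmd -V`, assumed to look like
# "slurm 24.05.3". Succeeds when the version is 24.05 or later.
meets_min_slurm() {
    version=${1#slurm }      # drop the leading "slurm " label if present
    major=${version%%.*}     # release year, for example 24
    rest=${version#*.}
    minor=${rest%%.*}        # release month, for example 05
    if [ "$major" -ne 24 ]; then
        [ "$major" -gt 24 ]
    else
        [ "$minor" -ge 5 ]
    fi
}

# Example usage on the instance:
#   meets_min_slurm "$(slurmd -V)" || echo "Scheduler is older than 24.05"
```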
Cluster unavailable after update due to invalid QoS configuration
Common cause
After updating to version 25.11, the cluster enters UPDATE_FAILED state or the scheduler becomes unavailable. You cannot submit jobs or run scheduler commands. The scheduler logs show errors similar to the following:
error: Invalid Allow/DenyQOS value: low
fatal: Partition my-queue has an invalid DenyQOS (low), please check your configuration
This occurs when a queue (partition) references a QoS name through AllowQOS, DenyQOS, or QOS settings that does not exist in the Slurm accounting database. Slurm 25.11 introduced stricter validation of QoS references at scheduler startup. Earlier versions allowed references to non-existent QoS names without error.
Resolution
Before updating to version 25.11, verify that all QoS names referenced in your queue configurations exist in the accounting database. Connect to a login node and run the following command to check if a QoS exists:
sacctmgr show qos where name=low format=name
If the QoS does not exist, create it before attempting the update:
sacctmgr add qos low
Alternatively, remove the QoS reference from your queue configuration by updating the queue's Slurm custom settings to remove the AllowQOS, DenyQOS, or QOS parameter before updating the cluster.
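When several queues reference QoS names, checking them one at a time can be tedious. The following sketch compares a list of referenced names against a file of existing names. The function name `missing_qos` and the file name are hypothetical, and the sketch assumes you collect the referenced names from your queue settings yourself and that names are matched exactly as written.

```shell
# missing_qos EXISTING_FILE NAME...
# Prints each NAME that does not appear as a whole line in EXISTING_FILE.
missing_qos() {
    existing=$1
    shift
    for name in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        grep -F -x -q -- "$name" "$existing" || printf '%s\n' "$name"
    done
}

# Example usage on a login node:
#   sacctmgr -n -P show qos format=name > existing-qos.txt
#   missing_qos existing-qos.txt low high
```

Any name the function prints would need to be created, or its reference removed, before the update.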