Troubleshooting AWS PCS cluster version updates

This topic helps you identify and resolve common problems that can occur when updating the scheduler version on a cluster.

Compute nodes fail to connect after update

Common cause

After updating the cluster, newly launched compute nodes fail to register with the controller. The nodes start but never appear in sinfo output, and the compute node group may cycle through instances repeatedly. This occurs when the compute node group's AMI contains a scheduler version that falls outside the compatibility window of the new controller version.

For example, if you update a cluster from 24.11 to 25.11 but the compute node group still uses an AMI with Slurm 23.11, new instances cannot connect because 23.11 is outside the compatibility window of 25.11.

How to diagnose

You can confirm this issue by checking logs in two places:

Scheduler logs

If you have scheduler logging enabled, check the scheduler logs in CloudWatch Logs for errors similar to the following:

error: unpack_header: protocol_version 10240 not supported
error: slurm_unpack_received_msg: [ip-10-0-1-23] Incompatible versions of client and server code

These errors indicate that a compute node with an incompatible scheduler version is trying to connect to the controller. For information about setting up scheduler logging, see Scheduler logs in AWS PCS.

Compute node instance logs

Retrieve the instance console output or connect via Systems Manager and check the bootstrap log for errors similar to the following:

error: _fetch_child: failed to fetch remote configs: Incompatible versions of client and server
error: _establish_configuration: failed to load configs
error: slurmd initialization failed

For more information about retrieving instance logs, see Retrieve instance logs.
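When you have already saved a scheduler log or a bootstrap log to a file, a small filter can flag the version-mismatch signatures quoted above. The following is a minimal sketch, not part of AWS PCS: the function name `has_version_mismatch` is hypothetical, and the patterns are taken only from the sample errors in this topic, so real log lines may differ.

```shell
# Sketch: succeed (exit 0) when the log text on stdin contains one of the
# version-mismatch signatures quoted in this topic.
# The function name is hypothetical; the patterns come from the sample errors.
has_version_mismatch() {
  grep -E 'Incompatible versions of client and server|protocol_version [0-9]+ not supported' >/dev/null
}

# Example usage against a saved log file (the path is a placeholder):
#   has_version_mismatch < bootstrap.log && echo "version mismatch detected"
```

A match suggests the compute node group AMI needs a scheduler version inside the compatibility window. No match does not rule out other causes.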

Resolution

Update the compute node group to use an AMI that contains a scheduler version within the compatibility window of your new cluster version:

aws pcs update-compute-node-group \
    --cluster-identifier my-cluster \
    --compute-node-group-identifier my-cng \
    --ami-id ami-0123456789abcdef0

To determine which scheduler versions are compatible with your cluster, see Version compatibility.

For information about building custom AMIs with the correct scheduler version, see Amazon Machine Images (AMIs) for AWS PCS.

Update request rejected with ValidationException

Common cause

The UpdateCluster request returns immediately with a ValidationException error indicating that the update is not supported. This occurs when any of the following is true:

  • The target version is outside the compatibility window of the current version.

  • The target version is designated as End of Life (EOL) and is no longer a valid update target.

  • The target version is older than or equal to the current version (downgrades are not supported).

Resolution

If the target version is outside the compatibility window, perform the update in multiple steps. Each step must target a supported version within the compatibility window. For example, to go from 23.11 to 25.11, first update to 25.05, wait for the cluster to return to ACTIVE, then update to 25.11.

If the target version is EOL, choose a newer supported version instead. For information about supported versions, see Slurm versions in AWS PCS.

Cluster remains in UPDATING state

Common cause

The cluster stays in UPDATING state for longer than expected (more than 20 minutes). This can occur due to transient internal issues during the update process.

Resolution

AWS PCS automatically recovers clusters that are stuck in UPDATING state. If the cluster does not return to ACTIVE or UPDATE_FAILED within 30 minutes, contact AWS Support for assistance.
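To watch for the cluster leaving the UPDATING state, you can poll the cluster status with the AWS CLI. The following is a minimal sketch under stated assumptions: the function name `wait_for_cluster`, its return codes, and the 30-second polling interval are illustrative choices, and the 1800-second default mirrors the 30-minute guidance above. It relies on `aws pcs get-cluster` returning the status at `cluster.status`.

```shell
# wait_for_cluster CLUSTER [TIMEOUT_SECONDS] [POLL_SECONDS]
# Hypothetical helper: polls the cluster status until it reaches ACTIVE or
# UPDATE_FAILED, or until the timeout elapses.
# Returns 0 for ACTIVE, 1 for UPDATE_FAILED, 2 if the AWS CLI call fails,
# and 3 on timeout.
wait_for_cluster() {
  cluster="$1"
  timeout="${2:-1800}"   # default: 30 minutes
  interval="${3:-30}"    # assumed polling interval, in seconds
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    status=$(aws pcs get-cluster --cluster-identifier "$cluster" \
      --query 'cluster.status' --output text) || return 2
    case "$status" in
      ACTIVE)        echo "ACTIVE"; return 0 ;;
      UPDATE_FAILED) echo "UPDATE_FAILED"; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "TIMEOUT"
  return 3
}

# Example usage:
#   wait_for_cluster my-cluster || echo "cluster did not return to ACTIVE"
```

A timeout result (return code 3) corresponds to the point at which this topic recommends contacting AWS Support.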

Cluster enters UPDATE_FAILED state after update

Common cause

The cluster transitions to UPDATE_FAILED state during the update. This can occur when transient service errors prevent the update from completing successfully.

Resolution

Retry the update by submitting the same UpdateCluster request again. Clusters in UPDATE_FAILED state accept new update requests. If the update continues to fail, contact AWS Support.

Compute nodes fail to start due to configuration parsing errors

Common cause

After updating the cluster, compute nodes fail to start and never appear in sinfo output. The compute node instance logs show errors similar to the following:

error: _parse_next_key: Parsing error at unrecognized key: HashPlugin
error: Invalid DebugFlag: AuditRPCs
fatal: Unable to process configuration file

This occurs when compute nodes running scheduler version 23.11 receive configuration from a newer cluster version. Scheduler versions after 23.11 introduced new configuration directives that 23.11 cannot parse. Unlike other versions within the compatibility window, 23.11 compute nodes cannot connect to a newer cluster because they fatally error on unrecognized configuration keys.

This issue can also occur if your custom AMI uses an AWS PCS agent version older than 1.4.0. Older agent versions do not support automatic version fallback for the compute node daemon.

Resolution

Rebuild your custom AMI with the following requirements:

  • Scheduler version 24.05 or later

  • AWS PCS agent version 1.4.0 or later

Then update the compute node group to use the new AMI:

aws pcs update-compute-node-group \
    --cluster-identifier my-cluster \
    --compute-node-group-identifier my-cng \
    --ami-id ami-0123456789abcdef0

For information about building custom AMIs, see Amazon Machine Images (AMIs) for AWS PCS. For information about AWS PCS agent versions, see AWS PCS agent versions.
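Before you register a rebuilt AMI, you can check the two minimum versions listed above on an instance launched from it. The following is a minimal sketch: `version_ge` and `slurm_version` are hypothetical helper names, the comparison relies on GNU `sort -V`, and the parser assumes that `slurmd -V` prints a line of the form `slurm 24.05.3`. This topic does not describe how to read the installed AWS PCS agent version, so the example takes it as a value you supply.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Assumes GNU sort with -V (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# slurm_version: reads a line such as "slurm 24.05.3" on stdin and prints the
# version field. The "slurm X.Y.Z" output format is an assumption.
slurm_version() {
  awk '{print $2}'
}

# Example usage on an instance built from the new AMI:
#   version_ge "$(slurmd -V | slurm_version)" 24.05 || echo "scheduler too old"
#   version_ge "$AGENT_VERSION" 1.4.0 || echo "agent too old"   # AGENT_VERSION is a placeholder
```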

Cluster unavailable after update due to invalid QoS configuration

Common cause

After updating to version 25.11, the cluster enters UPDATE_FAILED state or the scheduler becomes unavailable. You cannot submit jobs or run scheduler commands. The scheduler logs show errors similar to the following:

error: Invalid Allow/DenyQOS value: low
fatal: Partition my-queue has an invalid DenyQOS (low), please check your configuration

This occurs when a queue (partition) uses its AllowQOS, DenyQOS, or QOS setting to reference a QoS name that does not exist in the Slurm accounting database. Slurm 25.11 introduced stricter validation of QoS references at scheduler startup. Earlier versions allowed references to non-existent QoS names without error.

Resolution

Before updating to version 25.11, verify that all QoS names referenced in your queue configurations exist in the accounting database. Connect to a login node and run the following command to check if a QoS exists:

sacctmgr show qos where name=low format=name

If the QoS does not exist, create it before attempting the update:

sacctmgr add qos low

Alternatively, remove the QoS reference from your queue configuration by updating the queue's Slurm custom settings to remove the AllowQOS, DenyQOS, or QOS parameter before updating the cluster.
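If your queues reference several QoS names, you can check them in one pass from a login node. The following is a minimal sketch: `missing_qos` is a hypothetical helper name, and you supply the list of names yourself from your queue settings. It compares names exactly as written against the output of `sacctmgr show qos`.

```shell
# missing_qos NAME...: prints each NAME that is not present in the Slurm
# accounting database. Returns 2 if sacctmgr itself fails.
# Hypothetical helper; names are compared exactly as written.
missing_qos() {
  existing=$(sacctmgr --noheader --parsable2 show qos format=name) || return 2
  for name in "$@"; do
    printf '%s\n' "$existing" | grep -Fx -- "$name" >/dev/null || printf '%s\n' "$name"
  done
}

# Example usage with the QoS names referenced by your queues:
#   missing_qos low normal high
```

Any name the function prints must either be created with sacctmgr or removed from the queue configuration before you update to version 25.11.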