

# Updating a cluster in AWS PCS
<a name="working-with_clusters_update"></a>

AWS PCS lets you update cluster configurations after creation through the UpdateCluster API or console. You can modify cluster settings without rebuilding your infrastructure, which reduces operational overhead and minimizes interruptions.

## Benefits of cluster updates
<a name="update-cluster-benefits"></a>

Updating an AWS PCS cluster lets you adapt HPC infrastructure to new requirements without rebuilding the cluster. Configuration changes take minutes instead of the hour or more needed to rebuild a cluster. Running jobs continue during an update, which matters for production environments that require minimal downtime and for teams that adjust cluster settings as workload patterns change.

## Supported configuration changes
<a name="update-cluster-supported-settings"></a>

You can modify three main categories of settings:
+ **Accounting configuration** - Enable or disable managed accounting and configure retention settings.
+ **Scale-down behavior** - Adjust the `scaleDownIdleTimeInSeconds` parameter, which controls how long dynamic instances remain idle before AWS PCS automatically terminates them.
+ **Slurm custom settings** - Modify any of the supported Slurm settings that apply at the cluster level, including Prolog, Epilog, and SelectTypeParameters.

## Limitations
<a name="update-cluster-limitations"></a>

You cannot modify certain configurations after cluster creation. These include:
+ Security group configurations
+ VPC subnet selection
+ Cluster size
+ Slurm version
+ Cluster name

These settings are foundational to the cluster's architecture and require creating a new cluster to modify them.

## Prerequisites for cluster updates
<a name="update-cluster-prerequisites"></a>

Before updating a cluster, ensure the following conditions are met:
+ The cluster must be in the `ACTIVE`, `UPDATE_FAILED`, or `SUSPENDED` state.
+ All associated resources (queues and compute node groups) must be in the `ACTIVE` state.
+ You must have IAM permissions for the UpdateCluster operation.
+ No other update operation can be in progress on the cluster.
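
The cluster state prerequisite can be checked from a script before you submit an update. The following is a minimal sketch, not an official tool: `is_updatable_status` and `cluster_status` are hypothetical helper names, `my-cluster` is a placeholder, and the `cluster.status` query path assumes the `get-cluster` response nests the status under a `cluster` object.

```shell
# Sketch: decide whether a cluster status allows an update request.

# Returns 0 when the status is one of the states listed above.
is_updatable_status() {
  case "$1" in
    ACTIVE|UPDATE_FAILED|SUSPENDED) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: read the current status with the AWS CLI.
# Assumes the status is available at cluster.status in the response.
cluster_status() {
  aws pcs get-cluster --cluster-identifier "$1" \
    --query 'cluster.status' --output text
}

# Example usage (my-cluster is a placeholder):
#   status=$(cluster_status my-cluster)
#   if is_updatable_status "$status"; then
#     echo "Cluster can be updated"
#   else
#     echo "Cluster is in $status; wait before updating"
#   fi
```

This check covers only the cluster state. Queue and compute node group states and IAM permissions still need to be verified separately.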

## Update process and job impact
<a name="update-cluster-process"></a>

During an update operation, compute nodes continue to run existing jobs even when the cluster controller becomes briefly unreachable. However, the system cannot accept new job submissions or make scheduling decisions during this period.

You can monitor cluster updates in the console or through the API. The cluster enters the `UPDATING` state when the update starts and then moves to either `ACTIVE` or `UPDATE_FAILED`:
+ `UPDATING` - Update in progress
+ `ACTIVE` - Update completed successfully
+ `UPDATE_FAILED` - Update encountered an error
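
The state transitions above lend themselves to a simple polling loop. The following sketch assumes the `get-cluster` response exposes the status at `cluster.status`. The function name `wait_for_update`, the default 30-second interval, and the attempt limit are illustrative choices, not AWS PCS defaults.

```shell
# Sketch: poll until the cluster leaves the UPDATING state.
# Usage: wait_for_update CLUSTER [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Exit codes: 0 = ACTIVE, 1 = UPDATE_FAILED, 2 = CLI error, 3 = timed out.
wait_for_update() {
  local cluster="$1"
  local interval="${2:-30}"
  local max_attempts="${3:-60}"
  local attempt=1
  local status

  while [ "$attempt" -le "$max_attempts" ]; do
    # Assumes the status is available at cluster.status in the response.
    status=$(aws pcs get-cluster --cluster-identifier "$cluster" \
      --query 'cluster.status' --output text) || return 2

    case "$status" in
      ACTIVE)
        echo "Update completed: $status"
        return 0 ;;
      UPDATE_FAILED)
        echo "Update failed: $status" >&2
        return 1 ;;
      *)
        # Still UPDATING (or another transitional state): wait and retry.
        sleep "$interval" ;;
    esac
    attempt=$((attempt + 1))
  done

  echo "Timed out waiting for $cluster" >&2
  return 3
}

# Example usage (my-cluster is a placeholder):
#   wait_for_update my-cluster 30 60
```

Because the controller can be briefly unreachable during an update, a polling interval of tens of seconds is usually sufficient.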

## Billing during updates
<a name="update-cluster-billing"></a>

Standard hourly charges for your AWS PCS cluster continue during update operations. When you update a cluster to disable accounting, billing for the accounting feature stops as soon as the cluster enters the `UPDATING` state. When enabling accounting, billing doesn't begin until the cluster successfully completes the update and returns to the `ACTIVE` state.

**Topics**
+ [Benefits of cluster updates](#update-cluster-benefits)
+ [Supported configuration changes](#update-cluster-supported-settings)
+ [Limitations](#update-cluster-limitations)
+ [Prerequisites for cluster updates](#update-cluster-prerequisites)
+ [Update process and job impact](#update-cluster-process)
+ [Billing during updates](#update-cluster-billing)
+ [Update an AWS PCS cluster](working-with_clusters_update_procedure.md)
+ [Frequently asked questions about updating clusters in AWS PCS](working-with_clusters_update_faq.md)
+ [Troubleshooting AWS PCS cluster updates](working-with_clusters_update_troubleshooting.md)

# Update an AWS PCS cluster
<a name="working-with_clusters_update_procedure"></a>

Use these steps to modify scheduler settings, accounting configuration, and Slurm custom settings on your cluster. For more information, see [Custom Slurm settings for AWS PCS clusters](slurm-custom-settings-cluster.md).

## Prerequisites
<a name="update-cluster-procedure-prerequisites"></a>
+ The cluster must be in the `ACTIVE`, `UPDATE_FAILED`, or `SUSPENDED` state.
+ All associated resources (queues and compute node groups) must be in the `ACTIVE` state.
+ No other update operation can be in progress on the cluster.

## Procedure
<a name="update-cluster-procedure-steps"></a>

------
#### [ AWS Management Console ]

1. Open the AWS PCS console at [https://console.aws.amazon.com/pcs/](https://console.aws.amazon.com/pcs/).

1. In the navigation pane, choose **Clusters**.

1. Select the cluster to update.

1. Choose **Edit**.

1. On the Edit cluster page, modify the desired settings:
   + Under **Scheduler configuration**, update **Scale-down idle time** to control how long dynamic instances remain idle before automatic termination.
   + Modify **Prolog**, **Epilog**, and **Select-type parameters** settings as needed.
   + Enable, disable, or configure retention time for **managed accounting**.
   + Under **Additional scheduler settings**, add, edit, or remove **Slurm custom settings**. For more information about supported parameters, see [Custom Slurm settings for AWS PCS clusters](slurm-custom-settings-cluster.md).
**Note**  
Fields that cannot be edited display as read-only and show their current values.

1. Choose **Update** to submit the changes.

1. Monitor the cluster status, which shows as "Updating" during the process. The status changes to "Active" when the update completes successfully.

------
#### [ AWS CLI ]

1. Open a terminal or command prompt.

1. Verify the cluster status using the following command:

   ```
   aws pcs get-cluster --cluster-identifier my-cluster
   ```

1. Submit an update request using one of the following examples:
   + To enable managed accounting:

     ```
     aws pcs update-cluster --cluster-identifier my-cluster \
     --slurm-configuration 'accounting={mode=STANDARD}'
     ```
   + To update a Slurm Prolog setting:

     ```
     aws pcs update-cluster --cluster-identifier my-cluster \
     --slurm-configuration \
     'slurmCustomSettings=[{parameterName=Prolog,parameterValue="/path/to/prolog.sh"}]'
     ```
   + To update scale-down idle time:

     ```
     aws pcs update-cluster --cluster-identifier my-cluster \
     --slurm-configuration 'scaleDownIdleTimeInSeconds=300'
     ```

1. Monitor update progress by checking cluster status:

   ```
   aws pcs get-cluster --cluster-identifier my-cluster
   ```

A successful `update-cluster` request returns the cluster object with the requested changes. The cluster status changes from `UPDATING` to `ACTIVE` when the update completes.

------

# Frequently asked questions about updating clusters in AWS PCS
<a name="working-with_clusters_update_faq"></a>

Get answers to common questions about updating cluster configurations in AWS PCS.

**What settings can I modify?**  
You can modify the accounting configuration (enable or disable managed accounting and configure retention), scale-down behavior (the `scaleDownIdleTimeInSeconds` parameter), and any of the supported Slurm custom settings that apply at the cluster level. You cannot modify security groups, VPC subnets, cluster size, Slurm version, or cluster name.

**Can I queue multiple updates?**  
No. Only one update operation can be in progress at a time. Wait for the cluster to leave the `UPDATING` state before you submit another update. All associated resources (queues and compute node groups) must also be in the `ACTIVE` state.

**Can I cancel a cluster update operation?**  
No, you cannot cancel an ongoing cluster update operation.

**Can I submit jobs while my cluster is updating?**  
We recommend that you avoid submitting jobs during cluster updates. The Slurm controller might be unavailable during the update process.

**Will my jobs continue to run during cluster updates?**  
Yes, running jobs continue to execute on compute nodes even when the cluster controller becomes briefly unreachable during the update process. However, job status might not update until the controller becomes available again.

**How is billing affected during updates?**  
Standard hourly charges continue during update operations. When disabling accounting, billing stops when the cluster enters `UPDATING` state. When enabling accounting, billing begins when the cluster successfully returns to `ACTIVE` state.

# Troubleshooting AWS PCS cluster updates
<a name="working-with_clusters_update_troubleshooting"></a>

This topic helps you identify and resolve common problems that can occur when updating cluster configurations.

## Update fails with accounting configuration error
<a name="update-fails-accounting-error"></a>

### Common cause
<a name="accounting-error-cause"></a>

The cluster enters `UPDATE_FAILED` state and the error message indicates an accounting configuration issue. This typically occurs when the accounting configuration is incompatible with the current Slurm version or contains invalid settings.

### Resolution
<a name="accounting-error-resolution"></a>

Review your accounting settings for compatibility with your cluster's Slurm version and submit a corrected update request with valid configuration parameters.
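
To read the error details for a cluster in the `UPDATE_FAILED` state, you can query the cluster description. The following sketch assumes the `get-cluster` response includes a `cluster.errorInfo` list whose entries have `code` and `message` fields. Verify the field names against the API reference before relying on them. `show_update_errors` is a hypothetical helper name.

```shell
# Sketch: print the status and any reported errors for a cluster.
# Assumes cluster.status and cluster.errorInfo[] (with code and message
# fields) are present in the get-cluster response.
show_update_errors() {
  aws pcs get-cluster --cluster-identifier "$1" \
    --query 'cluster.{status: status, errors: errorInfo[].[code, message]}' \
    --output json
}

# Example usage (my-cluster is a placeholder):
#   show_update_errors my-cluster
```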

## Update fails with custom settings error
<a name="update-fails-custom-settings-error"></a>

### Common cause
<a name="custom-settings-error-cause"></a>

The cluster enters `UPDATE_FAILED` state and the error message indicates a Slurm custom settings issue. This occurs when you provide invalid Slurm parameter values or unsupported parameter combinations.

### Resolution
<a name="custom-settings-error-resolution"></a>

Validate your Slurm custom settings against the supported parameters and submit a corrected update request with valid parameter values and combinations.

## Cannot submit update request
<a name="cannot-submit-update-request"></a>

### Common cause
<a name="submit-error-cause"></a>

The update button is disabled in the console or the API returns a 400-level error. This occurs when the cluster is not in an appropriate state, associated resources are not active, or there are validation failures in your configuration.

### Resolution
<a name="submit-error-resolution"></a>

Wait for the cluster and all associated resources to reach `ACTIVE` state, then review your configuration for validation errors before resubmitting the update request.
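
One way to find the resources that block an update is to list the queues and compute node groups that are not `ACTIVE`. The following sketch assumes the `list-queues` and `list-compute-node-groups` responses return items with `name` and `status` fields under `queues` and `computeNodeGroups`. `list_inactive_resources` is a hypothetical helper name.

```shell
# Sketch: list queues and compute node groups that are not ACTIVE.
# Prints one "name<TAB>status" line per resource; prints nothing
# when every resource is ACTIVE.
list_inactive_resources() {
  local cluster="$1"

  # Assumes each queue summary has name and status fields.
  aws pcs list-queues --cluster-identifier "$cluster" \
    --query "queues[?status!='ACTIVE'].[name, status]" --output text

  # Assumes each compute node group summary has name and status fields.
  aws pcs list-compute-node-groups --cluster-identifier "$cluster" \
    --query "computeNodeGroups[?status!='ACTIVE'].[name, status]" --output text
}

# Example usage (my-cluster is a placeholder):
#   list_inactive_resources my-cluster
```

For clusters with many resources, the list operations may paginate, so a single call might not show every queue or compute node group.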

## Validation errors
<a name="validation-errors"></a>

### Common cause
<a name="validation-cause"></a>

The command returns immediately with a 400-level HTTP error and descriptive message. This occurs due to invalid cluster state, resource state, or configuration parameters.

### Resolution
<a name="validation-resolution"></a>

Address the specific validation error mentioned in the response and retry the update operation.