Trying to create a cluster
When you use AWS ParallelCluster version 3.5.0 or later to create a cluster and cluster creation fails with
--rollback-on-failure set to false, use the pcluster describe-cluster CLI command to get status and failure information. In this case, the expected
clusterStatus in the pcluster describe-cluster output is CREATE_FAILED. Check the failures
section of the output to find the failureCode and failureReason. Then, in the following sections, find the matching
failureCode for additional troubleshooting help. For more information, see pcluster describe-cluster.
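For example, a command similar to the following returns the status and failure details. The cluster name, Region, and output values are illustrative, and the output is abbreviated:

    pcluster describe-cluster --cluster-name mycluster --region us-east-1

    {
      "clusterName": "mycluster",
      "clusterStatus": "CREATE_FAILED",
      "failures": [
        {
          "failureCode": "OnNodeConfiguredExecutionFailure",
          "failureReason": "Failed to execute OnNodeConfigured script."
        }
      ]
    }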
In the following sections, we recommend that you check the logs on the head node, such as the /var/log/cfn-init.log and
/var/log/chef-client.log files. For more information about AWS ParallelCluster logs and how to view them, see Key logs for debugging and Retrieving and preserving logs.
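If the cluster's CloudWatch log group is available, you can also archive the cluster logs to an Amazon S3 bucket for later analysis with a command similar to the following. The cluster name, bucket name, and prefix are illustrative:

    pcluster export-cluster-logs --cluster-name mycluster --bucket amzn-s3-demo-bucket --bucket-prefix mycluster-logs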
If you don't have a failureCode, navigate to the AWS CloudFormation console to view the cluster stack. Check the Status Reason for the
HeadNodeWaitCondition or failures on other resources to find additional failure details. For more information, see
View AWS CloudFormation events on CREATE_FAILED.
Check the /var/log/cfn-init.log and /var/log/chef-client.log files on the head node.
If cluster creation fails because of a head node creation failure and the cluster logs aren't available in the cluster log
group, you must retain the cluster on failure by specifying --rollback-on-failure false, and then retrieve
the logs from within the head node itself.
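For example, a creation command that keeps the failed resources in place for debugging might look like the following. The cluster name and configuration file name are illustrative:

    pcluster create-cluster --cluster-name mycluster --cluster-configuration cluster-config.yaml --rollback-on-failure false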
failureCode is OnNodeConfiguredExecutionFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to run.
- How to resolve?
  Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeConfigured script after the Running command runpostinstall message, as in the example that follows this list.
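For example, you can jump to the relevant part of the log on the head node with a command similar to the following. The amount of context printed after the match is an arbitrary choice:

    grep -A 30 "Running command runpostinstall" /var/log/cfn-init.log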
failureCode is OnNodeConfiguredDownloadFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to download.
- How to resolve?
  Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions. A configuration sketch follows this list.
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeConfigured script processing, including downloading, after the Running command runpostinstall message.
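As a reference point, a head node custom action is typically configured as in the following sketch. The script URL and arguments are illustrative, and other required HeadNode settings are omitted; make sure that the head node IAM role, or the URL itself, permits the download:

    HeadNode:
      CustomActions:
        OnNodeConfigured:
          Script: s3://amzn-s3-demo-bucket/scripts/post-install.sh
          Args:
            - arg1
            - arg2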
failureCode is OnNodeConfiguredFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed during cluster deployment. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeConfigured script processing after the Running command runpostinstall message.
failureCode is OnNodeStartExecutionFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to run.
- How to resolve?
  Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeStart script after the Running command runpreinstall message.
failureCode is OnNodeStartDownloadFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to download.
- How to resolve?
  Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions. You can also test the download manually, as in the example that follows this list.
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeStart script processing, including downloading, after the Running command runpreinstall message.
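To confirm that the script can be downloaded at all, you can try fetching it manually from the head node or from another instance in the same subnet. The URLs below are illustrative; use the one from your configuration:

    # Script hosted in Amazon S3
    aws s3 cp s3://amzn-s3-demo-bucket/scripts/pre-install.sh /tmp/pre-install.sh
    # Script hosted on an HTTPS endpoint
    curl -fI https://example.com/scripts/pre-install.sh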
failureCode is OnNodeStartFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed during cluster deployment. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeStart script processing after the Running command runpreinstall message.
failureCode is EbsMountFailure
- Why did it fail?
  The EBS volume defined in the cluster configuration failed to mount.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
failureCode is EfsMountFailure
- Why did it fail?
  The Amazon EFS volume defined in the cluster configuration failed to mount.
- How to resolve?
  If you defined an existing Amazon EFS file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / EfsSettings / FileSystemId. A connectivity check is shown after this list.
  Check the /var/log/chef-client.log file for failure details.
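One way to check connectivity from the head node, if the nc utility is installed, is to test the NFS port (TCP 2049) of the file system. The file system ID and Region in the DNS name are illustrative:

    nc -zv fs-0123456789abcdef0.efs.us-east-1.amazonaws.com 2049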
failureCode is FsxMountFailure
- Why did it fail?
  The Amazon FSx file system defined in the cluster configuration failed to mount.
- How to resolve?
  If you defined an existing Amazon FSx file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / FsxLustreSettings / FileSystemId. See the example after this list.
  Check the /var/log/chef-client.log file for failure details.
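To verify the file system details before checking security group rules, you can query the file system from the AWS CLI. The file system ID is illustrative; for FSx for Lustre, the security groups must allow Lustre traffic (TCP port 988) between the cluster and the file system:

    aws fsx describe-file-systems --file-system-ids fs-0123456789abcdef0 \
      --query "FileSystems[0].{DNSName:DNSName,MountName:LustreConfiguration.MountName}"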
failureCode is RaidMountFailure
- Why did it fail?
  The RAID volumes defined in the cluster configuration failed to mount.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
failureCode is AmiVersionMismatch
- Why did it fail?
  The AWS ParallelCluster version used to create the custom AMI is different than the AWS ParallelCluster version used to configure the cluster. In the CloudFormation console, view the cluster CloudFormation stack details and check the Status Reason for the HeadNodeWaitCondition to get additional details on the AWS ParallelCluster versions and the AMI. For more information, see View AWS CloudFormation events on CREATE_FAILED.
- How to resolve?
  Make sure the AWS ParallelCluster version used to create the custom AMI is the same AWS ParallelCluster version used to configure the cluster. You can change either the custom AMI version or the pcluster CLI version to make them the same. See the example after this list.
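To compare the two versions, you can check the CLI version locally and inspect the AMI metadata. AMIs built with pcluster build-image typically carry the AWS ParallelCluster version in their name, description, and tags; the AMI ID below is illustrative:

    # AWS ParallelCluster CLI version
    pcluster version
    # AMI name, description, and tags
    aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
      --query "Images[0].{Name:Name,Description:Description,Tags:Tags}"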
failureCode is InvalidAmi
- Why did it fail?
  The custom AMI is invalid because it wasn't built using AWS ParallelCluster.
- How to resolve?
  Use the pcluster build-image command to create an AMI, using your AMI as the parent image. For more information, see pcluster build-image. A sketch of this workflow follows this list.
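A minimal build might look like the following sketch. The parent AMI ID, instance type, and file names are illustrative:

    # image-config.yaml
    Build:
      InstanceType: c5.2xlarge
      ParentImage: ami-0123456789abcdef0

Then build the image and use the resulting AMI in your cluster configuration:

    pcluster build-image --image-id my-custom-image --image-configuration image-config.yaml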
failureCode is HeadNodeBootstrapFailure with failureReason Failed to set up the head node.
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed. For example, the cluster might be in protected status, which can be caused by a failure to provision the static compute fleet.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
  Note: If you see the RuntimeError exception Cluster state has been set to PROTECTED mode due to failures detected in static node provisioning, the cluster is in protected status. For more information, see How to debug protected mode. You can search for this message as shown in the example after this list.
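For example, you can search for the protected mode message on the head node with a command similar to the following:

    grep -i "PROTECTED mode" /var/log/chef-client.log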
failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out.
- Why did it fail?
  By default, there is a 30-minute time limit for cluster creation to complete. If cluster creation hasn't completed within this time frame, the cluster creation fails with a timeout error. Cluster creation can time out for different reasons. For example, timeout failures can be caused by a head node creation failure, a network issue, custom scripts that take too long to run on the head node, an error in a custom script that runs on compute nodes, or long wait times for compute node provisioning. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files for failure details. For more information about AWS ParallelCluster logs and how to get them, see Key logs for debugging and Retrieving and preserving logs. You might discover the following in these logs.
  - Seeing Waiting for static fleet capacity provisioning near the end of the chef-client.log
    This indicates that the cluster creation timed out when waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.
  - Seeing OnNodeConfigured or OnNodeStart head node script hasn't finished at the end of the cfn-init.log
    This indicates that the OnNodeConfigured or OnNodeStart custom script took a long time to run and caused a timeout error. Check your custom script for issues that might cause it to run for a long time. If your custom script requires a long time to run, consider changing the timeout limit by adding a DevSettings section to your cluster configuration file, as shown in the following example:
      DevSettings:
        Timeouts:
          HeadNodeBootstrapTimeout: 1800  # default setting: 1800 seconds
  - Can't find the logs, or the head node wasn't created successfully
    It's possible that the head node wasn't created successfully and the logs can't be found. In the CloudFormation console, view the cluster stack details to check for additional failure details.
failureCode is HeadNodeBootstrapFailure with failureReason Failed to bootstrap the head node.
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files.
failureCode is ResourceCreationFailure
- Why did it fail?
  The creation of some resources failed during the cluster creation process. The failure can occur for various reasons. For example, resource creation failures can be caused by capacity issues or a misconfigured IAM policy.
- How to resolve?
  In the CloudFormation console, view the cluster stack to check for additional resource creation failure details. You can also query the stack events from the AWS CLI, as in the example that follows this list.
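For example, the following AWS CLI command lists the resources that failed to create and the reasons. The cluster's CloudFormation stack is named after the cluster; the name below is illustrative:

    aws cloudformation describe-stack-events --stack-name mycluster \
      --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
      --output table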
failureCode is ClusterCreationFailure
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  In the CloudFormation console, view the cluster stack and check the Status Reason for the HeadNodeWaitCondition to find additional failure details.
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files.
Seeing WaitCondition timed out... in CloudFormation stack
For more information, see failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out.
Seeing Resource creation cancelled in CloudFormation stack
For more information, see failureCode is ResourceCreationFailure.
Seeing Failed to run cfn-init... or other errors in the AWS CloudFormation stack
Check the /var/log/cfn-init.log and /var/log/chef-client.log files for additional failure details.
Seeing chef-client.log ends with INFO: Waiting for static fleet capacity provisioning
This is related to a cluster creation timeout while waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.
Seeing Failed to run preinstall or postinstall in cfn-init.log
You have an OnNodeConfigured or OnNodeStart script in the cluster configuration HeadNode section.
The script isn't working correctly. Check the /var/log/cfn-init.log file for custom script error details.
Seeing This AMI was created with xxx, but is trying to be used with xxx... in CloudFormation stack
For more information, see failureCode is AmiVersionMismatch.
Seeing This AMI was not baked by AWS ParallelCluster... in CloudFormation stack
For more information, see failureCode is InvalidAmi.
Seeing pcluster create-cluster command fails to run locally
Check the ~/.parallelcluster/pcluster-cli.log file in your local file system for failure details, as in the following example.
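For example:

    tail -n 50 ~/.parallelcluster/pcluster-cli.log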
Additional support
Follow the troubleshooting guidance in Troubleshooting cluster deployment issues.
Check to see if your scenario is covered in GitHub Known Issues.