Trying to create a cluster
When you use AWS ParallelCluster version 3.5.0 or later to create a cluster and cluster creation fails with
--rollback-on-failure set to false, use the pcluster describe-cluster CLI command to get status and failure information. In this case, the expected
clusterStatus in the pcluster describe-cluster output is CREATE_FAILED. Check the failures
section of the output to find the failureCode and failureReason. Then, in the following sections, find the matching
failureCode for additional troubleshooting help. For more information, see pcluster describe-cluster.
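For example, a command similar to the following returns the status and failure details. The cluster name, Region, and output values are illustrative, and the output is abbreviated:

    pcluster describe-cluster --cluster-name mycluster --region us-east-1

    {
      "clusterName": "mycluster",
      "clusterStatus": "CREATE_FAILED",
      "failures": [
        {
          "failureCode": "OnNodeConfiguredExecutionFailure",
          "failureReason": "Failed to execute OnNodeConfigured script."
        }
      ]
    }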
In the following sections, we recommend that you check the logs on the head node, such as the /var/log/cfn-init.log and
/var/log/chef-client.log files. For more information about AWS ParallelCluster logs and how to view them, see Key logs for debugging and Retrieving and preserving logs.
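If the cluster's CloudWatch log group is available, you can also archive the cluster logs to an Amazon S3 bucket for later analysis with a command similar to the following. The cluster name, bucket name, and prefix are illustrative:

    pcluster export-cluster-logs --cluster-name mycluster --bucket amzn-s3-demo-bucket --bucket-prefix mycluster-logs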
If you don't have a failureCode, navigate to the AWS CloudFormation console to view the cluster stack. Check the Status Reason for the
HeadNodeWaitCondition or failures on other resources to find additional failure details. For more information, see
View AWS CloudFormation events on CREATE_FAILED.
Check the /var/log/cfn-init.log and /var/log/chef-client.log files on the head node.
If cluster creation fails because of a head node creation failure and the cluster logs aren't available in the cluster log
group, you must retain the cluster on failure by specifying --rollback-on-failure false, and then retrieve
the logs from within the head node itself.
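For example, a creation command that keeps the failed resources in place for debugging might look like the following. The cluster name and configuration file name are illustrative:

    pcluster create-cluster --cluster-name mycluster --cluster-configuration cluster-config.yaml --rollback-on-failure false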
failureCode is OnNodeConfiguredExecutionFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to run.
- How to resolve?
  Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeConfigured script after the Running command runpostinstall message, as in the example that follows this list.
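For example, you can jump to the relevant part of the log on the head node with a command similar to the following. The amount of context printed after the match is an arbitrary choice:

    grep -A 30 "Running command runpostinstall" /var/log/cfn-init.log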
failureCode is OnNodeConfiguredDownloadFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed to download.
- How to resolve?
  Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions. A configuration sketch follows this list.
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeConfigured script processing, including downloading, after the Running command runpostinstall message.
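As a reference point, a head node custom action is typically configured as in the following sketch. The script URL and arguments are illustrative, and other required HeadNode settings are omitted; make sure that the head node IAM role, or the URL itself, permits the download:

    HeadNode:
      CustomActions:
        OnNodeConfigured:
          Script: s3://amzn-s3-demo-bucket/scripts/post-install.sh
          Args:
            - arg1
            - arg2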
failureCode is OnNodeConfiguredFailure
- Why did it fail?
  You provided a custom script in OnNodeConfigured of the head node section in the configuration to create a cluster. However, the custom script failed during cluster deployment. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeConfigured script processing after the Running command runpostinstall message.
failureCode is OnNodeStartExecutionFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to run.
- How to resolve?
  Check the /var/log/cfn-init.log file to learn more about the failure and how to fix the issue in your custom script. Near the end of this log, you might see run information related to the OnNodeStart script after the Running command runpreinstall message.
failureCode is OnNodeStartDownloadFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed to download.
- How to resolve?
  Make sure that the URL is valid and that the access is correctly configured. For more information on the configuration of custom bootstrap scripts, see Custom bootstrap actions. You can also test the download manually, as in the example that follows this list.
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeStart script processing, including downloading, after the Running command runpreinstall message.
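To confirm that the script can be downloaded at all, you can try fetching it manually from the head node or from another instance in the same subnet. The URLs below are illustrative; use the one from your configuration:

    # Script hosted in Amazon S3
    aws s3 cp s3://amzn-s3-demo-bucket/scripts/pre-install.sh /tmp/pre-install.sh
    # Script hosted on an HTTPS endpoint
    curl -fI https://example.com/scripts/pre-install.sh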
failureCode is OnNodeStartFailure
- Why did it fail?
  You provided a custom script in OnNodeStart of the head node section in the configuration to create a cluster. However, the custom script failed during cluster deployment. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log file. Near the end of this log, you might see run information related to OnNodeStart script processing after the Running command runpreinstall message.
failureCode is EbsMountFailure
- Why did it fail?
  The EBS volume defined in the cluster configuration failed to mount.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
failureCode is EfsMountFailure
- Why did it fail?
  The Amazon EFS volume defined in the cluster configuration failed to mount.
- How to resolve?
  If you defined an existing Amazon EFS file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / EfsSettings / FileSystemId. A connectivity check is shown after this list.
  Check the /var/log/chef-client.log file for failure details.
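One way to check connectivity from the head node, if the nc utility is installed, is to test the NFS port (TCP 2049) of the file system. The file system ID and Region in the DNS name are illustrative:

    nc -zv fs-0123456789abcdef0.efs.us-east-1.amazonaws.com 2049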
failureCode is FsxMountFailure
- Why did it fail?
  The Amazon FSx file system defined in the cluster configuration failed to mount.
- How to resolve?
  If you defined an existing Amazon FSx file system, make sure that traffic is allowed between the cluster and the file system. For more information, see SharedStorage / FsxLustreSettings / FileSystemId. See the example after this list.
  Check the /var/log/chef-client.log file for failure details.
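To verify the file system details before checking security group rules, you can query the file system from the AWS CLI. The file system ID is illustrative; for FSx for Lustre, the security groups must allow Lustre traffic (TCP port 988) between the cluster and the file system:

    aws fsx describe-file-systems --file-system-ids fs-0123456789abcdef0 \
      --query "FileSystems[0].{DNSName:DNSName,MountName:LustreConfiguration.MountName}"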
failureCode is RaidMountFailure
- Why did it fail?
  The RAID volumes defined in the cluster configuration failed to mount.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
failureCode is AmiVersionMismatch
- Why did it fail?
  The AWS ParallelCluster version used to create the custom AMI is different than the AWS ParallelCluster version used to configure the cluster. In the CloudFormation console, view the cluster CloudFormation stack details and check the Status Reason for the HeadNodeWaitCondition to get additional details on the AWS ParallelCluster versions and the AMI. For more information, see View AWS CloudFormation events on CREATE_FAILED.
- How to resolve?
  Make sure the AWS ParallelCluster version used to create the custom AMI is the same AWS ParallelCluster version used to configure the cluster. You can change either the custom AMI version or the pcluster CLI version to make them the same. See the example after this list.
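To compare the two versions, you can check the CLI version locally and inspect the AMI metadata. AMIs built with pcluster build-image typically carry the AWS ParallelCluster version in their name, description, and tags; the AMI ID below is illustrative:

    # AWS ParallelCluster CLI version
    pcluster version
    # AMI name, description, and tags
    aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
      --query "Images[0].{Name:Name,Description:Description,Tags:Tags}"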
failureCode is InvalidAmi
- Why did it fail?
  The custom AMI is invalid because it wasn't built using AWS ParallelCluster.
- How to resolve?
  Use the pcluster build-image command to create an AMI, using your AMI as the parent image. For more information, see pcluster build-image. A sketch of this workflow follows this list.
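A minimal build might look like the following sketch. The parent AMI ID, instance type, and file names are illustrative:

    # image-config.yaml
    Build:
      InstanceType: c5.2xlarge
      ParentImage: ami-0123456789abcdef0

Then build the image and use the resulting AMI in your cluster configuration:

    pcluster build-image --image-id my-custom-image --image-configuration image-config.yaml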
failureCode is HeadNodeBootstrapFailure with failureReason Failed to set up the head node.
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed. For example, the cluster might be in protected status, which can be caused by a failure to provision the static compute fleet.
- How to resolve?
  Check the /var/log/chef-client.log file for failure details.
  Note: If you see the RuntimeError exception Cluster state has been set to PROTECTED mode due to failures detected in static node provisioning, the cluster is in protected status. For more information, see How to debug protected mode. You can search for this message as shown in the example after this list.
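For example, you can search for the protected mode message on the head node with a command similar to the following:

    grep -i "PROTECTED mode" /var/log/chef-client.log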
failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out.
- Why did it fail?
  By default, there is a 30-minute time limit for cluster creation to complete. If cluster creation hasn't completed within this time frame, the cluster creation fails with a timeout error. Cluster creation can time out for different reasons. For example, timeout failures can be caused by a head node creation failure, a network issue, custom scripts that take too long to run on the head node, an error in a custom script that runs on compute nodes, or long wait times for compute node provisioning. An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files for failure details. For more information about AWS ParallelCluster logs and how to get them, see Key logs for debugging and Retrieving and preserving logs. You might discover the following in these logs.
  - Seeing Waiting for static fleet capacity provisioning near the end of the chef-client.log
    This indicates that the cluster creation timed out when waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.
  - Seeing OnNodeConfigured or OnNodeStart head node script hasn't finished at the end of the cfn-init.log
    This indicates that the OnNodeConfigured or OnNodeStart custom script took a long time to run and caused a timeout error. Check your custom script for issues that might cause it to run for a long time. If your custom script requires a long time to run, consider changing the timeout limit by adding a DevSettings section to your cluster configuration file, as shown in the following example:
      DevSettings:
        Timeouts:
          HeadNodeBootstrapTimeout: 1800  # default setting: 1800 seconds
  - Can't find the logs, or the head node wasn't created successfully
    It's possible that the head node wasn't created successfully and the logs can't be found. In the CloudFormation console, view the cluster stack details to check for additional failure details.
failureCode is HeadNodeBootstrapFailure with failureReason Failed to bootstrap the head node.
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files.
failureCode is ResourceCreationFailure
- Why did it fail?
  The creation of some resources failed during the cluster creation process. The failure can occur for various reasons. For example, resource creation failures can be caused by capacity issues or a misconfigured IAM policy.
- How to resolve?
  In the CloudFormation console, view the cluster stack to check for additional resource creation failure details. You can also query the stack events from the AWS CLI, as in the example that follows this list.
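For example, the following AWS CLI command lists the resources that failed to create and the reasons. The cluster's CloudFormation stack is named after the cluster; the name below is illustrative:

    aws cloudformation describe-stack-events --stack-name mycluster \
      --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
      --output table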
failureCode is ClusterCreationFailure
- Why did it fail?
  An immediate cause can't be determined and additional investigation is needed.
- How to resolve?
  In the CloudFormation console, view the cluster stack and check the Status Reason for the HeadNodeWaitCondition to find additional failure details.
  Check the /var/log/cfn-init.log and /var/log/chef-client.log files.
Seeing WaitCondition timed out... in CloudFormation stack
For more information, see failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out.
Seeing Resource creation cancelled in CloudFormation stack
For more information, see failureCode is ResourceCreationFailure.
Seeing Failed to run cfn-init... or other errors in the AWS CloudFormation stack
Check the /var/log/cfn-init.log and /var/log/chef-client.log files for additional failure details.
Seeing chef-client.log ends with INFO: Waiting for static fleet capacity provisioning
This is related to a cluster creation timeout while waiting for static nodes to power up. For more information, see Seeing errors in compute node initializations.
Seeing Failed to run preinstall or postinstall in cfn-init.log
You have an OnNodeConfigured or OnNodeStart script in the cluster configuration HeadNode section.
The script isn't working correctly. Check the /var/log/cfn-init.log file for custom script error details.
Seeing This AMI was created with xxx, but is trying to be used with xxx... in CloudFormation stack
For more information, see failureCode is AmiVersionMismatch.
Seeing This AMI was not baked by AWS ParallelCluster... in CloudFormation stack
For more information, see failureCode is InvalidAmi.
Seeing pcluster create-cluster command fails to run locally
Check the ~/.parallelcluster/pcluster-cli.log file in your local file system for failure details, as in the following example.
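For example:

    tail -n 50 ~/.parallelcluster/pcluster-cli.log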
Additional support
Follow the troubleshooting guidance in Troubleshooting cluster deployment issues.
Check to see if your scenario is covered in GitHub Known Issues.