View CloudFormation events on CREATE_FAILED Use the CLI to view log streams Re-create the failed cluster with rollback-on-failure

Troubleshooting cluster deployment issues

If your cluster fails to be created and rolls back stack creation, you can look through the log files to diagnose the issue. The failure message likely looks like the following output:


$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
 --cluster-configuration cluster-config.yaml
{
  "cluster": {
    "clusterName": "mycluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
    "region": "eu-west-1",
    "version": "3.15.1",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}

$ pcluster describe-cluster --cluster-name mycluster --region eu-west-1
{
  "creationTime": "2021-09-06T11:03:47.696Z",
  ...
  "cloudFormationStackStatus": "ROLLBACK_IN_PROGRESS",
  "clusterName": "mycluster",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
  "lastUpdatedTime": "2021-09-06T11:03:47.696Z",
  "region": "eu-west-1",
  "clusterStatus": "CREATE_FAILED"
}

Topics

View CloudFormation events on CREATE_FAILED
Use the CLI to view log streams
Re-create the failed cluster with rollback-on-failure

View CloudFormation events on `CREATE_FAILED`

You can use the console or the AWS ParallelCluster CLI to view CloudFormation events on CREATE_FAILED errors to help find the root cause.

Topics

View events in the CloudFormation console
Use the CLI to view and filter CloudFormation events on CREATE_FAILED

View events in the CloudFormation console

To see more information about what caused the "CREATE_FAILED" status, you can use the CloudFormation console.

View CloudFormation error messages from the console.

Log in to the AWS Management Console and navigate to https://console.aws.amazon.com/cloudformation.
Select the stack named cluster_name.
Choose the Events tab.
Check the Status for the resource that failed to create by scrolling through the list of resource events by Logical ID. If a subtask failed to create, work backwards to find the failed resource event.

As an example, if you see the following status message, you must use instance types that won't exceed your current vCPU limit or request more vCPU capacity.


2022-02-04 16:09:44 UTC-0800	HeadNode	CREATE_FAILED	You have requested more vCPU capacity than your current vCPU limit of 0 allows 
     for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. 
     (Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null).

Use the CLI to view and filter CloudFormation events on `CREATE_FAILED`

To diagnose the cluster creation issue, you can use the pcluster get-cluster-stack-events command by filtering for CREATE_FAILED status. For more information, see Filtering AWS CLI output in the AWS Command Line Interface User Guide.


$ pcluster get-cluster-stack-events --cluster-name mycluster --region eu-west-1 \
    --query 'events[?resourceStatus==`CREATE_FAILED`]'
  [
    {
      "eventId": "3ccdedd0-0f03-11ec-8c06-02c352fe2ef9",
      "physicalResourceId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "resourceStatus": "CREATE_FAILED",
      "resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ",
      "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "stackName": "mycluster",
      "logicalResourceId": "mycluster",
      "resourceType": "AWS::CloudFormation::Stack",
      "timestamp": "2021-09-06T11:11:51.780Z"
    },
    {
      "eventId": "HeadNode-CREATE_FAILED-2021-09-06T11:11:50.127Z",
      "physicalResourceId": "i-04e91cc1f4ea796fe",
      "resourceStatus": "CREATE_FAILED",
      "resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe",
      "resourceProperties": "{\"LaunchTemplate\":{\"Version\":\"1\",\"LaunchTemplateId\":\"lt-057d2b1e687f05a62\"}}",
      "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",
      "stackName": "mycluster",
      "logicalResourceId": "HeadNode",
      "resourceType": "AWS::EC2::Instance",
      "timestamp": "2021-09-06T11:11:50.127Z"
    }
  ]

In the previous example, the failure was in the head node setup.

Use the CLI to view log streams

To debug this kind of issue, you can list the log streams available from the head node with the pcluster list-cluster-log-streams by filtering for node-type and then analyzing the log streams content.


$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
--filters 'Name=node-type,Values=HeadNode'
{
  "logStreams": [
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",
      ...
    },
    {
      "logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",
      ...
    },
    ...
  ]
}

The two primary log streams that you can use to find initialization errors are the following:

cfn-init is the log for the cfn-init script. First check this log stream. You're likely to see the Command chef failed error in this log. Look at the lines immediately before this line for more specifics connected with the error message. For more information, see cfn-init.
cloud-init is the log for cloud-init. If you don't see anything in cfn-init, then try checking this log next.

You can retrieve the content of the log stream by using the pcluster get-cluster-log-events (note the --limit 5 option to limit the number of retrieved events):


$ pcluster get-cluster-log-events --cluster-name mycluster \
  --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init \
  --limit 5
{
  "nextToken": "f/36370880979637159565202782352491087067973952362220945409/s",
  "prevToken": "b/36370880752972385367337528725601470541902663176996585497/s",
  "events": [
    {
      "message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "Traceback (most recent call last):\n  File \"/opt/aws/bin/cfn-init\", line 176, in <module>\n    worklog.build(metadata, configSets)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 135, in build\n    Contractor(metadata).build(configSets, self)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 561, in build\n    self.run_config(config, worklog)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 573, in run_config\n    CloudFormationCarpenter(config, self._auth_config).build(worklog)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 273, in build\n    self._config.commands)\n  File \"/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py\", line 127, in apply\n    raise ToolError(u\"Command %s failed\" % name)",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed",
      "timestamp": "2021-09-06T11:11:39.049Z"
    },
    {
      "message": "2021-09-06 11:11:49,212 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-west-1.amazonaws.com",
      "timestamp": "2021-09-06T11:11:49.212Z"
    },
    {
      "message": "2021-09-06 11:11:49,213 [DEBUG] Signaling resource HeadNode in stack mycluster with unique ID i-04e91cc1f4ea796fe and status FAILURE",
      "timestamp": "2021-09-06T11:11:49.213Z"
    }
  ]
}

In the previous example, the failure is caused by a runpostinstall failure, so it is strictly related to the content of the custom bootstrap script used in the OnNodeConfigured configuration parameter of the CustomActions.

Re-create the failed cluster with `rollback-on-failure`

AWS ParallelCluster creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard. If there are no log streams available, the failure might be caused by the CustomActions custom bootstrap script or an AMI-related issue. To diagnose the creation issue in this case, create the cluster again using pcluster create-cluster, including the --rollback-on-failure parameter set to false. Then, use SSH to view the cluster, as shown in the following:


$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \
   --cluster-configuration cluster-config.yaml --rollback-on-failure false
 {
   "cluster": {
     "clusterName": "mycluster",
     "cloudformationStackStatus": "CREATE_IN_PROGRESS",
     "cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",
     "region": "eu-west-1",
     "version": "3.15.1",
     "clusterStatus": "CREATE_IN_PROGRESS"
   }
 } 
 $ pcluster ssh --cluster-name mycluster

After you're logged into the head node, you should find three primary log files that you can use to find the error.

/var/log/cfn-init.log is the log for the cfn-init script. First check this log. You're likely to see an error such as Command chef failed in this log. Look at the lines immediately before this line for more specifics connected with the error message. For more information, see cfn-init.
/var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, then try checking this log next.
/var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Troubleshooting cluster health metrics

Troubleshooting cluster deployment using Terraform

Troubleshooting cluster deployment issues

Topics

View CloudFormation events on CREATE_FAILED

Topics

View events in the CloudFormation console

View CloudFormation error messages from the console.

Use the CLI to view and filter CloudFormation events on CREATE_FAILED

Use the CLI to view log streams

Re-create the failed cluster with rollback-on-failure

View CloudFormation events on `CREATE_FAILED`

Use the CLI to view and filter CloudFormation events on `CREATE_FAILED`

Re-create the failed cluster with `rollback-on-failure`