

# Troubleshooting Amazon FSx for Lustre

This section covers various troubleshooting scenarios and solutions for Amazon FSx for Lustre file systems.

If you encounter problems that aren't listed in the following topics, try asking a question in the [Amazon FSx for Lustre forum](https://forums.aws.amazon.com/forum.jspa?forumID=311).

**Topics**
+ [Creating an FSx for Lustre file system fails](cannot-create-fs.md)
+ [Troubleshooting file system mount issues](mount-troubleshooting.md)
+ [You cannot access your file system](cant-access-fs.md)
+ [Unable to validate access to an S3 bucket when creating a DRA](s3-validation-error.md)
+ [Renaming directories takes a long time](rename-directory-time.md)
+ [Troubleshooting a misconfigured linked S3 bucket](troubleshooting-misconfigured-data-repository.md)
+ [Troubleshooting storage issues](lfs-migrate-ts.md)
+ [Troubleshooting FSx for Lustre CSI driver issues](csi-driver-issues.md)

# Creating an FSx for Lustre file system fails

There are a number of potential causes when a file system creation request fails, as described in the following topics.

## Cannot create an EFA-enabled file system because of misconfigured security group


Creating an FSx for Lustre EFA-enabled file system fails with the following error message:

```
Insufficient security group permissions to create an EFA-enabled file system.
Update security group to allow all internal inbound and outbound traffic.
```

**Action to take**

Make sure that the VPC security group you are using for the creation operation is configured as described in [EFA-enabled security groups](limit-access-security-groups.md#efa-security-groups). An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself and the security group of the clients if clients reside in a different security group.
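
For example, the following AWS CLI calls are a minimal sketch of adding such self-referencing rules; the security group ID `sg-0123456789abcdef0` is a placeholder for your own.

```
# Allow all inbound traffic from the security group itself
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --source-group sg-0123456789abcdef0

# Allow all outbound traffic to the security group itself
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --source-group sg-0123456789abcdef0
```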

## Cannot create a file system because of misconfigured security group


Creating an FSx for Lustre file system fails with the following error message:

```
The file system cannot be created because the default security group in the subnet provided
or the provided security groups do not permit Lustre LNET network traffic on port 988
```

**Action to take**

Make sure that the VPC security group you are using for the creation operation is configured as described in [File system access control with Amazon VPC](limit-access-security-groups.md). You must set up the security group to allow inbound traffic on ports 988 and 1018-1023 from the security group itself or the full subnet CIDR, which is required to allow the file system hosts to communicate with each other.
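
As a sketch (assuming your file system's security group ID is `sg-0123456789abcdef0`), the following AWS CLI calls add the required inbound rules using the security group itself as the source:

```
# Allow inbound Lustre LNET traffic on port 988 from the security group itself
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 988 \
    --source-group sg-0123456789abcdef0

# Allow inbound Lustre traffic on ports 1018-1023 from the security group itself
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 1018-1023 \
    --source-group sg-0123456789abcdef0
```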

## Cannot create a file system because of insufficient capacity errors


You may receive an insufficient capacity error when attempting to create a new file system, update storage capacity, or modify throughput capacity.

**Cause**

This error occurs when FSx for Lustre does not currently have enough available hardware capacity in the requested Availability Zone to fulfill your request.

**Solution**

To resolve the issue, try the following:
+ Wait a few minutes and retry your request, as capacity availability changes frequently.
+ Try your request in a different Availability Zone.
+ Attempt the operation with a smaller storage size or lower throughput level.

## Cannot create a file system that is linked to an S3 bucket


Creating a new file system that is linked to an S3 bucket fails with an error message similar to the following.

```
User: arn:aws:iam::012345678901:user/username is not authorized to perform: iam:PutRolePolicy on resource: resource ARN
```

This error can happen if you try to create a file system linked to an Amazon S3 bucket without the necessary IAM permissions. The required IAM permissions support the Amazon FSx for Lustre service-linked role that is used to access the specified Amazon S3 bucket on your behalf.

**Action to take**

Ensure that your IAM entity (user, group, or role) has the appropriate permissions to create file systems. Doing this includes adding the permissions policy that supports the Amazon FSx for Lustre service-linked role. For more information, see [Adding permissions to use data repositories in Amazon S3](setting-up.md#fsx-adding-permissions-s3).

For more information about service-linked roles, see [Using service-linked roles for Amazon FSx](using-service-linked-roles.md).

# Troubleshooting file system mount issues

There are a number of potential causes when a file system mount command fails, as described in the following topics.

## File system mount fails right away


The file system mount command fails right away. The following code shows an example.

```
mount.lustre: mount fs-0123456789abcdef0.fsx.us-east-1.aws@tcp:/fsx at /lustre
failed: No such file or directory

Is the MGS specification correct?
Is the filesystem name correct?
```

This error can occur if you aren't using the correct `mountname` value when mounting a persistent or scratch 2 file system by using the **mount** command. You can get the `mountname` value from the response of the [describe-file-systems](https://docs.aws.amazon.com/cli/latest/reference/fsx/describe-file-systems.html) AWS CLI command or the [DescribeFileSystems](https://docs.aws.amazon.com/fsx/latest/APIReference/API_DescribeFileSystems.html) API operation.
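
For example, the following sketch (the file system ID, DNS name, and mount point are placeholders) retrieves the `MountName` value with the AWS CLI and then uses it in the mount command:

```
# Retrieve the MountName value for the file system
aws fsx describe-file-systems \
    --file-system-ids fs-0123456789abcdef0 \
    --query 'FileSystems[0].LustreConfiguration.MountName' \
    --output text

# Mount using the returned value in place of mountname
sudo mount -t lustre file_system_dns_name@tcp:/mountname /mnt/fsx
```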

## File system mount hangs and then fails with timeout error


The file system mount command hangs for a minute or two, and then fails with a timeout error. 

The following code shows an example.

```
sudo mount -t lustre file_system_dns_name@tcp:/mountname /mnt/fsx

[2+ minute wait here]
Connection timed out
```

This error can occur because the security groups for the Amazon EC2 instance or the file system aren't configured properly.

**Action to take**

Make sure that your security groups for the file system have the inbound rules specified in [Amazon VPC Security Groups](limit-access-security-groups.md#fsx-vpc-security-groups).
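
To review the inbound rules currently attached to the file system's security group, you can use a command like the following sketch (the security group ID is a placeholder):

```
# List the inbound rules of the file system's security group
aws ec2 describe-security-groups \
    --group-ids sg-0123456789abcdef0 \
    --query 'SecurityGroups[0].IpPermissions'
```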

## Automatic mounting fails and the instance is unresponsive




In some cases, automatic mounting might fail for a file system and your Amazon EC2 instance might stop responding.

This issue can occur if the `_netdev` option wasn't declared. If `_netdev` is missing, your Amazon EC2 instance can stop responding. This result is because network file systems need to be initialized after the compute instance starts its networking.
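
For reference, an `/etc/fstab` entry that includes the `_netdev` option looks similar to the following sketch; replace the DNS name, `mountname`, and mount point with your file system's values.

```
file_system_dns_name@tcp:/mountname /mnt/fsx lustre defaults,noatime,flock,_netdev 0 0
```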

**Action to take**  
If this issue occurs, contact AWS Support.

## File system mount fails during system boot


The file system mount fails during system boot. The mounting is automated using `/etc/fstab`. When the file system doesn't mount, the following error appears in the syslog for the instance boot time frame.

```
LNetError: 3135:0:(lib-socket.c:583:lnet_sock_listen()) Can't create socket: port 988 already in use
LNetError: 122-1: Can't start acceptor on port 988: port already in use
```

This error can occur when port 988 is not available. When the instance is configured to mount NFS file systems, it is possible that the NFS mounts bind their client port to port 988.

**Action to take**

You can work around this problem by tuning the NFS client's `noresvport` and `noauto` mount options where possible.
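
For example, an NFS entry in `/etc/fstab` that uses these options might look like the following sketch (the server name and paths are placeholders):

```
nfs_server_dns_name:/export /mnt/nfs nfs defaults,noresvport,noauto,_netdev 0 0
```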

## File system mount using DNS name fails


Misconfigured Domain Name Service (DNS) names can cause file system mount failures, as shown in the following scenarios.

**Scenario 1:** A file system mount that is using a Domain Name Service (DNS) name fails. The following code shows an example.

```
sudo mount -t lustre file_system_dns_name@tcp:/mountname /mnt/fsx
mount.lustre: Can't parse NID 
'file_system_dns_name@tcp:/mountname'
```

**Action to take**

Check your virtual private cloud (VPC) configuration. If you are using a custom VPC, make sure that DNS settings are enabled. For more information, see [Using DNS with Your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html) in the *Amazon VPC User Guide*. A CLI sketch for checking these DNS settings follows the list below.

To specify a DNS name in the `mount` command, do the following:
+ Ensure that the Amazon EC2 instance is in the same VPC as your Amazon FSx for Lustre file system.
+ Connect your Amazon EC2 instance inside a VPC configured to use the DNS server provided by Amazon. For more information, see [DHCP Options Sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html) in the *Amazon VPC User Guide*.
+ Ensure that the Amazon VPC of the connecting Amazon EC2 instance has DNS host names enabled. For more information, see [Updating DNS Support for Your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-updating) in the *Amazon VPC User Guide*.
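
The following AWS CLI calls are a sketch of checking and enabling those VPC DNS settings; the VPC ID is a placeholder.

```
# Check whether DNS resolution and DNS hostnames are enabled for the VPC
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsHostnames

# Enable them if needed (each attribute is modified in a separate call)
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-support
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-hostnames
```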

**Scenario 2:** A file system mount that is using a Domain Name Service (DNS) name fails. The following code shows an example.

```
mount -t lustre file_system_dns_name@tcp:/mountname /mnt/fsx
mount.lustre: mount file_system_dns_name@tcp:/mountname at /mnt/fsx failed: Input/output error Is the MGS running?
```

**Action to take**

 Make sure that the client's VPC security groups have the correct outbound traffic rules applied. This recommendation holds true especially if you aren't using the default security group, or if you have modified the default security group. For more information, see [Amazon VPC Security Groups](limit-access-security-groups.md#fsx-vpc-security-groups). 

# You cannot access your file system


There are a number of potential causes for being unable to access your file system, each with its own resolution, as follows.

## The Elastic IP address attached to the file system elastic network interface was deleted


Amazon FSx doesn't support accessing file systems from the public Internet. Amazon FSx automatically detaches any Elastic IP address, which is a public IP address reachable from the Internet, that gets attached to a file system's elastic network interface.

## The file system elastic network interface was modified or deleted


You must not modify or delete the file system's elastic network interface. Modifying or deleting the network interface can cause a permanent loss of connection between your VPC and your file system. If the network interface was modified or deleted, create a new file system, and do not modify or delete its elastic network interface. For more information, see [File system access control with Amazon VPC](limit-access-security-groups.md).

# Unable to validate access to an S3 bucket when creating a DRA

Creating a data repository association (DRA) from the Amazon FSx console or using the `create-data-repository-association` CLI command ([CreateDataRepositoryAssociation](https://docs.aws.amazon.com/fsx/latest/APIReference/API_CreateDataRepositoryAssociation.html) is the equivalent API action) fails with the following error message.

```
Amazon FSx is unable to validate access to the S3 bucket. Ensure the IAM role or user
you are using has s3:Get*, s3:List* and s3:PutObject permissions to the S3 bucket prefix.
```

**Note**  
You can also get the preceding error when creating a Scratch 1, Scratch 2, or Persistent 1 file system that is linked to a data repository (S3 bucket or prefix) using the Amazon FSx console or the `create-file-system` CLI command ([CreateFileSystem](https://docs.aws.amazon.com/fsx/latest/APIReference/API_CreateFileSystem.html) is the equivalent API action).

**Action to take**

If the FSx for Lustre file system is in the same account as the S3 bucket, this error means the IAM role you used for the create request doesn't have the necessary permissions to access the S3 bucket. Make sure the IAM role has the permissions listed in the error message. These permissions support the Amazon FSx for Lustre service-linked role that is used to access the specified Amazon S3 bucket on your behalf.

If the FSx for Lustre file system is in a different account than the S3 bucket (the cross-account case), in addition to making sure that the IAM role you used has the required permissions, the S3 bucket policy must be configured to allow access from the account in which the FSx for Lustre file system is created.

For more information about S3 cross-account bucket permissions, see [Example 2: Bucket owner granting cross-account bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) in the *Amazon Simple Storage Service User Guide*.
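
As an illustration only (the account ID `111122223333`, the bucket name, and the statement are placeholders; scope the policy to your own security requirements), a bucket policy that grants another account the permissions named in the error message could be applied as follows:

```
aws s3api put-bucket-policy --bucket amzn-s3-demo-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFSxAccountAccess",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
      "Action": ["s3:Get*", "s3:List*", "s3:PutObject"],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}'
```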

# Renaming directories takes a long time


**Question**

I renamed a directory on a file system that is linked to an Amazon S3 bucket and has automatic export enabled. Why are the files inside this directory taking a long time to be renamed in the S3 bucket?

**Answer**

When you rename a directory on the file system, FSx for Lustre creates new S3 objects for all of the files and directories inside the directory that was renamed. The amount of time it takes to propagate the directory rename to S3 is directly correlated to the quantity of files and directories that are descendants of the directory being renamed.

# Troubleshooting a misconfigured linked S3 bucket

In some cases, an FSx for Lustre file system's linked S3 bucket might have a misconfigured data repository lifecycle state. 

**Possible cause**

This error can occur if Amazon FSx does not have the necessary AWS Identity and Access Management (IAM) permissions that are required to access the linked data repository. The required IAM permissions support the Amazon FSx for Lustre service-linked role that is used to access the specified Amazon S3 bucket on your behalf.

**Action to take**

1. Ensure that your IAM entity (user, group, or role) has the appropriate permissions to create file systems. Doing this includes adding the permissions policy that supports the Amazon FSx for Lustre service-linked role. For more information, see [Adding permissions to use data repositories in Amazon S3](setting-up.md#fsx-adding-permissions-s3).

1. Using the Amazon FSx CLI or API, refresh the file system's `AutoImportPolicy` with the `update-file-system` CLI command ([UpdateFileSystem](https://docs.aws.amazon.com/fsx/latest/APIReference/API_UpdateFileSystem.html) is the equivalent API action), as follows.

   ```
   aws fsx update-file-system \
   --file-system-id fs-0123456789abcdef0 \
   --lustre-configuration AutoImportPolicy=the_existing_AutoImportPolicy
   ```

For more information about service-linked roles, see [Using service-linked roles for Amazon FSx](using-service-linked-roles.md).

**Possible Cause**

This error can occur if the linked Amazon S3 data repository has an existing event notification configuration with event types that overlap with the Amazon FSx event notification configuration (`s3:ObjectCreated:*`, `s3:ObjectRemoved:*`).

This can also occur if the Amazon FSx event notification configuration on the linked S3 bucket was deleted or modified.

**Action to take**

1. Remove any existing event notification on the linked S3 bucket that uses either or both of the event types that the FSx event configuration uses, `s3:ObjectCreated:*` and `s3:ObjectRemoved:*`.

1. Ensure that there is an S3 event notification configuration on your linked S3 bucket with the name `FSx`, event types `s3:ObjectCreated:*` and `s3:ObjectRemoved:*`, and the SNS topic with ARN `topic_arn_returned_in_API_response` as the destination. (You can inspect the bucket's current configuration with the CLI sketch following this procedure.)

1. Reapply the FSx event notification configuration on the S3 bucket by using the Amazon FSx CLI or API, to refresh the file system's `AutoImportPolicy`. Do so with the `update-file-system` CLI command ([UpdateFileSystem](https://docs.aws.amazon.com/fsx/latest/APIReference/API_UpdateFileSystem.html) is the equivalent API action), as follows.

   ```
   aws fsx update-file-system \
   --file-system-id fs-0123456789abcdef0 \
   --lustre-configuration AutoImportPolicy=the_existing_AutoImportPolicy
   ```
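
To inspect the event notification configuration that is currently on the linked bucket (a sketch; the bucket name is a placeholder), you can run the following command:

```
# Show the bucket's current event notification configuration, including any FSx entry
aws s3api get-bucket-notification-configuration --bucket amzn-s3-demo-bucket
```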

# Troubleshooting storage issues

In some cases, you may experience storage issues with your file system. You can troubleshoot these issues by using `lfs` commands, such as the `lfs migrate` command.

## Write error due to no space on storage target


You can check the storage usage of your file system by using the `lfs df -h` command, as described in [File system storage layout](performance.md#storage-layout). The `filesystem_summary` field reports the total file system storage usage.
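
For example, `lfs df -h` output looks similar to the following illustrative sketch (the UUIDs and sizes are placeholders); the `filesystem_summary` line shows overall usage, and in this example one OST is nearly full while the other is almost empty.

```
UUID                       bytes        Used   Available Use% Mounted on
mountname-MDT0000_UUID     68.7G        5.4M       68.6G   0% /fsx[MDT:0]
mountname-OST0000_UUID      1.1T        1.0T      109.4G  91% /fsx[OST:0]
mountname-OST0001_UUID      1.1T        4.5M        1.1T   0% /fsx[OST:1]

filesystem_summary:         2.2T        1.0T        1.2T  45% /fsx
```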

If the file system disk usage is 100%, consider increasing the storage capacity of your file system. For more information, see [Managing storage capacity](managing-storage-capacity.md).

If the file system storage usage is not 100% and you still get write errors, the file you are writing to may be striped on an OST that is full.

**Action to take**
+ If many of your OSTs are full, increase the storage capacity of your file system. Check for unbalanced storage on OSTs by following the actions of the [Unbalanced storage on OSTs](#lfs-migrate-unbalanced-storage) section.
+ If your OSTs are not full, tune the client dirty page buffer size by applying the following setting on all your client instances:

  ```
  sudo lctl set_param osc.*.max_dirty_mb=64
  ```

## Unbalanced storage on OSTs


Amazon FSx for Lustre distributes new file stripes evenly across OSTs. However, your file system may still become unbalanced due to I/O patterns or file storage layout. As a result, some storage targets can become full while others remain relatively empty.

You can use the `lfs migrate` command to move files or directories from more-full to less-full OSTs, in either block or non-block mode.
+ **Block mode** is the default mode for the `lfs migrate` command. When run in block mode, `lfs migrate` first acquires a group lock on the files and directories before data migration to prevent modifications to the files, then releases the lock when migration completes. By preventing other processes from modifying the files, block mode prevents these processes from interrupting the migration. The downside is that preventing an application from modifying a file may result in delays or errors for the application.
+ **Non-block mode** is enabled for the `lfs migrate` command with the `-n` option. When running `lfs migrate` in non-block mode, other processes can still modify the files that are being migrated. If a process modifies a file before `lfs migrate` finishes migrating it, `lfs migrate` will fail to migrate that file, leaving the file with its original stripe layout.

We recommend you use non-block mode, as it is less likely to interfere with your application.
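
As a sketch of the two modes on a single file (the path and striping options are placeholders; the scripts under **Action to take** automate this across many files):

```
# Block mode (default): the file is group-locked while it is migrated
sudo lfs migrate -E 100M -c 1 -E -1 -c 4 /mnt/fsx/path/to/file

# Non-block mode (-n): other processes can still modify the file, but the
# migration of this file fails if it is modified before migration completes
sudo lfs migrate -n -E 100M -c 1 -E -1 -c 4 /mnt/fsx/path/to/file
```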

**Action to take**

1. Launch a relatively large client instance (such as the Amazon EC2 `c5n.4xlarge` instance type) to mount to the file system.

1. Before running the non-block mode script or the block mode script, first run the following commands on each client instance to speed up the process:

   ```
   sudo lctl set_param 'mdc.*.max_rpcs_in_flight=60'
   sudo lctl set_param 'mdc.*.max_mod_rpcs_in_flight=59'
   ```

1. Start a screen session and run the non-block mode script or the block mode script. Make sure to change the appropriate variables in the scripts:
   + Non-block mode script:

     ```
     #!/bin/bash
     
     # UNCOMMENT THE FOLLOWING LINES:
     #
     # TRY_COUNT=0
     # MAX_MIGRATE_ATTEMPTS=100
     # OSTS="fsname-OST0000_UUID"
     # DIR_OR_FILE_MIGRATED="/mnt/subdir/"
     # BATCH_SIZE=10
     # PARALLEL_JOBS=16 # up to max-procs processes, set to 16 if client is c5n.4xlarge with 16 vcpu
     # LUSTRE_STRIPING_CONFIG="-E 100M -c 1 -E 10G -c 8 -E 100G -c 16 -E -1 -c 32" # should be consistent with the existing striping setup
     #
     
     if [ -z "$TRY_COUNT" -o -z "$MAX_MIGRATE_ATTEMPTS" -o -z "$OSTS" -o -z "$DIR_OR_FILE_MIGRATED" -o -z "$BATCH_SIZE" -o -z "$PARALLEL_JOBS" -o -z "$LUSTRE_STRIPING_CONFIG" ]; then
         echo "Some variables are not set."
         exit 1
     fi
     
     echo "lfs migrate starts"
     while true; do
         output=$(sudo lfs find ! -L released --ost $OSTS --print0 $DIR_OR_FILE_MIGRATED | shuf -z | /bin/xargs -0 -P $PARALLEL_JOBS -n $BATCH_SIZE sudo lfs migrate -n $LUSTRE_STRIPING_CONFIG 2>&1)
         # Capture the exit status immediately; the [[ ]] tests below would otherwise overwrite $?
         rc=$?
         if [[ $rc -eq 0 ]]; then
             echo "lfs migrate succeeds for $DIR_OR_FILE_MIGRATED at the $TRY_COUNT attempt, exiting."
             exit 0
         elif [[ $rc -eq 123 ]]; then
             echo "WARN: Target data objects are not located on these OSTs. Skipping lfs migrate"
             exit 1
         else
             echo "lfs migrate fails for $DIR_OR_FILE_MIGRATED at the $TRY_COUNT attempt, retrying..."
             if (( ++TRY_COUNT >= MAX_MIGRATE_ATTEMPTS )); then
                 echo "WARN: Exceeds max retry attempt. Skipping lfs migrate for $DIR_OR_FILE_MIGRATED. Failed with the following error"
                 echo $output
                 exit 1
             fi
         fi
     done
     ```
   + Block mode script:
     + Replace the values in `OSTS` with the values of your OSTs.
     + Provide an integer value to `nproc` to set the number of max-procs processes to run in parallel. For example, the Amazon EC2 `c5n.4xlarge` instance type has 16 vCPUs, so you can use `16` (or a value < 16) for `nproc`.
     + Provide your mount directory path in `mnt_dir_path`.

     ```
     # find all OSTs with usage above a certain threshold; for example, greater than or equal to 85% full 
     for OST in $(lfs df -h |egrep '( 8[5-9]| 9[0-9]|100)%'|cut -d' ' -f1); do echo ${OST};done|tr '\012' ','
     
     # you can also pass OST values directly to the OSTS variable
     OSTS='dzfevbmv-OST0000_UUID,dzfevbmv-OST0002_UUID,dzfevbmv-OST0004_UUID,dzfevbmv-OST0005_UUID,dzfevbmv-OST0006_UUID,dzfevbmv-OST0008_UUID'
     
     nproc=<Run up to max-procs processes if client is c5n.4xlarge with 16 vcpu, this value can be set to 16>
     
     mnt_dir_path=<mount dir, e.g. '/my_mnt'>
     
     lfs find ${mnt_dir_path} --ost ${OSTS}| xargs -P ${nproc} -n2 lfs migrate -E 100M -c 1 -E 10G -c 8 -E 100G -c 16 -E -1 -c 32
     ```

**Notes**
+ If you notice an impact on the read performance of the file system, you can stop the migrations at any time by using `ctrl-c` or `kill -9`, reduce the number of threads (the `nproc` value) to a lower number (such as 8), and then resume migrating files.
+ The `lfs migrate` command fails on a file that is also open in the client workload. It throws an error and moves to the next file; therefore, if many files are being accessed, the script might not be able to migrate any files, which appears as the migration making very slow progress.
+ You can monitor OST usage using either of the following methods:
  + On client mount, run the following command to monitor OST usage and find the OST with usage greater than 85%:

    ```
    lfs df -h |egrep '( 8[5-9]| 9[0-9]|100)%'
    ```
  + Check the `Minimum` statistic of the Amazon CloudWatch OST `FreeDataStorageCapacity` metric. If your script is finding OSTs that are more than 85% full, then when the metric is close to 15%, use `ctrl-c` or `kill -9` to stop the migration.
+ You may also consider changing the stripe configuration of your file system or a directory, so that new files are striped across multiple storage targets, as shown in the sketch following this list. For more information, see [Striping data in your file system](performance.md#striping-data).
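
For example, a progressive file layout similar to the one used in the preceding scripts can be applied to a directory with a command like the following sketch (the extent sizes, stripe counts, and path are placeholders and should match your workload):

```
# New files created under this directory inherit the progressive file layout
sudo lfs setstripe -E 100M -c 1 -E 10G -c 8 -E 100G -c 16 -E -1 -c 32 /mnt/fsx/dir
```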

# Troubleshooting FSx for Lustre CSI driver issues

Amazon FSx for Lustre supports access from containers running on Amazon EKS using the open-source FSx for Lustre CSI driver. For deployment information, see [Use Amazon FSx for Lustre Storage](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html) in the *Amazon EKS User Guide*.

If you’re experiencing issues with the FSx for Lustre CSI driver for containers running on Amazon EKS, see [Troubleshooting CSI Driver (Common Issues)](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/docs/troubleshooting.md), which is available on GitHub.
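
As a first check (a sketch; it assumes the driver is installed in the `kube-system` namespace and that `fsx-claim` is the name of a problematic PersistentVolumeClaim), you can verify that the driver pods are running and inspect volume claim events:

```
# Check that the FSx for Lustre CSI driver pods are running
kubectl get pods -n kube-system | grep fsx

# Inspect events for a PersistentVolumeClaim that is not binding
kubectl describe pvc fsx-claim
```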