# Managing high-availability (HA) pairs
<a name="HA-pairs"></a>

Each FSx for ONTAP file system is powered by one or more high-availability (HA) pairs of file servers in an active-standby configuration. In this configuration, there is a preferred file server that actively serves traffic and a secondary file server that takes over if the active server is unavailable. FSx for ONTAP first-generation file systems are powered by one HA pair, which delivers up to 4 GBps of throughput capacity and 160,000 SSD IOPS. FSx for ONTAP second-generation Multi-AZ file systems are also powered by one HA pair, which delivers up to 6 GBps of throughput capacity and 200,000 SSD IOPS. FSx for ONTAP second-generation Single-AZ file systems are powered by up to 12 HA pairs, which can deliver up to 72 GBps of throughput capacity and 2,400,000 SSD IOPS (6 GBps of throughput capacity and 200,000 SSD IOPS per HA pair).

When you create your file system from the Amazon FSx console, Amazon FSx recommends the number of HA pairs that you should use based on your desired SSD storage. You can also manually choose the number of HA pairs based on your workload and performance requirements. We recommend that you use a single HA pair if your file system requirements are satisfied by up to 6 GBps of throughput capacity and 200,000 SSD IOPS, and multiple HA pairs if your workloads need higher levels of performance scalability.
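
The sizing guidance above can be sketched as a small calculation. This is an illustration only, not how the console derives its recommendation (the console bases it on SSD storage). It uses the per-HA-pair limits quoted above for second-generation Single-AZ file systems; the function name `min_ha_pairs` is hypothetical.

```python
import math

# Per-HA-pair limits quoted above for second-generation file systems.
MAX_GBPS_PER_PAIR = 6
MAX_SSD_IOPS_PER_PAIR = 200_000
MAX_HA_PAIRS = 12  # second-generation Single-AZ limit


def min_ha_pairs(required_gbps: float, required_ssd_iops: int) -> int:
    """Return the smallest HA pair count that covers both requirements."""
    pairs_for_throughput = math.ceil(required_gbps / MAX_GBPS_PER_PAIR)
    pairs_for_iops = math.ceil(required_ssd_iops / MAX_SSD_IOPS_PER_PAIR)
    pairs = max(1, pairs_for_throughput, pairs_for_iops)
    if pairs > MAX_HA_PAIRS:
        raise ValueError("requirements exceed what 12 HA pairs can deliver")
    return pairs


print(min_ha_pairs(4, 150_000))    # 1
print(min_ha_pairs(20, 300_000))   # 4
print(min_ha_pairs(6, 1_000_000))  # 5
```

The result is a lower bound from performance limits alone; storage capacity and growth plans may call for more HA pairs.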

Each HA pair has one aggregate, which is a logical set of physical disks. 

**Note**  
You can add HA pairs to second-generation Single-AZ file systems. For more information, see [Adding high-availability (HA) pairs](adding-HA-pairs.md). Otherwise, you can migrate data between file systems (with different HA pairs) using SnapMirror, AWS DataSync, or by restoring your data from a backup to a new file system. 

# Adding high-availability (HA) pairs
<a name="adding-HA-pairs"></a>

FSx for ONTAP file systems are composed of one or more HA pairs of file servers. First-generation file systems and second-generation Multi-AZ file systems support one HA pair whereas second-generation Single-AZ file systems support up to 12 HA pairs. You can also add more HA pairs after creating a second-generation Single-AZ file system (up to the maximum of 12). Adding HA pairs isn't disruptive and typically takes only a few minutes to complete.

Consider the following points when adding HA pairs to your file system:
+ Adding HA pairs to your file system introduces new file servers with their own storage (or aggregate). The new HA pairs have the same throughput capacity and storage capacity as your file system's existing HA pairs. For example, assume that your file system has two HA pairs with a total of 12 GBps of throughput capacity and 2 tebibytes (TiB) of SSD storage. If you add one new HA pair, then your file system will have 18 GBps of throughput capacity and 3 TiB of SSD storage. 
+ To benefit from the additional performance of the new HA pairs, you need to move some of your existing volumes to the new HA pairs and remount clients to connect to them. For more information, see [Balancing workloads across HA pairs](monitor-workload-balance.md).
+ You can't modify your file system's throughput capacity, SSD storage capacity, or provisioned SSD IOPS when adding HA pairs or while an update to add HA pairs is in progress.
+ You can't remove HA pairs after you add them. We recommend scaling the throughput capacity of your file system if you need more performance temporarily (assuming that your file system isn't at the highest throughput capacity). This increases the throughput capacity of your file system's existing HA pairs. 
+ The iSCSI protocol is available on file systems that have six or fewer HA pairs. The NVMe/TCP protocol is available on second-generation file systems that have six or fewer HA pairs. For more information, see [Accessing your FSx for ONTAP data](supported-fsx-clients.md).
+ When you add new HA pairs to your file system, the NVMe cache is enabled by default for the new file system nodes. We recommend disabling it for throughput-heavy workloads. For more information, see [Managing the NVMe cache](nvme-cache.md).
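
The capacity arithmetic in the first point, together with the six-pair protocol limit, can be sketched as follows. This is a minimal illustration that assumes every HA pair has the same throughput and SSD storage, as described above; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSystemTotals:
    ha_pairs: int
    throughput_gbps: float
    ssd_tib: float


def after_adding_pairs(current: FileSystemTotals, pairs_to_add: int) -> FileSystemTotals:
    """Scale totals, assuming each new pair matches the existing per-pair size."""
    new_pairs = current.ha_pairs + pairs_to_add
    if pairs_to_add < 1 or new_pairs > 12:
        raise ValueError("HA pairs can only be added, up to a maximum of 12")
    per_pair_gbps = current.throughput_gbps / current.ha_pairs
    per_pair_tib = current.ssd_tib / current.ha_pairs
    return FileSystemTotals(new_pairs, per_pair_gbps * new_pairs, per_pair_tib * new_pairs)


def block_protocols_available(ha_pairs: int) -> bool:
    """iSCSI and NVMe/TCP are listed above as available with six or fewer HA pairs."""
    return ha_pairs <= 6


# The example from the text: two HA pairs, 12 GBps, 2 TiB, plus one new pair.
print(after_adding_pairs(FileSystemTotals(2, 12, 2), 1))
print(block_protocols_available(7))  # False
```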

**To add HA pairs**

1. Open the Amazon FSx console at [https://console.aws.amazon.com/fsx/](https://console.aws.amazon.com/fsx/).

1. To display the file system details page, in the left navigation pane, choose **File systems**, and then choose the FSx for ONTAP file system that you want to update.

1. On the **Summary** panel, for **Number of HA pairs**, choose **Update**.

1. From the **HA Pairs** dropdown, select the number of HA pairs that you want to add to your file system.

1. Choose the **Update** button.

After you add HA pairs, it's important to rebalance your existing data to ensure that your I/O remains evenly distributed across your file system's HA pairs. For more information, see [Balancing workloads across HA pairs](monitor-workload-balance.md).
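
The console procedure above can also be expressed as an API request. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and assuming that the `UpdateFileSystem` operation accepts an `HAPairs` value inside `OntapConfiguration`. Confirm the parameter names, and whether other fields are required, against the current Amazon FSx API reference before relying on it. The file system ID is a placeholder and the helper names are hypothetical.

```python
def build_add_ha_pairs_request(file_system_id: str, current_pairs: int, target_pairs: int) -> dict:
    """Build keyword arguments for an update request that raises the HA pair count."""
    if not current_pairs < target_pairs <= 12:
        raise ValueError("target must be greater than the current count and at most 12")
    return {
        "FileSystemId": file_system_id,
        # Assumption: HAPairs is the total number of pairs after the update.
        "OntapConfiguration": {"HAPairs": target_pairs},
    }


def add_ha_pairs(file_system_id: str, current_pairs: int, target_pairs: int) -> dict:
    """Send the request. Requires boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the request builder can be used without the SDK

    client = boto3.client("fsx")
    request = build_add_ha_pairs_request(file_system_id, current_pairs, target_pairs)
    return client.update_file_system(**request)


# Placeholder ID for illustration only.
print(build_add_ha_pairs_request("fs-0123456789abcdef0", 2, 3))
```

Because HA pairs can't be removed after they are added, validating the target count before sending the request is worthwhile.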

# Balancing workloads across HA pairs
<a name="monitor-workload-balance"></a>

If you have a file system with multiple high-availability (HA) pairs, its throughput and storage are spread across those HA pairs. FSx for ONTAP automatically balances your files as they are written to your file system, but your workload data and I/O are no longer balanced after you add HA pairs. Additionally, in rare cases, your workload data or I/O can become unbalanced across your file system's existing HA pairs, which can impact your workload's overall performance. If your workload becomes imbalanced, you can rebalance it across your file system’s HA pairs (and their corresponding file servers and *aggregates*, the storage pools that make up your primary storage tier).

**Topics**
+ [Primary storage utilization balance](#primary-storage-balance)
+ [File server and disk performance utilization imbalance](#server-disk-imbalance)
+ [Mapping CloudWatch dimensions to ONTAP CLI and REST API resources](#map-dimensions-to-resources)
+ [Rebalancing clients](#rebalancing-clients)
+ [Rebalancing volumes](#rebalancing-volumes)

## Primary storage utilization balance
<a name="primary-storage-balance"></a>

Your file system’s primary storage capacity is divided evenly among each of your HA pairs in storage pools called aggregates. Each HA pair has one aggregate. We recommend that you maintain an average utilization no higher than 80% for your primary storage tier on an ongoing basis. For file systems with multiple HA pairs, we recommend that you maintain an average utilization of up to 80% for every aggregate.

Keeping utilization at or below 80% ensures that there is free space for new incoming data and leaves healthy headroom for maintenance operations, which can temporarily claim free space on your aggregates.
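
The 80% guideline can be checked per aggregate with a few lines of code. This is a sketch under stated assumptions: the used and total capacities are supplied by you (for example, gathered from your monitoring), and the function name and sample figures are hypothetical.

```python
UTILIZATION_TARGET = 0.80  # the recommended ceiling described above


def aggregates_over_target(usage_gib: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map aggregate name -> utilization for aggregates above the 80% target."""
    over = {}
    for name, (used, total) in usage_gib.items():
        utilization = used / total
        if utilization > UTILIZATION_TARGET:
            over[name] = round(utilization, 3)
    return over


# Hypothetical (used GiB, total GiB) figures for two aggregates.
sample = {"aggr1": (900.0, 1024.0), "aggr2": (410.0, 1024.0)}
print(aggregates_over_target(sample))  # {'aggr1': 0.879}
```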

If you notice that your aggregates are imbalanced, you can either increase your file system’s primary storage capacity (commensurately increasing the storage capacity of each aggregate), or you can move your volumes between aggregates. For more information, see [Moving volumes between aggregates](moving-fg-volumes.md).

## File server and disk performance utilization imbalance
<a name="server-disk-imbalance"></a>

Your file system’s total performance capabilities (such as the network throughput, file server to disk throughput and IOPS, and disk IOPS) are divided evenly among your file system’s HA pairs. We recommend that you maintain an average utilization below 50% (and a maximum peak utilization below 80%) for all performance limits on an ongoing basis. This applies both to the overall utilization of your file system’s file server resources across all HA pairs and to each file server individually.
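
The two thresholds above (average below 50%, peak below 80%) can be applied to a series of utilization samples. This is a minimal sketch; the sample values and the function name are hypothetical, and it assumes the samples are percentages for one performance limit on one file server.

```python
from statistics import mean


def check_utilization(samples_pct: list[float]) -> list[str]:
    """Return findings for a series of utilization samples (in percent)."""
    findings = []
    if mean(samples_pct) >= 50:
        findings.append("average utilization is at or above 50%")
    if max(samples_pct) >= 80:
        findings.append("peak utilization is at or above 80%")
    return findings


print(check_utilization([35, 42, 61, 48, 39]))  # []
print(check_utilization([55, 62, 85, 58]))
```

Run the same check on each file server as well as on the file system overall, because a healthy overall average can hide one overloaded file server.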

If you notice that your file server performance utilization is imbalanced—and the file servers on which your workload is imbalanced have an ongoing utilization of over 80%—you can use the ONTAP CLI and REST API to further diagnose the cause of performance imbalance and remediate it. Following is a table of possible imbalance indicators and next steps for further diagnosis.


| If your file system's... | Then... | 
| --- | --- | 
| File server disk throughput or file server disk IOPS are imbalanced | You may be experiencing I/O hotspotting on a subset of HA pairs (that is, a subset of your volumes receives an outsized share of the I/O), which can limit your workload's overall performance because it's bottlenecked on those HA pairs. For each highly utilized file server, check the most-utilized volumes to see which volumes have the most activity within an aggregate. For more information on this procedure, see [Rebalancing volumes](#rebalancing-volumes). | 
| Network throughput is imbalanced, but your file server disk throughput, file server disk IOPS, or disk IOPS are not imbalanced  | Your data is evenly distributed across HA pairs, but your clients are not. For the file servers that have more network throughput utilization than others, check the top clients for each file server, then rebalance those clients by unmounting any volumes from those clients and remounting them using a different endpoint on a different HA pair. For more information on this procedure, see [Rebalancing clients](#rebalancing-clients).  | 
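
The decision logic in the table can be sketched as a small triage function. The 20-point spread used to decide that a metric is "imbalanced" is an assumption made for this illustration, not a documented threshold, and the function names are hypothetical. Compare only preferred file servers, because secondary file servers are normally idle.

```python
def is_imbalanced(per_server_pct: list[float], tolerance_pct: float = 20.0) -> bool:
    """Treat a metric as imbalanced when the spread exceeds an assumed tolerance."""
    return max(per_server_pct) - min(per_server_pct) > tolerance_pct


def next_step(disk_pct: list[float], network_pct: list[float]) -> str:
    """Map the two table rows to the procedure they point to."""
    if is_imbalanced(disk_pct):
        return "rebalance volumes"
    if is_imbalanced(network_pct):
        return "rebalance clients"
    return "no rebalancing needed"


# Hypothetical utilization for three preferred file servers.
print(next_step(disk_pct=[30, 32, 29], network_pct=[70, 25, 28]))  # rebalance clients
print(next_step(disk_pct=[85, 30, 28], network_pct=[40, 38, 41]))  # rebalance volumes
```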

## Mapping CloudWatch dimensions to ONTAP CLI and REST API resources
<a name="map-dimensions-to-resources"></a>

Your second-generation file system has Amazon CloudWatch metrics with the `FileServer` or `Aggregate` dimension. In order to further diagnose cases of imbalance, you need to map these dimension values to specific file servers (or *nodes*) and aggregates in the ONTAP CLI or REST API.
+ For file servers, each file server name maps to a file server (or node) name in ONTAP (for example, `FsxId01234567890abcdef-01`). Odd-numbered file servers are preferred file servers (that is, they service traffic unless the file system has failed over to the secondary file server), while even-numbered file servers are secondary file servers (that is, they serve traffic only when their partner is unavailable). Because of this, secondary file servers will typically show less utilization than preferred file servers.
+ For aggregates, each aggregate name maps to an aggregate in ONTAP (for example, `aggr1`). There is one aggregate for every HA pair, meaning aggregate `aggr1` is shared by file servers `FsxId01234567890abcdef-01` (the preferred file server) and `FsxId01234567890abcdef-02` (the secondary file server) in an HA pair, aggregate `aggr2` is shared by file servers `FsxId01234567890abcdef-03` and `FsxId01234567890abcdef-04`, and so on.
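
The naming convention described in the two points above can be captured in a short helper. This sketch assumes the `FsxId...-NN` and `aggrN` patterns shown in the examples hold for your file system; the helper and type names are hypothetical.

```python
import re
from typing import NamedTuple


class FileServerInfo(NamedTuple):
    ha_pair: int
    role: str
    aggregate: str


_NODE_RE = re.compile(r"^FsxId[0-9a-f]+-(\d+)$")


def describe_file_server(name: str) -> FileServerInfo:
    """Derive the HA pair, role, and aggregate from a file server name."""
    match = _NODE_RE.match(name)
    if match is None or int(match.group(1)) < 1:
        raise ValueError(f"unexpected file server name: {name!r}")
    number = int(match.group(1))
    ha_pair = (number + 1) // 2  # -01 and -02 form pair 1, -03 and -04 form pair 2
    role = "preferred" if number % 2 == 1 else "secondary"
    return FileServerInfo(ha_pair, role, f"aggr{ha_pair}")


print(describe_file_server("FsxId01234567890abcdef-01"))
print(describe_file_server("FsxId01234567890abcdef-04"))
```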

You can view the mappings between all aggregates and file servers using the ONTAP CLI.

1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. Use the [storage aggregate show](https://docs.netapp.com/us-en/ontap-cli-9131/storage-aggregate-show.html) command, specifying the `-fields node` parameter.

   ```
   ::> storage aggregate show -fields node
   aggregate                       node                      
   ------------------------------- ------------------------- 
   aggr1                           FsxId01234567890abcdef-01
   aggr2                           FsxId01234567890abcdef-03
   aggr3                           FsxId01234567890abcdef-05 
   aggr4                           FsxId01234567890abcdef-07
   aggr5                           FsxId01234567890abcdef-09
   aggr6                           FsxId01234567890abcdef-11 
   6 entries were displayed.
   ```
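
If you capture the command output as text, you can turn it into a lookup table. The following sketch assumes the two-column layout shown in the example output; the function name is hypothetical, and the sample is shortened to two rows.

```python
SAMPLE_OUTPUT = """\
aggregate                       node
------------------------------- -------------------------
aggr1                           FsxId01234567890abcdef-01
aggr2                           FsxId01234567890abcdef-03
2 entries were displayed.
"""


def parse_aggregate_nodes(cli_output: str) -> dict[str, str]:
    """Map each aggregate name to the file server (node) reported for it."""
    mapping = {}
    for line in cli_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and the summary line.
        if len(fields) != 2 or fields[0] == "aggregate" or set(fields[0]) == {"-"}:
            continue
        aggregate, node = fields
        mapping[aggregate] = node
    return mapping


print(parse_aggregate_nodes(SAMPLE_OUTPUT))
```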

## Rebalancing clients
<a name="rebalancing-clients"></a>

After adding HA pairs or if you’re experiencing I/O imbalance across file servers (specifically with network throughput utilization), you can rebalance your clients. If you’re rebalancing clients after adding HA pairs, you can skip to [Remounting clients](#remounting-clients). Otherwise, you should first identify high-traffic clients you want to move to rebalance your workload I/O. 

Network throughput imbalance across file servers may be caused by a small number of high-I/O clients. To identify high-traffic clients, use the ONTAP CLI.

**Identify high-traffic clients**

1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. To view the highest-traffic clients, use the [statistics top client show](https://docs.netapp.com/us-en/ontap-cli-9131/statistics-top-client-show.html) ONTAP CLI command. You can optionally specify the `-node` parameter to only view the top clients for a specific file server. If you are diagnosing imbalance for a specific file server, use the `-node` parameter, replacing `node_name` with the name of the file server (for example, `FsxId01234567890abcdef-01`).

   You can optionally add the `-interval` parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample for the amount of traffic driven by each client. The default is `5` (seconds).

   ```
   ::> statistics top client show -node FsxId01234567890abcdef-01 [-interval [5,300]]
   ```

   In the output, the top clients are shown by their IP address and port.

   ```
                                                          *Total     Total
               Client   Vserver          Node                Ops     (Bps)
   ------------------ --------- ------------------------- ------ ---------
    172.17.236.53:938 svm01     FsxId01234567890abcdef-01   2143 140443648
   172.17.236.160:898 svm02     FsxId01234567890abcdef-01    812  53215232
   ```

<a name="remounting-clients"></a>

**Remounting clients**
+ You can rebalance clients to other HA pairs. To do so, unmount the volume from the client and remount it using the DNS name for the SVM’s NFS/SMB endpoint—this returns a random endpoint corresponding to a random HA pair.

  We recommend that you reuse the DNS name, but you have the option to explicitly choose which HA pair a given client mounts. To guarantee that you are mounting a client to a different endpoint, you can instead specify an endpoint IP address other than the one that corresponds to the file server that is experiencing high traffic. To list the endpoint IP addresses and the file server that each one currently resides on, run the following command:

  ```
  ::> network interface show -vserver svm_name -lif nfs_smb_management* -fields address,curr-node
  vserver   lif                  address      curr-node                 
  --------- -------------------- ------------ ------------------------- 
  svm01     nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01
  svm01     nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03
  2 entries were displayed.
  ```

  According to the example output for the `statistics top client show` command, client `172.17.236.53` is driving high traffic to `FsxId01234567890abcdef-01`. The output of the `network interface show` command indicates that the endpoint on this file server has the address `172.31.15.89`. To mount to a different endpoint, select any other address (in this example, the only other address is `172.31.8.112`, corresponding to `FsxId01234567890abcdef-03`).
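
The endpoint selection described above can be sketched in code. The function below assumes the four-column `network interface show` layout from the example output and returns the addresses that do not reside on the busy file server; the function name is hypothetical.

```python
import ipaddress

SAMPLE_OUTPUT = """\
vserver   lif                  address      curr-node
--------- -------------------- ------------ -------------------------
svm01     nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01
svm01     nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03
2 entries were displayed.
"""


def alternate_endpoints(cli_output: str, busy_node: str) -> list[str]:
    """Return endpoint addresses hosted on file servers other than busy_node."""
    addresses = []
    for line in cli_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        _vserver, _lif, address, node = fields
        try:
            ipaddress.ip_address(address)
        except ValueError:
            continue  # header, separator, or summary line
        if node != busy_node:
            addresses.append(address)
    return addresses


print(alternate_endpoints(SAMPLE_OUTPUT, "FsxId01234567890abcdef-01"))  # ['172.31.8.112']
```

Validating the address column is what filters out the header, separator, and summary lines, which otherwise also split into four fields.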

## Rebalancing volumes
<a name="rebalancing-volumes"></a>

If you're experiencing I/O imbalance across your volumes or aggregates, you can rebalance volumes in order to redistribute your I/O traffic across your aggregates.

**Note**  
If you're experiencing storage utilization imbalance across your aggregates, there is generally not any performance impact unless the high utilization is coupled with I/O imbalance. While you can move volumes between aggregates to balance storage utilization, we recommend only moving volumes if you are seeing a performance impact, as moving volumes can have adverse impact on performance if you don't also consider the I/O driven to each volume you're considering moving.

1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. Use the [statistics volume show](https://docs.netapp.com/us-en/ontap-cli-9131/statistics-volume-show.html) ONTAP CLI command to view the highest-traffic volumes for a given aggregate, with the following changes:
   + Replace `aggregate_name` with the aggregate’s name (for example, `aggr1`).
   + You can optionally add the `-interval` parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample for the amount of traffic driven to each volume. The default is `5` (seconds).

   ```
   ::> statistics volume show -aggregate aggregate_name -sort-key total_ops [-interval [5,300]]
   ```

   Depending on the interval you chose, it can take up to 5 minutes to show data. The command shows all volumes in the aggregate, along with the amount of traffic being driven to each volume.

   ```
                                *Total Read Write Other      Read Write Latency 
       Volume Vserver Aggregate    Ops  Ops   Ops   Ops     (Bps) (Bps)    (us) 
   ---------- ------- --------- ------ ---- ----- ----- --------- ----- ------- 
   vol1__0007    svm1     aggr1   4078 4078     0     0 267255808     0    1092 
   vol1__0005    svm1     aggr1   4078 4078     0     0 267255808     0    1086 
   vol1__0003    svm1     aggr1   4077 4077     0     0 267223040     0    1086 
   vol1__0001    svm1     aggr1   4077 4077     0     0 267239424     0    1087 
   vol1__0008    svm1     aggr2   2314 2314     0     0 151650304     0    1112 
   vol1__0006    svm1     aggr2   2144 2144     0     0 140509184     0    1104 
   vol1__0002    svm1     aggr2   2183 2183     0     0 143065088     0    1106 
   vol1__0004    svm1     aggr2   2183 2183     0     0 143065088     0    1103
   ```

   The volume statistics are shown on a per-constituent basis (for example, `vol1__0015` is the 15th constituent for FlexGroup `vol1`). As the example output shows, the constituents on `aggr1` are more highly utilized than the constituents on `aggr2`. To balance traffic between aggregates, you can move the constituent volumes between aggregates so that traffic is more evenly distributed.

1. If you have added new HA pairs, then you should move existing volumes to new aggregates. For more information, see [Moving volumes between aggregates](moving-fg-volumes.md).
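
To compare aggregates at a glance, you can total the per-constituent operations from the `statistics volume show` output. This is a sketch that assumes the ten-column layout of the example output above; the function name is hypothetical.

```python
from collections import defaultdict

SAMPLE_OUTPUT = """\
                             *Total Read Write Other      Read Write Latency
    Volume Vserver Aggregate    Ops  Ops   Ops   Ops     (Bps) (Bps)    (us)
---------- ------- --------- ------ ---- ----- ----- --------- ----- -------
vol1__0007    svm1     aggr1   4078 4078     0     0 267255808     0    1092
vol1__0005    svm1     aggr1   4078 4078     0     0 267255808     0    1086
vol1__0003    svm1     aggr1   4077 4077     0     0 267223040     0    1086
vol1__0001    svm1     aggr1   4077 4077     0     0 267239424     0    1087
vol1__0008    svm1     aggr2   2314 2314     0     0 151650304     0    1112
vol1__0006    svm1     aggr2   2144 2144     0     0 140509184     0    1104
vol1__0002    svm1     aggr2   2183 2183     0     0 143065088     0    1106
vol1__0004    svm1     aggr2   2183 2183     0     0 143065088     0    1103
"""


def total_ops_by_aggregate(cli_output: str) -> dict[str, int]:
    """Sum the total operations column for each aggregate."""
    totals = defaultdict(int)
    for line in cli_output.splitlines():
        fields = line.split()
        # Data rows have ten columns with a numeric total-ops value in the fourth.
        if len(fields) != 10 or not fields[3].isdigit():
            continue
        totals[fields[2]] += int(fields[3])
    return dict(totals)


print(total_ops_by_aggregate(SAMPLE_OUTPUT))  # {'aggr1': 16310, 'aggr2': 8824}
```

In this sample, `aggr1` serves nearly twice the operations of `aggr2`, which suggests moving one or more `aggr1` constituents to a less busy aggregate.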