

# Using Elastic Fabric Adapter (EFA) with AWS PCS
<a name="working-with_networking_efa"></a>

Elastic Fabric Adapter (EFA) is a high-performance network interface from AWS that you can attach to your EC2 instances to accelerate High Performance Computing (HPC) and machine learning applications. To enable EFA for applications running on an AWS PCS cluster, configure the AWS PCS compute node group instances to use EFA as follows.

**Note**  
**Install EFA on an AWS PCS-compatible AMI** – The AMI used in the AWS PCS compute node group must have the EFA driver installed and loaded. For information on how to build a custom AMI with EFA software installed, see [Custom Amazon Machine Images (AMIs) for AWS PCS](working-with_ami_custom.md).

**Contents**
+ [Identify EFA-enabled EC2 instances](working-with_networking_efa_identify-instances.md)
+ [Create a security group to support EFA communications](working-with_networking_efa_create-sg.md)
+ [(Optional) Create a placement group](working-with_networking_efa_create-placement-group.md)
+ [Create or update an EC2 launch template](working-with_networking_efa_create-lt.md)
+ [Create or update compute node groups for EFA](working-with_networking_efa_create-cng.md)
+ [(Optional) Test EFA](working-with_networking_efa_test-efa.md)
+ [(Optional) Use a CloudFormation template to create an EFA-enabled launch template](working-with_networking_efa_create-lt-cfn.md)

# Identify EFA-enabled EC2 instances
<a name="working-with_networking_efa_identify-instances"></a>

To use EFA, all instance types allowed for an AWS PCS compute node group must support EFA and must have the same number of vCPUs (and GPUs, if applicable). For a list of EFA-enabled instances, see [Elastic Fabric Adapter for HPC and ML workloads on Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) in the *Amazon Elastic Compute Cloud User Guide*. You can also use the AWS CLI to view a list of instance types that support EFA. Replace *region-code* with the AWS Region where you use AWS PCS, such as `us-east-1`.

```
aws ec2 describe-instance-types \
   --region region-code \
   --filters Name=network-info.efa-supported,Values=true \
   --query "InstanceTypes[*].[InstanceType]" \
   --output text | sort
```
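Because every instance type in the node group must expose the same vCPU count, it can help to compare candidate types side by side. The following sketch lists the vCPU count and EFA support for specific instance types; the types shown are examples, so substitute your own:

```
# Compare vCPU counts and EFA support for candidate instance types.
# The instance types below are examples; substitute your own.
aws ec2 describe-instance-types \
    --region region-code \
    --instance-types hpc7g.16xlarge c7gn.16xlarge \
    --query "InstanceTypes[*].[InstanceType, VCpuInfo.DefaultVCpus, NetworkInfo.EfaSupported]" \
    --output table
```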

**Note**  
**Determine how many network interfaces are available** – Some EC2 instances have multiple network cards. This allows them to have multiple EFAs. For more information, see [Multiple network interfaces in AWS PCS](working-with_networking_multi-nic.md).
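You can query the `NetworkInfo.MaximumNetworkCards` attribute to check how many network cards an instance type has, and therefore how many EFAs it can support. This is a sketch; substitute your own instance type and Region:

```
# Show the maximum number of network cards for an instance type.
aws ec2 describe-instance-types \
    --region region-code \
    --instance-types hpc7a.96xlarge \
    --query "InstanceTypes[*].[InstanceType, NetworkInfo.MaximumNetworkCards]" \
    --output text
```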

# Create a security group to support EFA communications
<a name="working-with_networking_efa_create-sg"></a>

------
#### [ AWS CLI ]

You can use the following AWS CLI command to create a security group that supports EFA. The command outputs a security group ID. Make the following replacements:
+ `region-code` – Specify the AWS Region where you use AWS PCS, such as `us-east-1`.
+ `vpc-id` – Specify the ID of the VPC that you use for AWS PCS.
+ `efa-group-name` – Provide your chosen name for the security group.

```
aws ec2 create-security-group \
    --group-name efa-group-name \
    --description "Security group to enable EFA traffic" \
    --vpc-id vpc-id \
    --region region-code
```

Use the following commands to attach inbound and outbound security group rules. Make the following replacement: 
+ `efa-secgroup-id` – Provide the ID of the EFA security group you just created. 

```
aws ec2 authorize-security-group-ingress \
    --group-id efa-secgroup-id \
    --protocol -1 \
    --source-group efa-secgroup-id
    
aws ec2 authorize-security-group-egress \
    --group-id efa-secgroup-id \
    --protocol -1 \
    --source-group efa-secgroup-id
```
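To confirm that the self-referencing rules were attached, you can list the group's rules with `describe-security-group-rules`. This is a sketch, with `efa-secgroup-id` and `region-code` replaced as above:

```
# List the inbound and outbound rules attached to the EFA security group.
aws ec2 describe-security-group-rules \
    --region region-code \
    --filters Name=group-id,Values=efa-secgroup-id \
    --query "SecurityGroupRules[*].[IsEgress, IpProtocol, ReferencedGroupInfo.GroupId]" \
    --output table
```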

------
#### [ CloudFormation template ]

You can use a CloudFormation template to create a security group that supports EFA. Download the template from the following URL, then upload it into the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation).

```
https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/enable_efa/assets/efa-sg.yaml
```

With the template open in the AWS CloudFormation console, enter the following options.
+ Under **Provide a stack name**
  + Under **Stack name**, enter a name such as `efa-sg-stack`.
+ Under **Parameters**
  + Under **SecurityGroupName**, enter a name such as `efa-sg`.
  + Under **VPC**, select the VPC where you will use AWS PCS.

Finish creating the CloudFormation stack and monitor its status. When it reaches `CREATE_COMPLETE`, the EFA security group is ready for use.

------

# (Optional) Create a placement group
<a name="working-with_networking_efa_create-placement-group"></a>

We recommend that you launch all instances that use EFA in a cluster placement group to minimize the physical distance between them. Create a placement group for each compute node group where you plan to use EFA. For instructions, see [Placement groups for EC2 instances in AWS PCS](working-with_networking_placement-groups.md).
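As a sketch, a cluster placement group can also be created with a single CLI call. Replace `efa-pg` (an example name) and `region-code` with your own values:

```
# Create a cluster placement group to keep EFA instances physically close.
aws ec2 create-placement-group \
    --region region-code \
    --group-name efa-pg \
    --strategy cluster
```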

# Create or update an EC2 launch template
<a name="working-with_networking_efa_create-lt"></a>

EFA network interfaces are set up in the EC2 launch template for an AWS PCS compute node group. If there are multiple network cards, multiple EFAs can be configured. The EFA security group and the optional placement group are included in the launch template as well. 

Here is an example launch template for instances with two network cards, such as **hpc7a.96xlarge**. The instances will be launched in `subnet-SubnetId1` in cluster placement group `pg-PlacementGroupId1`.

Security groups must be attached explicitly to each EFA interface. Every EFA needs the security group that enables EFA traffic (`sg-EfaSecGroupId`). Other security groups, especially ones that handle regular traffic such as SSH or HTTPS, only need to be attached to the primary network interface (designated by a `DeviceIndex` of `0`). Launch templates that define network interfaces do not support setting security groups with the `SecurityGroupIds` parameter; instead, you must set a value for `Groups` in each network interface that you configure.

```
{
    "Placement": {
        "GroupId": "pg-PlacementGroupId1"
    },
    "NetworkInterfaces": [
        {
            "DeviceIndex": 0,
            "InterfaceType": "efa",
            "NetworkCardIndex": 0,
            "SubnetId": "subnet-SubnetId1",
            "Groups": [
                "sg-SecurityGroupId1",
                "sg-EfaSecGroupId"
            ]
        },
        {
            "DeviceIndex": 1,
            "InterfaceType": "efa",
            "NetworkCardIndex": 1,
            "SubnetId": "subnet-SubnetId1",
            "Groups": ["sg-EfaSecGroupId"]
        }
    ]
}
```
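Assuming the JSON above is saved to a file such as `lt-data.json` (a hypothetical name), you could create the launch template with a call like the following; the template name is an example:

```
# Create a launch template from the EFA network interface definition.
aws ec2 create-launch-template \
    --region region-code \
    --launch-template-name my-efa-lt \
    --launch-template-data file://lt-data.json
```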

# Create or update compute node groups for EFA
<a name="working-with_networking_efa_create-cng"></a>

Your AWS PCS compute node groups must contain instances that share the same vCPU count, processor architecture, and EFA support. Configure the compute node group to use the AMI with the EFA software installed, and the launch template that configures EFA-enabled network interfaces.
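As a sketch, a node group that combines the EFA AMI and launch template might be created with a call like the following. All identifiers here are placeholders, and the exact parameter set should be checked against the AWS PCS CLI reference:

```
# Create a compute node group that uses the EFA-enabled AMI and launch template.
# All identifiers below are placeholders; substitute your own values.
aws pcs create-compute-node-group \
    --region region-code \
    --cluster-identifier my-cluster \
    --compute-node-group-name efa-nodes \
    --ami-id ami-0123456789abcdef0 \
    --subnet-ids subnet-SubnetId1 \
    --custom-launch-template id=lt-0123456789abcdef0,version=1 \
    --iam-instance-profile-arn arn:aws:iam::111122223333:instance-profile/AWSPCS-example \
    --scaling-configuration minInstanceCount=0,maxInstanceCount=2 \
    --instance-configs instanceType=hpc7a.96xlarge
```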

# (Optional) Test EFA
<a name="working-with_networking_efa_test-efa"></a>

You can demonstrate EFA-enabled communication between two nodes in a compute node group by running the `fi_pingpong` program, which is included in the EFA software installation. If this test succeeds, EFA is likely configured properly.

To start, you need two running instances in the compute node group. If your compute node group uses static capacity, there should already be instances available. For a compute node group that uses dynamic capacity, you can launch two nodes with the `salloc` command. Here is an example from a cluster with a dynamic node group named `hpc7g` associated with a queue named `all`.

```
% salloc --nodes 2 -p all
salloc: Granted job allocation 6
salloc: Waiting for resource configuration
... a few minutes pass ...
salloc: Nodes hpc7g-[1-2] are ready for job
```

Find the IP addresses of the two allocated nodes using `scontrol`. In the example that follows, the addresses are `10.3.140.69` for `hpc7g-1` and `10.3.132.211` for `hpc7g-2`.

```
% scontrol show nodes hpc7g-[1-2]
NodeName=hpc7g-1 Arch=aarch64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00
   AvailableFeatures=hpc7g
   ActiveFeatures=hpc7g
   Gres=(null)
   NodeAddr=10.3.140.69 NodeHostName=ip-10-3-140-69 Version=25.05.5
   OS=Linux 5.10.218-208.862.amzn2.aarch64 #1 SMP Tue Jun 4 16:52:10 UTC 2024
   RealMemory=124518 AllocMem=0 FreeMem=110763 Sockets=64 Boards=1
   State=IDLE+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=efa
   BootTime=2024-07-02T19:00:09 SlurmdStartTime=2024-07-08T19:33:25
   LastBusyTime=2024-07-08T19:33:25 ResumeAfterTime=None
   CfgTRES=cpu=64,mem=124518M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Maintain Minimum Number Of Instances [root@2024-07-02T18:59:00]
   InstanceId=i-04927897a9ce3c143 InstanceType=hpc7g.16xlarge

NodeName=hpc7g-2 Arch=aarch64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00
   AvailableFeatures=hpc7g
   ActiveFeatures=hpc7g
   Gres=(null)
   NodeAddr=10.3.132.211 NodeHostName=ip-10-3-132-211 Version=25.05.5
   OS=Linux 5.10.218-208.862.amzn2.aarch64 #1 SMP Tue Jun 4 16:52:10 UTC 2024
   RealMemory=124518 AllocMem=0 FreeMem=110759 Sockets=64 Boards=1
   State=IDLE+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=efa
   BootTime=2024-07-02T19:00:09 SlurmdStartTime=2024-07-08T19:33:25
   LastBusyTime=2024-07-08T19:33:25 ResumeAfterTime=None
   CfgTRES=cpu=64,mem=124518M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Maintain Minimum Number Of Instances [root@2024-07-02T18:59:00]
   InstanceId=i-0a2c82623cb1393a7 InstanceType=hpc7g.16xlarge
```

Connect to one of the nodes (in this example, `hpc7g-1`) using SSH (or SSM). Note that this is an internal IP address, so you might need to connect from one of your login nodes if you use SSH. Also, the instance must be configured with an SSH key by way of the compute node group launch template.

```
% ssh ec2-user@10.3.140.69
```

 Now, launch `fi_pingpong` in server mode. 

```
/opt/amazon/efa/bin/fi_pingpong -p efa
```

 Connect to the second instance (`hpc7g-2`).

```
% ssh ec2-user@10.3.132.211
```

 Run `fi_pingpong` in client mode, connecting to the server on `hpc7g-1`. You should see output that resembles the example below. 

```
% /opt/amazon/efa/bin/fi_pingpong -p efa 10.3.140.69

bytes   #sent   #ack     total       time     MB/sec    usec/xfer   Mxfers/sec
64      10      =10      1.2k        0.00s      3.08      20.75       0.05
256     10      =10      5k          0.00s     21.24      12.05       0.08
1k      10      =10      20k         0.00s     82.91      12.35       0.08
4k      10      =10      80k         0.00s    311.48      13.15       0.08
[error] util/pingpong.c:1876: fi_close (-22) fid 0
```

# (Optional) Use a CloudFormation template to create an EFA-enabled launch template
<a name="working-with_networking_efa_create-lt-cfn"></a>

Because setting up EFA involves several dependencies, a CloudFormation template is provided that you can use to configure a compute node group. It supports instances with up to four network cards. To learn more about instances with multiple network cards, see [Elastic network interfaces](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#network-cards) in the *Amazon Elastic Compute Cloud User Guide*.

Download the CloudFormation template from the following URL, then upload it to the CloudFormation console in the AWS Region where you use AWS PCS. 

```
https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/enable_efa/assets/pcs-lt-efa.yaml
```

With the template open in the CloudFormation console, enter the following values. The template provides default values for some parameters; you can leave those at their defaults.
+ Under **Provide a stack name**
  + Under **Stack name**, enter a descriptive name. We recommend incorporating the name you will choose for your AWS PCS compute node group, such as `NODEGROUPNAME-efa-lt`.
+ Under **Parameters**
  + Under **NumberOfNetworkCards**, choose the number of network cards in the instances that will be in your node group.
  + Under **VpcId**, choose the VPC where your AWS PCS cluster is deployed.
  + Under **NodeGroupSubnetId**, choose the subnet in your cluster VPC where EFA-enabled instances will be launched.
  + Under **PlacementGroupName**, leave the field blank to create a new cluster placement group for the node group. If you have an existing placement group you want to use, enter its name here.
  + Under **ClusterSecurityGroupId**, choose the security group you are using to allow access to other instances in the cluster and to the AWS PCS API. Many customers choose the default security group from their cluster VPC.
  + Under **SshSecurityGroupId**, provide the ID for a security group you are using to allow inbound SSH access to nodes in your cluster.
  + For **SshKeyName**, select the SSH keypair for access to nodes in your cluster.
  + For **LaunchTemplateName**, enter a descriptive name for the launch template such as `NODEGROUPNAME-efa-lt`. The name must be unique to your AWS account in the AWS Region where you will use AWS PCS.
+ Under **Capabilities**
  + Check the box for **I acknowledge that AWS CloudFormation might create IAM resources**.

Monitor the status of the CloudFormation stack. When it reaches `CREATE_COMPLETE`, the launch template is ready to use with an AWS PCS compute node group, as described above in [Create or update compute node groups for EFA](working-with_networking_efa_create-cng.md).