

# Create an Amazon EMR cluster with instance fleets or uniform instance groups

When you create a cluster and specify the configuration of the primary node, core nodes, and task nodes, you have two configuration options: *instance fleets* or *uniform instance groups*. The configuration option you choose applies to all nodes and for the lifetime of the cluster, and instance fleets and instance groups cannot coexist in a cluster. The instance fleets configuration is available in Amazon EMR version 4.8.0 and later, excluding 5.0.x versions.

You can use the Amazon EMR console, the AWS CLI, or the Amazon EMR API to create clusters with either configuration. When you use the `create-cluster` command from the AWS CLI, you use either the `--instance-fleets` parameters to create the cluster using instance fleets or, alternatively, you use the `--instance-groups` parameters to create it using uniform instance groups.

The same is true using the Amazon EMR API. You use either the `InstanceGroups` configuration to specify an array of `InstanceGroupConfig` objects, or you use the `InstanceFleets` configuration to specify an array of `InstanceFleetConfig` objects.
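As a sketch of what that API shape looks like, the following builds the `InstanceFleets` portion of a request as a plain Python dict (instance types, names, and capacities are illustrative; no API call is made):

```python
# Sketch of the InstanceFleets portion of a RunJobFlow request.
# This only builds the request structure; it does not contact AWS.
instance_fleets = [
    {
        "Name": "primary-fleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,  # the primary fleet is always a single instance
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    },
    {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,  # units to fill with On-Demand Instances
        "TargetSpotCapacity": 2,      # units to fill with Spot Instances
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        ],
    },
]
```

With boto3, a list like this would be passed as part of the `Instances` parameter to `run_job_flow`; the equivalent AWS CLI shorthand appears in the examples later on this page.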

In the new Amazon EMR console, you can choose to use either instance groups or instance fleets when you create a cluster, and you have the option to use Spot Instances with each. With the old Amazon EMR console, if you use the default **Quick Options** settings when you create your cluster, Amazon EMR applies the uniform instance groups configuration to the cluster and uses On-Demand Instances. To use Spot Instances with uniform instance groups, or to configure instance fleets and other customizations, choose **Advanced Options**.

## Instance fleets


The instance fleets configuration offers the widest variety of provisioning options for Amazon EC2 instances. Each node type has a single instance fleet, and using a task instance fleet is optional. You can specify up to five EC2 instance types per fleet, or 30 EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [allocation strategy](emr-instance-fleet.md#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. For the core and task instance fleets, you assign a *target capacity* for On-Demand Instances, and another for Spot Instances. Amazon EMR chooses any mix of the specified instance types to fulfill the target capacities, provisioning both On-Demand and Spot Instances.

For the primary node type, Amazon EMR chooses a single instance type from your list of instances, and you specify whether it's provisioned as an On-Demand or Spot Instance. Instance fleets also provide additional options for Spot Instance and On-Demand purchases. Spot Instance options include a timeout that specifies an action to take if Spot capacity can't be provisioned, and a preferred allocation strategy (capacity-optimized) for launching Spot Instance fleets. On-Demand Instance fleets can also be launched using the allocation strategy (lowest-price) option. If you use a service role that is not the EMR default service role, or use an EMR managed policy in your service role, you need to add additional permissions to the custom cluster service role to enable the allocation strategy option. For more information, see [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about configuring instance fleets, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).

## Uniform instance groups

Uniform instance groups offer a simpler setup than instance fleets. Each Amazon EMR cluster can include up to 50 instance groups: one primary instance group that contains one Amazon EC2 instance, a core instance group that contains one or more EC2 instances, and up to 48 optional task instance groups. Each core and task instance group can contain any number of Amazon EC2 instances. You can scale each instance group by adding and removing Amazon EC2 instances manually, or you can set up automatic scaling. For information about adding and removing instances, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

For more information about configuring uniform instance groups, see [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md). 
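For contrast with the instance fleets shape, a minimal uniform-instance-groups sketch follows, using the API's `InstanceGroupConfig` objects (instance types and counts are illustrative; no API call is made):

```python
# Sketch of the InstanceGroups portion of a RunJobFlow request.
instance_groups = [
    {"Name": "primary", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE",
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    # Up to 48 optional task groups may follow, each with its own type and count.
    {"Name": "task-1", "InstanceRole": "TASK",
     "InstanceType": "m5.xlarge", "InstanceCount": 3},
]
```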

## Working with instance fleets and instance groups

**Topics**
+ [Instance fleets](#emr-plan-instance-fleets)
+ [Uniform instance groups](#emr-plan-instance-groups)
+ [Working with instance fleets and instance groups](#emr-plan-instance-topics)
+ [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md)
+ [Reconfiguring instance fleets for your Amazon EMR cluster](instance-fleet-reconfiguration.md)
+ [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md)
+ [Configure uniform instance groups for your Amazon EMR cluster](emr-uniform-instance-group.md)
+ [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md)
+ [Configuring Amazon EMR cluster instance types and best practices for Spot instances](emr-plan-instances-guidelines.md)

# Planning and configuring instance fleets for your Amazon EMR cluster


**Note**  
The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.

The instance fleet configuration for Amazon EMR clusters lets you select a wide variety of provisioning options for Amazon EC2 instances, and helps you develop a flexible and elastic resourcing strategy for each node type in your cluster. 

In an instance fleet configuration, you specify a *target capacity* for [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) and [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. When Amazon EC2 reclaims a Spot Instance in a running cluster because of a price increase or instance failure, Amazon EMR tries to replace the instance with any of the instance types that you specify. This makes it easier to regain capacity during a spike in Spot pricing. 

You can specify a maximum of five Amazon EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets, or a maximum of 30 Amazon EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [allocation strategy](#emr-instance-fleet-allocation-strategy) for On-Demand and Spot Instances. 

You can also select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify. If Amazon EMR detects an AWS large-scale event in one or more of the Availability Zones, Amazon EMR automatically attempts to route traffic away from the impacted Availability Zones and tries to launch new clusters that you create in alternate Availability Zones according to your selections. Note that cluster Availability Zone selection happens only at cluster creation. Existing cluster nodes are not automatically re-launched in a new Availability Zone in the event of an Availability Zone outage.

## **Considerations for working with instance fleets**


Consider the following items when you use instance fleets with Amazon EMR.
+ You can have one, and only one, instance fleet per node type (primary, core, task). You can specify up to five Amazon EC2 instance types for each fleet in the AWS Management Console (or a maximum of 30 types per instance fleet when you create a cluster using the AWS CLI or Amazon EMR API and an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy)). 
+ Amazon EMR chooses any or all of the specified Amazon EC2 instance types to provision with both Spot and On-Demand purchasing options.
+ You can establish target capacities for Spot and On-Demand Instances for the core fleet and task fleet. Capacity is counted in vCPUs or in generic units that you assign to each Amazon EC2 instance type. Amazon EMR provisions instances until each target capacity is completely fulfilled. For the primary fleet, the target is always one.
+ You can choose one subnet (Availability Zone) or a range. If you choose a range, Amazon EMR provisions capacity in the Availability Zone that is the best fit.
+ When you specify a target capacity for Spot Instances:
  + For each instance type, specify a maximum Spot price. Amazon EMR provisions Spot Instances if the Spot price is below the maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.
  + For each fleet, define a timeout period for provisioning Spot Instances. If Amazon EMR can't provision Spot capacity, you can terminate the cluster or switch to provisioning On-Demand capacity instead. This only applies for provisioning clusters, not resizing them. If the timeout period ends during the cluster resizing process, unprovisioned Spot requests will be nullified without transferring to On-Demand capacity. 
+ For each fleet, you can specify one of the following allocation strategies for your Spot Instances: price-capacity optimized, capacity-optimized, capacity-optimized-prioritized, lowest-price, or diversified across all pools.
+ For each fleet, you can apply the following allocation strategies for your On-Demand Instances: the lowest-price strategy or the prioritized strategy.
+ For each fleet with On-Demand Instances, you can choose to apply capacity reservation options.
+ If you use allocation strategy for instance fleets, the following considerations apply when you choose subnets for your EMR cluster:
  + When Amazon EMR provisions a cluster with a task fleet, it filters out subnets that lack enough available IP addresses to provision all instances of the requested EMR cluster. This includes IP addresses required for the primary, core, and task instance fleets during cluster launch. Amazon EMR then leverages its allocation strategy to determine the instance pool, based on instance type and remaining subnets with sufficient IP addresses, to launch the cluster.
  + If Amazon EMR cannot launch the whole cluster due to insufficient available IP addresses, it will attempt to identify subnets with enough free IP addresses to launch the essential (core and primary) instance fleets. In such scenarios, your task instance fleet will go into a suspended state, rather than terminating the cluster with an error.
  + If none of the specified subnets contain enough IP addresses to provision the essential core and primary instance fleets, the cluster launch will fail with a **VALIDATION_ERROR**. This triggers a **CRITICAL** severity cluster termination event, notifying you that the cluster cannot be launched. To prevent this issue, we recommend increasing the number of IP addresses in your subnets.
+ If you run Amazon EMR release **emr-7.7.0** and above, and you use allocation strategy for instance fleets, you can scale the cluster up to 4000 EC2 instances and 14000 EBS volumes per instance fleet. For release versions below **emr-7.7.0**, the cluster can be scaled up only to 2000 EC2 instances and 7000 EBS volumes per instance fleet.
+ When you launch On-Demand Instances, you can use open or targeted capacity reservations for primary, core, and task nodes in your accounts. You might see insufficient capacity with On-Demand Instances with allocation strategy for instance fleets. We recommend that you specify multiple instance types to diversify and reduce the chance of experiencing insufficient capacity. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
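Several of the behaviors above, such as the Spot provisioning timeout, its fallback action, and the per-fleet allocation strategies, are configured through the fleet's launch specifications. A sketch of that structure (the duration and strategy values here are illustrative):

```python
# Per-fleet launch specifications: what to do if Spot capacity can't be
# provisioned within the timeout, and which allocation strategies to use.
launch_specifications = {
    "SpotSpecification": {
        "TimeoutDurationMinutes": 30,            # wait up to 30 minutes for Spot capacity
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or "TERMINATE_CLUSTER"
        "AllocationStrategy": "price-capacity-optimized",
    },
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
    },
}
```

As noted above, the timeout action applies only when provisioning a cluster; a timeout during a resize nullifies the unprovisioned Spot request rather than switching it to On-Demand.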

## Instance fleet options


Use the following guidelines to understand instance fleet options.

**Topics**
+ [Setting target capacities](#emr-fleet-capacity)
+ [Launch options](#emr-fleet-spot-options)
+ [Multiple subnet (Availability Zones) options](#emr-multiple-subnet-options)
+ [Master node configuration](#emr-master-node-configuration)

### **Setting target capacities**


The target capacities that you specify for the core fleet and task fleet determine the number of On-Demand Instances and Spot Instances that Amazon EMR provisions. When you specify an instance type, you also decide how much each instance of that type counts toward the target. When an On-Demand Instance is provisioned, it counts toward the On-Demand target; the same is true for Spot Instances. Unlike the core and task fleets, the primary fleet always consists of one instance, so its target capacity is always one. 

When you use the console, the vCPUs of the Amazon EC2 instance type are used as the count for target capacities by default. You can change this to **Generic units**, and then specify the count for each EC2 instance type. When you use the AWS CLI, you manually assign generic units for each instance type. 

**Important**  
When you choose an instance type using the AWS Management Console, the number of **vCPU** shown for each **Instance type** is the number of YARN vcores for that instance type, not the number of EC2 vCPUs for that instance type. For more information on the number of vCPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).

For each fleet, you specify up to five Amazon EC2 instance types. If you use an [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy) and create a cluster using the AWS CLI or the Amazon EMR API, you can specify up to 30 EC2 instance types per instance fleet. Amazon EMR chooses any combination of these EC2 instance types to fulfill your target capacities. Because Amazon EMR wants to fill target capacity completely, an overage might happen. For example, if there are two unfulfilled units, and Amazon EMR can only provision an instance with a count of five units, the instance still gets provisioned, meaning that the target capacity is exceeded by three units. 

If you reduce the target capacity to resize a running cluster, Amazon EMR attempts to complete application tasks and terminates instances to meet the new target. For more information, see [Terminate at task completion](emr-scaledown-behavior.md#emr-scaledown-terminate-task).
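The overage behavior described above can be illustrated with a toy fill loop; this is only an illustration of the arithmetic, not Amazon EMR's actual provisioning algorithm:

```python
def fill_capacity(target_units, instance_weight):
    """Provision instances of a single type until the target is met.
    The last instance may overshoot the target, which EMR permits."""
    provisioned = 0
    instances = 0
    while provisioned < target_units:
        provisioned += instance_weight
        instances += 1
    return instances, provisioned

# Two unfulfilled units, and the only instance type counts for 5 units:
instances, provisioned = fill_capacity(2, 5)
# One instance is provisioned, and the target is exceeded by 3 units.
```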

### **Launch options**


For Spot Instances, you can specify a **Maximum Spot price** for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. Amazon EMR provisions Spot Instances if the current Spot price in an Availability Zone is below your maximum Spot price. You pay the Spot price, not necessarily the maximum Spot price.

**Note**  
Spot Instances with a defined duration (also known as Spot blocks) are no longer available to new customers from July 1, 2021. For customers who have previously used the feature, we will continue to support Spot Instances with a defined duration until December 31, 2022.

With Amazon EMR 5.12.1 and later, you have the option to launch Spot and On-Demand Instance fleets with optimized capacity allocation. This allocation strategy option can be set in the old AWS Management Console or using the API `RunJobFlow`. Note that you can't customize allocation strategy in the new console. Using the allocation strategy option requires additional service role permissions. If you use the default Amazon EMR service role and managed policy ([`EMR_DefaultRole`](emr-iam-role.md) and `AmazonEMRServicePolicy_v2`) for the cluster, the permissions for the allocation strategy option are already included. If you're not using the default Amazon EMR service role and managed policy, you must add the required permissions to use this option. See [Service role for Amazon EMR (EMR role)](emr-iam-role.md).

For more information about Spot Instances, see [Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) in the Amazon EC2 User Guide. For more information about On-Demand Instances, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the Amazon EC2 User Guide.

If you choose to launch On-Demand Instance fleets with the lowest-price allocation strategy, you have the option to use capacity reservations. Capacity reservation options can be set using the Amazon EMR API `RunJobFlow`. Capacity reservations require additional service role permissions which you must add to use these options. See [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy). Note that you can't customize capacity reservations in the new console.

### **Multiple subnet (Availability Zones) options**


When you use instance fleets, you can specify multiple Amazon EC2 subnets within a VPC, each corresponding to a different Availability Zone. If you use EC2-Classic, you specify Availability Zones explicitly. Amazon EMR identifies the best Availability Zone to launch instances according to your fleet specifications. Instances are always provisioned in only one Availability Zone. You can select private subnets or public subnets, but you can't mix the two, and the subnets you specify must be within the same VPC.

### **Master node configuration**


Because the primary instance fleet is only a single instance, its configuration is slightly different from core and task instance fleets. You only select either On-Demand or Spot for the primary instance fleet because it consists of only one instance. If you use the console to create the instance fleet, the target capacity for the purchasing option you select is set to 1. If you use the AWS CLI, always set either `TargetSpotCapacity` or `TargetOnDemandCapacity` to 1 as appropriate. You can still choose up to five instance types for the primary instance fleet (or a maximum of 30 when you use the allocation strategy option for On-Demand or Spot Instances). However, unlike core and task instance fleets, where Amazon EMR might provision multiple instances of different types, Amazon EMR selects a single instance type to provision for the primary instance fleet.
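A sketch of a primary instance fleet entry with several candidate types (types are illustrative; Amazon EMR will select exactly one of them to provision):

```python
# Primary instance fleet: target capacity is always 1. Use
# TargetOnDemandCapacity instead of TargetSpotCapacity for an
# On-Demand primary node.
primary_fleet = {
    "Name": "primary",
    "InstanceFleetType": "MASTER",
    "TargetSpotCapacity": 1,
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5.2xlarge"},
        {"InstanceType": "m4.xlarge"},
    ],
}
```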

## Allocation strategy for instance fleets

With Amazon EMR versions 5.12.1 and later, you can use the allocation strategy option with On-Demand and Spot Instances for each cluster node. When you create a cluster using the AWS CLI, Amazon EMR API, or Amazon EMR console with an allocation strategy, you can specify up to 30 Amazon EC2 instance types per fleet. With the default Amazon EMR cluster instance fleet configuration, you can have up to 5 instance types per fleet. We recommend that you use the allocation strategy option for faster cluster provisioning, more accurate Spot Instance allocation, and fewer Spot Instance interruptions.

**Topics**
+ [Allocation strategy with On-Demand Instances](#emr-instance-fleet-allocation-strategy-od)
+ [Allocation strategy with Spot Instances](#emr-instance-fleet-allocation-strategy-spot)
+ [Allocation strategy permissions](#emr-instance-fleet-allocation-strategy-permissions)
+ [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy)

### Allocation strategy with On-Demand Instances

The following allocation strategies are available for your On-Demand Instances:

**`lowest-price`** (default)  
The lowest-price allocation strategy launches On-Demand Instances from the lowest-priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the On-Demand Instances come from the next lowest-priced pool with available capacity.

`prioritized`  
The prioritized allocation strategy lets you specify a priority value for each instance type for your instance fleet. Amazon EMR launches your On-Demand Instances that have the highest priority. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure the priority value for an instance type, Amazon EMR assigns the lowest priority to that instance type. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.

**Note**  
If you use the **capacity-optimized-prioritized** Spot allocation strategy, Amazon EMR applies the same priorities to both your On-Demand Instances and Spot instances when you set priorities.
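With the `prioritized` strategy, the priority rides on each `InstanceTypeConfig`. A sketch (values illustrative; in the EMR API, priority starts at 0, which is the highest priority):

```python
# Task fleet using the "prioritized" On-Demand allocation strategy.
# Priority 0 is the highest; types without a priority get the lowest.
on_demand_prioritized_fleet = {
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 4,
    "LaunchSpecifications": {
        "OnDemandSpecification": {"AllocationStrategy": "prioritized"},
    },
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge", "Priority": 0},   # tried first
        {"InstanceType": "m5.2xlarge", "Priority": 1},  # fallback
    ],
}
```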

### Allocation strategy with Spot Instances

For *Spot Instances*, you can choose one of the following allocation strategies:

**`price-capacity-optimized`** (recommended)  
The price-capacity optimized allocation strategy launches Spot Instances from the Spot Instance pools that have the highest capacity available and the lowest price for the number of instances that are launching. As a result, the price-capacity optimized strategy typically has a higher chance of getting Spot capacity and delivers lower interruption rates. This is the default strategy for Amazon EMR releases 6.10.0 and higher.

**`capacity-optimized`**  
The capacity-optimized allocation strategy launches Spot Instances into the most available pools with the lowest chance of interruption in the near term. This is a good option for workloads that might have a higher cost of interruption associated with work that gets restarted. This is the default strategy for Amazon EMR releases 6.9.0 and lower.

**`capacity-optimized-prioritized`**  
The capacity-optimized-prioritized allocation strategy lets you specify a priority value for each instance type in your instance fleet. Amazon EMR optimizes for capacity first, but honors instance type priorities on a best-effort basis, for example when honoring a priority doesn't significantly affect the fleet's ability to provision optimal capacity. We recommend this option for workloads that need minimal disruption but still prefer certain instance types. If you use this strategy, you must configure the priority for at least one instance type. If you don't configure a priority for an instance type, Amazon EMR assigns the lowest priority value to that instance type. Each instance fleet (primary, core, or task) in a cluster can have a different priority value for a given instance type.  
If you use the **prioritized** On-Demand allocation strategy, Amazon EMR applies the same priority value to both your On-Demand and Spot Instances when you set priorities.

**`diversified`**  
With the diversified allocation strategy, Amazon EC2 distributes Spot Instances across all Spot capacity pools.

**`lowest-price`**  
The lowest-price allocation strategy launches Spot Instances from the lowest priced pool that has available capacity. If the lowest-priced pool doesn't have available capacity, the Spot Instances come from the next lowest priced pool that has available capacity. If a pool runs out of capacity before it fulfills your requested capacity, the Amazon EC2 fleet draws from the next lowest priced pool to continue to fulfill your request. To ensure that your desired capacity is met, you might receive Spot Instances from several pools. Because this strategy only considers instance price, and does not consider capacity availability, it might lead to high interruption rates.
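These strategy names map directly to the `SpotSpecification.AllocationStrategy` value shown earlier. The following summary dict pairs each valid value with the trade-off described above (the descriptions are condensed from this section):

```python
# Valid SpotSpecification.AllocationStrategy values and their trade-offs.
SPOT_ALLOCATION_STRATEGIES = {
    "price-capacity-optimized": "best price among the most available pools; default on EMR 6.10.0+",
    "capacity-optimized": "lowest near-term interruption risk; default on EMR 6.9.0 and lower",
    "capacity-optimized-prioritized": "capacity first, honoring per-type priorities best-effort",
    "diversified": "spread across all Spot capacity pools",
    "lowest-price": "cheapest pools first; highest interruption risk",
}
```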

### Allocation strategy permissions

The allocation strategy option requires several IAM permissions that are automatically included in the default Amazon EMR service role and Amazon EMR managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you use a custom service role or managed policy for your cluster, you must add these permissions before you create the cluster. For more information, see [Required IAM permissions for an allocation strategy](#create-cluster-allocation-policy).

Optional On-Demand Capacity Reservations (ODCRs) are available when you use the On-Demand allocation strategy option. Capacity reservation options let you specify a preference for using reserved capacity first for Amazon EMR clusters. You can use this to ensure that your critical workloads use the capacity you have already reserved using open or targeted ODCRs. For non-critical workloads, the capacity reservation preferences let you specify whether reserved capacity should be consumed.

Capacity reservations can only be used by instances that match their attributes (instance type, platform, and Availability Zone). By default, open capacity reservations are automatically used by Amazon EMR when provisioning On-Demand Instances that match the instance attributes. If you don't have any running instances that match the attributes of the capacity reservations, they remain unused until you launch an instance matching their attributes. If you don't want to use any capacity reservations when launching your cluster, you must set capacity reservation preference to **none** in launch options.

However, you can also target a capacity reservation for specific workflows. This enables you to explicitly control which instances are allowed to run in that reserved capacity. For more information about On-Demand capacity reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).
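The capacity reservation options described above ride on the fleet's `OnDemandSpecification`. A sketch (the resource group ARN is a labeled placeholder, not a real value):

```python
# Capacity reservation options for an On-Demand instance fleet.
# "use-capacity-reservations-first" consumes matching reserved capacity
# before falling back to regular On-Demand capacity; a preference of
# "none" opts the cluster out of open capacity reservations entirely.
on_demand_specification = {
    "AllocationStrategy": "lowest-price",
    "CapacityReservationOptions": {
        "UsageStrategy": "use-capacity-reservations-first",
        "CapacityReservationPreference": "open",  # or "none"
        # To target specific reservations, reference a resource group
        # instead of setting a preference (hypothetical ARN):
        # "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:...",
    },
}
```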

### Required IAM permissions for an allocation strategy

Your [Service role for Amazon EMR (EMR role)](emr-iam-role.md) requires additional permissions to create a cluster that uses the allocation strategy option for On-Demand or Spot Instance fleets.

We automatically include these permissions in the default Amazon EMR service role [`EMR_DefaultRole`](emr-iam-role.md) and the Amazon EMR managed policy [`AmazonEMRServicePolicy_v2`](emr-managed-iam-policies.md).

If you use a custom service role or managed policy for your cluster, you must add the following permissions:

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DeleteLaunchTemplate",
        "ec2:CreateLaunchTemplate",
        "ec2:DescribeLaunchTemplates",
        "ec2:CreateLaunchTemplateVersion",
        "ec2:CreateFleet"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Deletelaunchtemplate"
    }
  ]
}
```

------

The following service role permissions are required to create a cluster that uses open or targeted capacity reservations. You must include these permissions in addition to the permissions required for using the allocation strategy option.

**Example Policy document for service role capacity reservations**  
To use open capacity reservations, you must include the following additional permissions.    

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

**Example**  
To use targeted capacity reservations, you must include the following additional permissions.    

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeCapacityReservations",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DeleteLaunchTemplateVersions",
        "resource-groups:ListGroupResources"
      ],
      "Resource": [
        "*"
      ],
      "Sid": "AllowEC2Describecapacityreservations"
    }
  ]
}
```

## Configure instance fleets for your cluster

------
#### [ Console ]

**To create a cluster with instance fleets with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance fleets**.

1. For each **Node group**, select **Add instance type** and choose up to 5 instance types for primary and core instance fleets, and up to 15 instance types for task instance fleets. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit weighted capacity**  
For the core node group, change this value to any number of units that fits your applications. The number of YARN vCores for each fleet instance type is used as the default weighted capacity units. You can't edit weighted capacity for the primary node.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily the maximum Spot price.

1. Optionally, to add security groups for your nodes, expand **EC2 security groups (firewall)** in the **Networking** section and select your security group for each node type.

1. Optionally, select the check box next to **Apply allocation strategy** if you want to use the allocation strategy option, and select the allocation strategy that you want to specify for the Spot Instances. You shouldn't select this option if your Amazon EMR service role doesn't have the required permissions. For more information, see [Allocation strategy for instance fleets](#emr-instance-fleet-allocation-strategy).

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ AWS CLI ]

To create and launch a cluster with instance fleets with the AWS CLI, follow these guidelines:
+ To create and launch a cluster with instance fleets, use the `create-cluster` command along with `--instance-fleets` parameters.
+ To get configuration details about the instance fleets in a cluster, use the `list-instance-fleets` command.
+ To add multiple custom Amazon Linux AMIs to a cluster you’re creating, use the `CustomAmiId` option with each `InstanceType` specification. You can configure instance fleet nodes with multiple instance types and multiple custom AMIs to fit your requirements. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli). 
+ To make changes to the target capacity for an instance fleet, use the `modify-instance-fleet` command.
+ To add a task instance fleet to a cluster that doesn't already have one, use the `add-instance-fleet` command.
+ To add multiple custom AMIs to the task instance fleet, use the `CustomAmiId` argument with the `add-instance-fleet` command. See [Examples: Creating a cluster with the instance fleets configuration](#create-cluster-instance-fleet-cli).
+ To use the allocation strategy option when creating an instance fleet, update the service role to include the example policy document in the following section.
+ To use the capacity reservations options when creating an instance fleet with On-Demand allocation strategy, update the service role to include the example policy document in the following section.
+ The permissions required for instance fleets are automatically included in the default Amazon EMR service role and managed policy (`EMR_DefaultRole` and `AmazonEMRServicePolicy_v2`). If you use a custom service role or custom managed policy for your cluster, you must add the new permissions for allocation strategy described in the following section.

------

## Examples: Creating a cluster with the instance fleets configuration

The following examples demonstrate `create-cluster` commands with a variety of options that you can combine.

**Note**  
If you have not previously created the default Amazon EMR service role and EC2 instance profile, use `aws emr create-default-roles` to create them before using the `create-cluster` command.

**Example: On-Demand primary, On-Demand core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}']
```

**Example: Spot primary, Spot core with single instance type, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

**Example: On-Demand primary, mixed core with single instance type, single EC2 subnet**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=2,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=2}']
```
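In the preceding example, `WeightedCapacity=2` means that each provisioned m5.xlarge counts as 2 units toward the fleet's target capacities, so the Spot target of 6 units is met by 3 instances and the On-Demand target of 2 units by 1 instance. The arithmetic can be sketched as follows (illustrative only, not Amazon EMR's implementation):

```shell
# Illustrative arithmetic only -- not Amazon EMR's implementation.
# With WeightedCapacity=2, each instance counts as 2 units toward the target,
# so a Spot target of 6 units is met by 3 instances (rounding up if needed).
target_spot=6
weight=2
instances=$(( (target_spot + weight - 1) / weight ))
echo "$instances"   # 3
```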

**Example: On-Demand primary, Spot core with multiple weighted instance types, timeout for Spot, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetSpotCapacity=11,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'}
```

**Example: On-Demand primary, mixed core and task with multiple weighted instance types, timeout for core Spot Instances, range of EC2 subnets**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=8,TargetSpotCapacity=6,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}',\
'{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
    InstanceFleetType=TASK,TargetOnDemandCapacity=3,TargetSpotCapacity=3,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,WeightedCapacity=3}']
```

**Example: Spot primary, no core or task, Amazon EBS configuration, default VPC**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=60,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5,\
EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,\
SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}}']
```

**Example: Multiple custom AMIs, multiple instance types, On-Demand primary, On-Demand core**  

```
aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}','{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}','{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Add a task node to a running cluster with multiple instance types and multiple custom AMIs**  

```
aws emr add-instance-fleet --cluster-id j-123456 \
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,CustomAmiId=ami-123456}',\
'{InstanceType=m6g.xlarge,CustomAmiId=ami-234567}']
```

**Example: Use a JSON configuration file**  
You can configure instance fleet parameters in a JSON file, and then reference the JSON file as the sole parameter for instance fleets. For example, the following command references a JSON configuration file, `my-fleet-config.json`:  

```
aws emr create-cluster --release-label emr-5.30.0 --service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
--instance-fleets file://my-fleet-config.json
```
The *my-fleet-config.json* file specifies primary, core, and task instance fleets as shown in the following example. The core instance fleet uses a maximum Spot price (`BidPriceAsPercentageOfOnDemandPrice`) expressed as a percentage of the On-Demand price, while the primary and task instance fleets use a maximum Spot price (`BidPrice`) expressed as a string in USD.  

```
[
    {
        "Name": "Masterfleet",
        "InstanceFleetType": "MASTER",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    },
    {
        "Name": "Corefleet",
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": 1,
        "TargetOnDemandCapacity": 1,
        "LaunchSpecifications": {
          "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            "CapacityReservationOptions": 
            {
                "UsageStrategy": "use-capacity-reservations-first",
                "CapacityReservationResourceGroupArn": "String"
            }
        },
            "SpotSpecification": {
                "AllocationStrategy": "capacity-optimized",
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPriceAsPercentageOfOnDemandPrice": 100
            }
        ]
    },
    {
        "Name": "Taskfleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 1,
        "LaunchSpecifications": {
          "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            "CapacityReservationOptions": 
            {
                "CapacityReservationPreference": "none"
            }
        },
            "SpotSpecification": {
                "TimeoutDurationMinutes": 120,
                "TimeoutAction": "TERMINATE_CLUSTER"
            }
        },
        "InstanceTypeConfigs": [
            {
                "InstanceType": "m5.xlarge",
                "BidPrice": "0.89"
            }
        ]
    }
]
```

## Modify target capacities for an instance fleet

Use the `modify-instance-fleet` command to specify new target capacities for an instance fleet. You must specify the cluster ID and the instance fleet ID. Use the `list-instance-fleets` command to retrieve instance fleet IDs.

```
aws emr modify-instance-fleet --cluster-id <cluster-id> \
  --instance-fleet \
    InstanceFleetId='<instance-fleet-id>',TargetOnDemandCapacity=1,TargetSpotCapacity=1
```

## Add a task instance fleet to a cluster

If a cluster has only primary and core instance fleets, you can use the `add-instance-fleet` command to add a task instance fleet. You can use this command only to add task instance fleets.

```
aws emr add-instance-fleet --cluster-id <cluster-id> \
  --instance-fleet \
    InstanceFleetType=TASK,TargetSpotCapacity=1,\
LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=20,TimeoutAction=TERMINATE_CLUSTER}'},\
InstanceTypeConfigs=['{InstanceType=m5.xlarge,BidPrice=0.5}']
```

## Get configuration details of instance fleets in a cluster

Use the `list-instance-fleets` command to get configuration details of the instance fleets in a cluster. The command takes a cluster ID as input. The following example demonstrates the command and its output for a cluster that contains a primary instance fleet and a core instance fleet. For full response syntax, see [ListInstanceFleets](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_ListInstanceFleets.html) in the *Amazon EMR API Reference*.

```
aws emr list-instance-fleets --cluster-id <cluster-id>
```

```
{
    "InstanceFleets": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759094.637,
                    "CreationDateTime": 1488758719.817
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 6,
            "Name": "CORE",
            "InstanceFleetType": "CORE",
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 60,
                    "TimeoutAction": "TERMINATE_CLUSTER"
                }
            },
            "ProvisionedOnDemandCapacity": 2,
            "InstanceTypeSpecifications": [
                {
                    "BidPrice": "0.5",
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 2
                }
            ],
            "Id": "if-1ABC2DEFGHIJ3"
        },
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1488759058.598,
                    "CreationDateTime": 1488758719.811
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": ""
                }
            },
            "ProvisionedSpotCapacity": 0,
            "Name": "MASTER",
            "InstanceFleetType": "MASTER",
            "ProvisionedOnDemandCapacity": 1,
            "InstanceTypeSpecifications": [
                {
                    "BidPriceAsPercentageOfOnDemandPrice": 100.0,
                    "InstanceType": "m5.xlarge",
                    "WeightedCapacity": 1
                }
            ],
           "Id": "if-2ABC4DEFGHIJ4"
        }
    ]
}
```
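The `Id` values in this response are what `modify-instance-fleet` expects. If you save the response to a file, you can pull the IDs out with standard tools; the following is a minimal sketch (the file name and sample data are hypothetical):

```shell
# Hypothetical sample, as if saved with:
#   aws emr list-instance-fleets --cluster-id <cluster-id> > fleets.json
cat > fleets.json <<'EOF'
{
    "InstanceFleets": [
        { "InstanceFleetType": "CORE", "Id": "if-1ABC2DEFGHIJ3" },
        { "InstanceFleetType": "MASTER", "Id": "if-2ABC4DEFGHIJ4" }
    ]
}
EOF
# Extract just the fleet IDs
grep -o '"Id": "[^"]*"' fleets.json
```

In practice, adding `--query 'InstanceFleets[].Id'` to the `list-instance-fleets` command returns the same information without post-processing.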

# Reconfiguring instance fleets for your Amazon EMR cluster


With Amazon EMR version 5.21.0 and later, you can reconfigure cluster applications and specify additional configuration classifications for each instance fleet in a running cluster. To do so, you can use the AWS Command Line Interface (AWS CLI) or the AWS SDK.

You can track the state of an instance fleet by viewing CloudWatch events. For more information, see [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig).

**Note**  
You can only override the cluster Configurations object specified during cluster creation. For more information about Configurations objects, see [RunJobFlow request syntax](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html#API_RunJobFlow_RequestSyntax).

When you submit a reconfiguration request using the Amazon EMR console, the AWS Command Line interface (AWS CLI), or the AWS SDK, Amazon EMR checks the existing on-cluster configuration file. If there are differences between the existing configuration and the file that you supply, Amazon EMR initiates reconfiguration actions, restarts some applications, and resets any manually modified configurations, such as configurations that you have modified while connected to your cluster using SSH, to the cluster defaults for the specified instance fleet.

## Reconfiguration behaviors


Reconfiguration overwrites on-cluster configuration with the newly submitted configuration set, and can overwrite configuration changes made outside of the reconfiguration API.

Amazon EMR follows a rolling process to reconfigure instances in the task and core instance fleets. Only a percentage of the instances for a single instance type are modified and restarted at a time. If your instance fleet has multiple instance type configurations, they are reconfigured in parallel.

Reconfigurations are declared at the [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) level. For a visual example, refer to [Reconfigure an instance fleet](#instance-fleet-reconfiguration-cli-sdk). You can submit reconfiguration requests that contain updated configuration settings for one or more instance types within a single request. You must include all instance types that are part of your instance fleet in the modify request; however, only instance types with populated configuration fields undergo reconfiguration, while the other `InstanceTypeConfig` entries in the fleet remain unchanged. A reconfiguration is considered successful only when all instances of the specified instance types complete reconfiguration. If any instance fails to reconfigure, the entire instance fleet automatically reverts to its last known stable configuration.

## Limitations


When you reconfigure an instance fleet in a running cluster, consider the following limitations:
+ Non-YARN applications can fail during restart or cause cluster issues, especially if the applications aren't configured properly. Clusters approaching maximum memory and CPU usage may run into issues after the restart process. This is especially true for the primary instance fleet. Consult the [Troubleshoot instance fleet reconfiguration](#instance-fleet-reconfiguration-troubleshooting) section.
+ Resize and reconfiguration operations do not happen in parallel. A reconfiguration request waits for any ongoing resize to complete, and vice versa.
+ After reconfiguring an instance fleet, Amazon EMR restarts the applications to allow the new configurations to take effect. Job failure or other unexpected application behavior might occur if the applications are in use during reconfiguration.
+ If a reconfiguration for any instance type config under an instance fleet fails, Amazon EMR reverts the configuration parameters to the previous working version for the entire instance fleet, and emits events and updated state details. If the reversion process also fails, you must submit a new `ModifyInstanceFleet` request to recover the instance fleet from the `ARRESTED` state. Reversion failures result in [Instance fleet reconfiguration events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-instance-fleet-events-reconfig) and state changes. 
+ Reconfiguration requests for Phoenix configuration classifications are only supported in Amazon EMR version 5.23.0 and later, and are not supported in Amazon EMR version 5.21.0 or 5.22.0.
+ Reconfiguration requests for HBase configuration classifications are only supported in Amazon EMR version 5.30.0 and later, and are not supported in Amazon EMR versions 5.23.0 through 5.29.0.
+ Reconfiguring the `hdfs-encryption-zones` classification or any of the Hadoop KMS configuration classifications is not supported on an Amazon EMR cluster with multiple primary nodes.
+ Amazon EMR currently doesn't support certain reconfiguration requests for the YARN capacity scheduler that require restarting the YARN ResourceManager. For example, you cannot completely remove a queue.
+ When YARN needs to restart, all running YARN jobs are typically terminated and lost. This might cause data processing delays. To keep YARN jobs running during a YARN restart, you can either create an Amazon EMR cluster with multiple primary nodes or set `yarn.resourcemanager.recovery.enabled` to `true` in your `yarn-site` configuration classification. For more information about using multiple primary nodes, see [High availability YARN ResourceManager](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-YARN).
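As a sketch, the `yarn-site` classification that enables ResourceManager recovery, mentioned in the last item above, might look like the following (verify that your Amazon EMR release supports this property before relying on it):

```
[
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.recovery.enabled": "true"
        }
    }
]
```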

## Reconfigure an instance fleet


------
#### [ Using the AWS CLI ]

Use the `modify-instance-fleet` command to specify a new configuration for an instance fleet in a running cluster.

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

**Example – Replace a configuration for an instance fleet**

**Warning**  
Specify all `InstanceTypeConfig` fields that you used at launch. Not including fields can result in overwriting specifications you declared at launch. Refer to [InstanceTypeConfig](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceTypeConfig.html) for a list.

The following example references a configuration JSON file called `instanceFleet.json` to edit the properties of the YARN NodeManager disk health checker for an instance fleet.

**Instance Fleet Modification JSON**

1. Prepare your configuration classification, and save it as `instanceFleet.json` in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                  other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
           ]
   }
   ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

**Example – Add a configuration to an instance fleet**

If you want to add a configuration to an instance type, you must include all previously specified configurations for that instance type in your new `ModifyInstanceFleet` request. Otherwise, the previously specified configurations are removed.

The following example adds a property for the YARN NodeManager virtual memory checker. The configuration also includes previously specified values for the YARN NodeManager disk health checker so that the values won't be overwritten.

1. Prepare the following contents in `instanceFleet.json` and save it in the same directory where you will run the command.

   ```
   {
       "InstanceFleetId":"if-1xxxxxxx9",
       "InstanceTypeConfigs": [
               {
                   "InstanceType": "m5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"true",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0",
                               "yarn.nodemanager.vmem-check-enabled":"true",
                               "yarn.nodemanager.vmem-pmem-ratio":"3.0"
                           }
                       }
                   ]
               },
               {
                   "InstanceType": "r5.xlarge",
                   other InstanceTypeConfig fields
                   "Configurations": [
                       {
                           "Classification": "yarn-site",
                           "Properties": {
                               "yarn.nodemanager.disk-health-checker.enable":"false",
                               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"70.0"
                           }
                       }
                   ]
               }
           ]      
   }
   ```

1. Run the following command.

   ```
   aws emr modify-instance-fleet \
   --cluster-id j-2AL4XXXXXX5T9 \
   --region us-west-2 \
   --instance-fleet instanceFleet.json
   ```

------
#### [ Using the Java SDK ]

**Note**  
In the following examples, replace **j-2AL4XXXXXX5T9** with your cluster ID, and replace **if-1xxxxxxx9** with your instance fleet ID.

The following code snippet provides a new configuration for an instance fleet using the AWS SDK for Java.

```
AWSCredentials credentials = new BasicAWSCredentials("access-key", "secret-key");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");
        
Configuration newConfiguration = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);
    
List<InstanceTypeConfig> instanceTypeConfigList = new ArrayList<>();

// currentInstanceTypeConfigList holds the instance fleet's existing InstanceTypeConfig objects
for (InstanceTypeConfig instanceTypeConfig : currentInstanceTypeConfigList) {
    instanceTypeConfigList.add(new InstanceTypeConfig()
        .withInstanceType(instanceTypeConfig.getInstanceType())
        .withBidPrice(instanceTypeConfig.getBidPrice())
        .withWeightedCapacity(instanceTypeConfig.getWeightedCapacity())
        .withConfigurations(newConfiguration)
    );
}

InstanceFleetModifyConfig instanceFleetModifyConfig = new InstanceFleetModifyConfig()
    .withInstanceFleetId("if-1xxxxxxx9")
    .withInstanceTypeConfigs(instanceTypeConfigList);
    
ModifyInstanceFleetRequest modifyInstanceFleetRequest = new ModifyInstanceFleetRequest()
    .withInstanceFleet(instanceFleetModifyConfig)
    .withClusterId("j-2AL4XXXXXX5T9");

emr.modifyInstanceFleet(modifyInstanceFleetRequest);
```

------

## Troubleshoot instance fleet reconfiguration


If the reconfiguration process for any instance type within an instance fleet fails, Amazon EMR reverts the in-progress reconfiguration and logs a failure message using an Amazon CloudWatch Events event. The event provides a brief summary of the reconfiguration failure and lists the instances for which reconfiguration failed, with corresponding failure messages. The following is an example failure message.

`Amazon EMR couldn't revert the instance fleet if-1xxxxxxx9 in the Amazon EMR cluster j-2AL4XXXXXX5T9 (ExampleClusterName) to the previously successful configuration at 2021-01-01 00:00 UTC. The reconfiguration reversion failed because of Instance i-xxxxxxx1, i-xxxxxxx2, i-xxxxxxx3 failed with message "This is an example failure message"...`

### To access node provisioning logs


Use SSH to connect to the node on which reconfiguration has failed. For instructions, see [Connect to your Linux instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-linux-instance.html) in the *Amazon EC2 User Guide*. 

------
#### [ Accessing logs by connecting to a node ]

1. Navigate to the following directory, which contains the node provisioning log files.

   ```
   /mnt/var/log/provision-node/
   ```

1. Open the reports subdirectory and search for the node provisioning report for your reconfiguration. The reports directory organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

1. You can examine a report using a file viewer like `zless`, as in the following example.

   ```
   zless 202104061715.yaml.gz
   ```

------
#### [ Accessing logs using Amazon S3 ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/). Then open the Amazon S3 bucket that you specified when you configured the cluster to archive log files.

1. Navigate to the following folder, which contains the node provisioning log files:

   ```
   amzn-s3-demo-bucket/elasticmapreduce/<cluster-id>/node/<instance-id>/provision-node/
   ```

1. Open the reports folder and search for the node provisioning report for your reconfiguration. The reports folder organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process. The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

To view a log file, you can download it from Amazon S3 to your local machine as a text file. For instructions, see [Downloading an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html).

------

Each log file contains a detailed provisioning report for the associated reconfiguration. To find error message information, you can search for the `err` log level of a report. Report format depends on the version of Amazon EMR on your cluster. For example, Amazon EMR release versions 5.32.0 and 6.2.0 and later use the following format: 

```
- level: err
  message: 'Example detailed error message.'
  source: Puppet
  tags:
  - err
  time: '2021-01-01 00:00:00.000000 +00:00'
  file: 
  line:
```
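A quick way to surface only the error-level entries is to search the compressed report directly. The following sketch creates a small sample report in the format above to illustrate; the file names and contents are hypothetical:

```shell
# Create a small sample report in the format shown above (hypothetical data)
cat > sample-report.yaml <<'EOF'
- level: info
  message: 'Provisioning step completed.'
- level: err
  message: 'Example detailed error message.'
EOF
gzip -f sample-report.yaml

# Search the compressed report for error-level entries without unpacking it
zgrep -A 1 'level: err' sample-report.yaml.gz
```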

# Use capacity reservations with instance fleets in Amazon EMR


To launch On-Demand Instance fleets with capacity reservation options, attach the additional service role permissions that are required to use capacity reservations. Because capacity reservation options must be used together with an On-Demand allocation strategy, you must also include the permissions required for allocation strategy in your service role and managed policy. For more information, see [Required IAM permissions for an allocation strategy](emr-instance-fleet.md#create-cluster-allocation-policy).

Amazon EMR supports both open and targeted capacity reservations. The following topics show instance fleet configurations that you can use with the `RunJobFlow` action or `create-cluster` command to launch instance fleets using On-Demand Capacity Reservations. 

## Use open capacity reservations on a best-effort basis


If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy, and Availability Zone) available in your account, the capacity reservations are applied automatically. However, your capacity reservations are not guaranteed to be used. To provision the cluster, Amazon EMR evaluates all of the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match that instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused.

After the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR then provisions task nodes into instance pools in the selected Availability Zone, starting with the lowest-priced pool, until all the task nodes are provisioned. Available open capacity reservations that match those instance pools are applied automatically.
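The selection logic described above can be sketched as follows; the prices and capacities are made up for illustration, and this is not the actual Amazon EMR implementation:

```shell
# Illustrative only: pick the lowest-priced pool with enough capacity.
# Columns in the sample data: instance-type  price-per-unit  available-capacity
requested=100
best=""
best_price=999999
while read -r pool price capacity; do
  if [ "$capacity" -ge "$requested" ] && [ "$price" -lt "$best_price" ]; then
    best=$pool
    best_price=$price
  fi
done <<'EOF'
c5.xlarge 1 150
m5.xlarge 2 100
r5.xlarge 3 100
EOF
echo "$best"   # c5.xlarge
```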

The following examples demonstrate the Amazon EMR capacity allocation logic when open capacity reservations are used on a best-effort basis.

**Example 1: Lowest-price instance pool in launch request has available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. Your available open capacity reservations in that instance pool are used automatically.


|  |  | 
| --- | --- |
| On-Demand Strategy | lowest-price | 
| Requested Capacity | 100 | 

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge | 
| --- | --- | --- | --- |
| Available open capacity reservations | 150 | 100 | 100 | 
| On-Demand price | \$1 | \$1.1 | \$1.2 | 
| Instances provisioned | 100 | - | - | 
| Open capacity reservations used | 100 | - | - | 
| Remaining open capacity reservations | 50 | 100 | 100 | 

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Example 2: Lowest-price instance pool in launch request does not have available open capacity reservations**

In this case, Amazon EMR launches capacity in the lowest-price instance pool with On-Demand Instances. However, your open capacity reservations remain unused.


| On-Demand Strategy | lowest-price |
| --- | --- |
| Requested Capacity | 100 |

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations (before launch) | - | - | 100 |
| On-Demand Price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Open capacity reservations used | - | - | - |
| Available open capacity reservations (after launch) | - | - | 100 |

**Configure Instance Fleets to use open capacity reservations on best-effort basis**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `open`. Alternatively, if you leave this field blank, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open`.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "CapacityReservationPreference": "open"
        }
    }
}
```

You can also use the AWS CLI to create an instance fleet-based cluster that uses open capacity reservations.

```
aws emr create-cluster \
	--name 'open-ODCR-cluster' \
	--release-label emr-5.30.0 \
	--service-role EMR_DefaultRole \
	--ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
	--instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
	  InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
	  LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=open}}'}
```

Where,
+ `open-ODCR-cluster` is replaced with the name of the cluster using open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Use open capacity reservations first


You can choose to override the lowest-price allocation strategy and prioritize using available open capacity reservations first while provisioning an Amazon EMR cluster. In this case, Amazon EMR evaluates all the instance pools with capacity reservations specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with capacity reservations have sufficient capacity for the requested core nodes, Amazon EMR falls back to the best-effort case described in the previous topic. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and uses the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. If available open capacity reservations do not match the instance pool, they remain unused. 

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR uses the open capacity reservations available across each instance pool in the selected Availability Zone first, and only if required, uses the lowest-price strategy to provision any remaining task nodes.
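A sketch of the `use-capacity-reservations-first` core-node selection follows. This is a simplified model, not the actual Amazon EMR implementation: it prefers the cheapest pool whose open capacity reservations can cover every requested core node, and otherwise falls back to the plain lowest-price, best-effort behavior.

```python
# use-capacity-reservations-first, sketched: prefer the lowest-priced pool
# whose open CRs cover all requested core nodes; otherwise fall back to
# lowest price with best-effort CR usage. Illustrative only.

def provision_core_cr_first(pools, requested):
    """pools: dicts with 'type', 'price', and 'open_crs'.
    Returns (instance type, CRs applied)."""
    covered = [p for p in pools if p["open_crs"] >= requested]
    if covered:                                      # CRs can cover all nodes
        chosen = min(covered, key=lambda p: p["price"])
        return chosen["type"], requested
    chosen = min(pools, key=lambda p: p["price"])    # fallback: lowest price
    return chosen["type"], min(requested, chosen["open_crs"])
```

With the numbers from Example 1 below, this picks `r5.xlarge` (the only pool whose reservations cover all 100 nodes); with Example 2, it falls back and picks `c5.xlarge`, applying its 10 reservations.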

The following examples show how Amazon EMR allocates capacity when it uses open capacity reservations first.

**Example 1: Instance pool with available open capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available open capacity reservations regardless of instance pool price. As a result, your open capacity reservations are used whenever possible, until all core nodes are provisioned.


| On-Demand Strategy | lowest-price |
| --- | --- |
| Requested Capacity | 100 |
| Usage Strategy | use-capacity-reservations-first |

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations (before launch) | - | - | 150 |
| On-Demand Price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | - | - | 100 |
| Open capacity reservations used | - | - | 100 |
| Available open capacity reservations (after launch) | - | - | 50 |

**Example 2: Instance pool with available open capacity reservations in launch request does not have sufficient capacity for core nodes**

In this case, Amazon EMR falls back to launching core nodes using the lowest-price strategy, with a best-effort attempt to use capacity reservations.


| On-Demand Strategy | lowest-price |
| --- | --- |
| Requested Capacity | 100 |
| Usage Strategy | use-capacity-reservations-first |

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available open capacity reservations (before launch) | 10 | 50 | 50 |
| On-Demand Price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Open capacity reservations used | 10 | - | - |
| Available open capacity reservations (after launch) | - | 50 | 50 |

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Configure Instance Fleets to use open capacity reservations first**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price` and set `UsageStrategy` in `CapacityReservationOptions` to `use-capacity-reservations-first`.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "UsageStrategy": "use-capacity-reservations-first"
        }
    }
}
```

You can also use the AWS CLI to create an instance fleet-based cluster that uses capacity reservations first.

```
aws emr create-cluster \
  --name 'use-CR-first-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first}}'}
```

Where,
+ `use-CR-first-cluster` is replaced with the name of the cluster using open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Use targeted capacity reservations first


When you provision an Amazon EMR cluster, you can choose to override the lowest-price allocation strategy and prioritize using available targeted capacity reservations first. In this case, Amazon EMR evaluates all the instance pools with targeted capacity reservations specified in the launch request and picks the one with the lowest price that has sufficient capacity to launch all the requested core nodes. If none of the instance pools with targeted capacity reservations have sufficient capacity for core nodes, Amazon EMR falls back to the best-effort case described earlier. That is, Amazon EMR re-evaluates all the instance pools specified in the launch request and selects the one with the lowest price that has sufficient capacity to launch all the requested core nodes. Available open capacity reservations that match the instance pool are applied automatically. However, targeted capacity reservations remain unused.

Once the core nodes are provisioned, the Availability Zone is selected and fixed. Amazon EMR provisions task nodes into instance pools with targeted capacity reservations, starting with the lowest-priced ones first, in the selected Availability Zone until all the task nodes are provisioned. Amazon EMR tries to use the targeted capacity reservations available across each instance pool in the selected Availability Zone first. Then, only if required, Amazon EMR uses the lowest-price strategy to provision any remaining task nodes.

The following examples show how Amazon EMR allocates capacity when it uses targeted capacity reservations first.

**Example 1: Instance pool with available targeted capacity reservations in launch request has sufficient capacity for core nodes**

In this case, Amazon EMR launches capacity in the instance pool with available targeted capacity reservations regardless of instance pool price. As a result, your targeted capacity reservations are used whenever possible until all core nodes are provisioned.


| On-Demand Strategy | lowest-price |
| --- | --- |
| Usage Strategy | use-capacity-reservations-first |
| Requested Capacity | 100 |

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available targeted capacity reservations (before launch) | - | - | 150 |
| On-Demand Price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | - | - | 100 |
| Targeted capacity reservations used | - | - | 100 |
| Available targeted capacity reservations (after launch) | - | - | 50 |

**Example 2: Instance pool with available targeted capacity reservations in launch request does not have sufficient capacity for core nodes**  


| On-Demand Strategy | lowest-price |
| --- | --- |
| Requested Capacity | 100 |
| Usage Strategy | use-capacity-reservations-first |

| Instance Type | c5.xlarge | m5.xlarge | r5.xlarge |
| --- | --- | --- | --- |
| Available targeted capacity reservations (before launch) | 10 | 50 | 50 |
| On-Demand Price | \$ | \$\$ | \$\$\$ |
| Instances provisioned | 100 | - | - |
| Targeted capacity reservations used | 10 | - | - |
| Available targeted capacity reservations (after launch) | - | 50 | 50 |

After the instance fleet is launched, you can run [describe-capacity-reservations](https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-capacity-reservations.html) to see how many unused capacity reservations remain.

**Configure Instance Fleets to use targeted capacity reservations first**

When you use the `RunJobFlow` action to create an instance fleet-based cluster, set the On-Demand allocation strategy to `lowest-price`, set `UsageStrategy` in `CapacityReservationOptions` to `use-capacity-reservations-first`, and set `CapacityReservationResourceGroupArn` in `CapacityReservationOptions` to your resource group ARN. For more information, see [Work with capacity reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-using.html) in the *Amazon EC2 User Guide*.

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "UsageStrategy": "use-capacity-reservations-first",
            "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup"
        }
    }
}
```

Where `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` is replaced with your resource group ARN.

You can also use the AWS CLI to create an instance fleet-based cluster that uses targeted capacity reservations.

```
aws emr create-cluster \
  --name 'targeted-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,\
InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={UsageStrategy=use-capacity-reservations-first,CapacityReservationResourceGroupArn=arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup}}'}
```

Where,
+ `targeted-CR-cluster` is replaced with the name of your cluster using targeted capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.
+ `arn:aws:resource-groups:sa-east-1:123456789012:group/MyCRGroup` is replaced with your resource group ARN.

## Avoid using available open capacity reservations


**Example**  
If you want to avoid unexpectedly using any of your open capacity reservations when launching an Amazon EMR cluster, set the On-Demand allocation strategy to `lowest-price` and set `CapacityReservationPreference` in `CapacityReservationOptions` to `none`. Otherwise, Amazon EMR defaults the On-Demand Instances' capacity reservation preference to `open` and tries to use available open capacity reservations on a best-effort basis.  

```
"LaunchSpecifications": {
    "OnDemandSpecification": {
        "AllocationStrategy": "lowest-price",
        "CapacityReservationOptions": {
            "CapacityReservationPreference": "none"
        }
    }
}
```
You can also use the AWS CLI to create an instance fleet-based cluster without using any open capacity reservations.  

```
aws emr create-cluster \
  --name 'none-CR-cluster' \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=c4.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=100,InstanceTypeConfigs=['{InstanceType=c5.xlarge},{InstanceType=m5.xlarge},{InstanceType=r5.xlarge}'],\
LaunchSpecifications={OnDemandSpecification='{AllocationStrategy=lowest-price,CapacityReservationOptions={CapacityReservationPreference=none}}'}
```
Where,  
+ `none-CR-cluster` is replaced with the name of your cluster that is not using any open capacity reservations.
+ `subnet-22XXXX01` is replaced with the subnet ID.

## Scenarios for using capacity reservations


You can benefit from using capacity reservations in the following scenarios.

**Scenario 1: Rotate a long-running cluster using capacity reservations**  
When rotating a long running cluster, you might have strict requirements on the instance types and Availability Zones for the new instances you provision. With capacity reservations, you can use capacity assurance to complete the cluster rotation without interruptions.

![\[Cluster rotation using available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-longrunning-cluster-diagram.png)


**Scenario 2: Provision successive short-lived clusters using capacity reservations**  
You can also use capacity reservations to provision a group of successive, short-lived clusters for individual workloads so that when you terminate a cluster, the next cluster can use the capacity reservations. You can use targeted capacity reservations to ensure that only the intended clusters use the capacity reservations.

![\[Short-lived cluster provisioning that uses available capacity reservations\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/odcr-short-cluster-diagram.png)


# Configure uniform instance groups for your Amazon EMR cluster


With the instance groups configuration, each node type (primary, core, or task) consists of the same instance type and the same purchasing option for instances: On-Demand or Spot. You specify these settings when you create an instance group. They can't be changed later. You can, however, add instances of the same type and purchasing option to core and task instance groups. You can also remove instances.

If the cluster's On-Demand Instances match the attributes of open capacity reservations (instance type, platform, tenancy and Availability Zone) available in your account, the capacity reservations are applied automatically. You can use open capacity reservations for primary, core, and task nodes. However, you cannot use targeted capacity reservations or prevent instances from launching into open capacity reservations with matching attributes when you provision clusters using instance groups. If you want to use targeted capacity reservations or prevent instances from launching into open capacity reservations, use Instance Fleets instead. For more information, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

To add different instance types after a cluster is created, you can add additional task instance groups. You can choose different instance types and purchasing options for each instance group. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).

When launching instances, the On-Demand Instance's capacity reservation preference defaults to `open`, which enables it to run in any open capacity reservation that has matching attributes (instance type, platform, Availability Zone). For more information about On-Demand Capacity Reservations, see [Use capacity reservations with instance fleets in Amazon EMR](on-demand-capacity-reservations.md).

This section covers creating a cluster with uniform instance groups. For more information about modifying an existing instance group by adding or removing instances manually or with automatic scaling, see [Manage Amazon EMR clusters](emr-manage.md).

## Use the console to configure uniform instance groups


------
#### [ Console ]

**To create a cluster with instance groups with the new console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and choose **Create cluster**.

1. Under **Cluster configuration**, choose **Instance groups**.

1. Under **Node groups**, there is a section for each type of node group. For the primary node group, select the **Use multiple primary nodes** check box if you want to have 3 primary nodes. Select the **Use Spot purchasing option** check box if you want to use Spot purchasing.

1. For the primary and core node groups, select **Add instance type** and choose up to five instance types. For the task group, select **Add instance type** and choose up to fifteen instance types. Amazon EMR might provision any mix of these instance types when it launches the cluster.

1. Under each node group type, choose the **Actions** dropdown menu next to each instance to change these settings:  
**Add EBS volumes**  
Specify EBS volumes to attach to the instance type after Amazon EMR provisions it.  
**Edit maximum Spot price**  
Specify a maximum Spot price for each instance type in a fleet. You can set this price either as a percentage of the On-Demand price, or as a specific dollar amount. If the current Spot price in an Availability Zone is below your maximum Spot price, Amazon EMR provisions Spot Instances. You pay the Spot price, not necessarily the maximum Spot price.

1. Optionally, expand **Node configuration** to enter a JSON configuration or to load JSON from Amazon S3.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------

## Use the AWS CLI to create a cluster with uniform instance groups


To specify the instance groups configuration for a cluster using the AWS CLI, use the `create-cluster` command along with the `--instance-groups` parameter. Amazon EMR assumes the On-Demand Instance option unless you specify the `BidPrice` argument for an instance group. For examples of `create-cluster` commands that launch uniform instance groups with On-Demand Instances and a variety of cluster options, type `aws emr create-cluster help` at the command line, or see [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) in the *AWS CLI Command Reference*.

You can use the AWS CLI to create uniform instance groups in a cluster that uses Spot Instances. The offered Spot price depends on the Availability Zone. When you use the CLI or API, you can specify the Availability Zone either with the `AvailabilityZone` argument (if you're using an EC2-Classic network) or with the `SubnetId` argument of the `--ec2-attributes` parameter. The Availability Zone or subnet that you select applies to the cluster, so it's used for all instance groups. If you don't specify an Availability Zone or subnet explicitly, Amazon EMR selects the Availability Zone with the lowest Spot price when it launches the cluster.

The following example demonstrates a `create-cluster` command that creates primary, core, and two task instance groups that all use Spot Instances. Replace *myKey* with the name of your Amazon EC2 key pair. 

**Note**  
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

```
aws emr create-cluster --name "MySpotCluster" \
  --release-label emr-7.12.0 \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,BidPrice=0.25 \
    InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=4,BidPrice=0.03 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=2,BidPrice=0.04
```

Using the CLI, you can create uniform instance group clusters that specify a unique custom AMI for each instance type in the instance group. This allows you to use different instance architectures in the same instance group. Each instance type must use a custom AMI with a matching architecture. For example, you would configure an m5.xlarge instance type with an x86\_64 architecture custom AMI, and an m6g.xlarge instance type with a corresponding `aarch64` (ARM) architecture custom AMI. 

The following example shows a uniform instance group cluster created with two instance types, each with its own custom AMI. Notice that the custom AMIs are specified only at the instance type level, not at the cluster level. This is to avoid conflicts between the instance type AMIs and an AMI at the cluster level, which would cause the cluster launch to fail. 

```
aws emr create-cluster \
  --release-label emr-5.30.0 \
  --service-role EMR_DefaultRole \
  --ec2-attributes SubnetId=subnet-22XXXX01,InstanceProfile=EMR_EC2_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456 \
    InstanceGroupType=CORE,InstanceType=m6g.xlarge,InstanceCount=1,CustomAmiId=ami-234567
```

You can add multiple custom AMIs to an instance group that you add to a running cluster. The `CustomAmiId` argument can be used with the `add-instance-groups` command as shown in the following example.

```
aws emr add-instance-groups --cluster-id j-123456 \
  --instance-groups \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=1,CustomAmiId=ami-123456
```

## Use the Java SDK to create an instance group


You instantiate an `InstanceGroupConfig` object that specifies the configuration of an instance group for a cluster. To use Spot Instances, you set the `withBidPrice` and `withMarket` properties on the `InstanceGroupConfig` object. The following code shows how to define primary, core, and task instance groups that run Spot Instances.

```
InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
	.withInstanceCount(1)
	.withInstanceRole("MASTER")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.25"); 
	
InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
	.withInstanceCount(4)
	.withInstanceRole("CORE")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.03");
	
InstanceGroupConfig instanceGroupConfigTask = new InstanceGroupConfig()
	.withInstanceCount(2)
	.withInstanceRole("TASK")
	.withInstanceType("m4.large")
	.withMarket("SPOT")
	.withBidPrice("0.10");
```

# Availability Zone flexibility for an Amazon EMR cluster
Availability Zone flexibility for launching instances

Each AWS Region has multiple, isolated locations known as Availability Zones. When you launch an instance, you can optionally specify an Availability Zone (AZ) in the AWS Region that you use. [Availability Zone flexibility](#emr-flexibility-az) is the distribution of instances across multiple AZs. If one instance fails, you can design your application so that an instance in another AZ can handle requests. For more information about Availability Zones, see [Regions and Zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) in the *Amazon EC2 User Guide*.

[Instance flexibility](#emr-flexibility-types) is the use of multiple instance types to satisfy capacity requirements. When you express flexibility with instances, you can use aggregate capacity across instance sizes, families, and generations. Greater flexibility improves the chance to find and allocate your required amount of compute capacity when compared with a cluster that uses a single instance type.

Instance and Availability Zone flexibility reduces [insufficient capacity errors (ICE)](emr-events-response-insuff-capacity.md) and Spot interruptions when compared to a cluster with a single instance type or AZ. Use the best practices covered here to determine which instances to diversify after you know the initial instance family and size. This approach maximizes availability to Amazon EC2 capacity pools with minimal performance and cost variance.

## Being flexible about Availability Zones
Availability Zone flexibility

We recommend that you configure all Availability Zones for use in your virtual private cloud (VPC) and that you select them for your EMR cluster. A cluster runs in a single Availability Zone, but with Amazon EMR instance fleets, you can select multiple subnets in different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options that you specify. When you provision an EMR cluster across multiple subnets, your cluster can access a deeper Amazon EC2 capacity pool than a cluster in a single subnet. 

If you must prioritize a certain number of Availability Zones for use in your virtual private cloud (VPC) for your EMR cluster, you can leverage the Spot placement score capability with Amazon EC2. With Spot placement scoring, you specify the compute requirements for your Spot Instances, then EC2 returns the top ten AWS Regions or Availability Zones scored on a scale from 1 to 10. A score of 10 indicates that your Spot request is highly likely to succeed; a score of 1 indicates that your Spot request is not likely to succeed. For more information on how to use Spot placement scoring, see [Spot placement score](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html) in the *Amazon EC2 User Guide*.
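If you script this check, the request for the EC2 `GetSpotPlacementScores` API can be assembled as a plain dictionary first, as the sketch below shows. The instance types and Region names are placeholders; the actual call would go through an EC2 client, for example boto3's `get_spot_placement_scores`.

```python
# Build a GetSpotPlacementScores request for the EC2 API. Parameter names
# follow the EC2 API; instance types and Regions here are placeholders.

def spot_score_request(instance_types, target_capacity, regions):
    return {
        "InstanceTypes": instance_types,       # candidate instance types
        "TargetCapacity": target_capacity,     # number of instances needed
        "SingleAvailabilityZone": True,        # score individual AZs, not Regions
        "RegionNames": regions,                # limit scoring to these Regions
    }

# Example (not executed here):
#   ec2 = boto3.client("ec2")
#   scores = ec2.get_spot_placement_scores(
#       **spot_score_request(["m5.xlarge", "m6g.xlarge"], 100, ["us-east-1"]))
```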

## Being flexible about instance types
Instance flexibility

Instance flexibility is the use of multiple instance types to satisfy capacity requirements. Instance flexibility benefits both Amazon EC2 Spot and On-Demand Instance usage. With Spot Instances, instance flexibility lets Amazon EC2 launch instances from deeper capacity pools using real-time capacity data. It also predicts which instances are most available. This offers fewer interruptions and can reduce the overall cost of a workload. With On-Demand Instances, instance flexibility reduces insufficient capacity errors (ICE) because the total capacity is provisioned across a greater number of instance pools.

For **Instance Group** clusters, you can specify up to 50 EC2 instance types. For **Instance Fleets** with allocation strategy, you can specify up to 30 EC2 instance types for each primary, core, and task node group. A broader range of instances improves the benefits of instance flexibility. 

### Expressing instance flexibility
Flexibility expression

Consider the following best practices to express instance flexibility for your application.

**Topics**
+ [Determine instance family and size](#emr-flexibility-express-size)
+ [Include additional instances](#emr-flexibility-express-include)

#### Determine instance family and size
Instance family and size

Amazon EMR supports several instance types for different use cases. These instance types are listed in the [Supported instance types with Amazon EMR](emr-supported-instance-types.md) documentation. Each instance type belongs to an instance family that describes what application the type is optimized for.

For new workloads, you should benchmark with instance types in the general purpose family, such as `m5`. Then, monitor the OS and YARN metrics from Ganglia and Amazon CloudWatch to determine system bottlenecks at peak load. Bottlenecks include CPU, memory, storage, and I/O operations. After you identify the bottlenecks, choose compute optimized, memory optimized, storage optimized, or another appropriate instance family for your instance types. For more details, see the [Determine right infrastructure for your Spark workloads](https://github.com/aws/aws-emr-best-practices/blob/main/website/docs/bestpractices/Applications/Spark/best_practices.md#bp-512-----determine-right-infrastructure-for-your-spark-workloads) page in the Amazon EMR best practices guide on GitHub. 

Next, identify the smallest YARN container or Spark executor that your application requires. This is the smallest instance size that fits the container and the minimum instance size for the cluster. Use this metric to determine instances that you can further diversify with. A smaller instance will allow for more instance flexibility.
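The sizing step above can be sketched as a small helper that finds the smallest instance able to host one executor. The candidate list and its vCPU/memory figures are illustrative, not authoritative instance specifications.

```python
# Pick the smallest instance that fits one executor -- a sketch of the
# sizing step above. Instance specs are illustrative only.
CANDIDATES = {
    "m5.xlarge":  {"vcpu": 4,  "mem_gib": 16},
    "m5.2xlarge": {"vcpu": 8,  "mem_gib": 32},
    "m5.4xlarge": {"vcpu": 16, "mem_gib": 64},
}

def min_instance(executor_vcpu, executor_mem_gib):
    """Return the smallest candidate that can host at least one executor."""
    fitting = [
        name for name, spec in CANDIDATES.items()
        if spec["vcpu"] >= executor_vcpu and spec["mem_gib"] >= executor_mem_gib
    ]
    return min(fitting, key=lambda name: CANDIDATES[name]["vcpu"])
```

For example, a 2 vCPU / 8 GiB executor fits on `m5.xlarge` in this candidate set, which then becomes the minimum instance size for the cluster.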

For maximum instance flexibility, you should leverage as many instances as possible. We recommend that you diversify with instances that have similar hardware specifications. This maximizes access to EC2 capacity pools with minimal cost and performance variance. Diversify across sizes. To do so, prioritize AWS Graviton and previous generations first. As a general rule, try to be flexible across at least 15 instance types for each workload. We recommend that you start with general purpose, compute optimized, or memory optimized instances. These instance types will provide the greatest flexibility. 

#### Include additional instances
Include additional instance types

For maximum diversity, include additional instance types. Prioritize instance size, Graviton, and generation flexibility first. This allows access to additional EC2 capacity pools with similar cost and performance profiles. If you need further flexibility due to ICE or spot interruptions, consider variant and family flexibility. Each approach has tradeoffs that depend on your use case and requirements. 
+ **Size flexibility** – First, diversify with instances of different sizes within the same family. Instances within the same family provide the same cost and performance, but can launch a different number of containers on each host. For example, if the minimum executor size that you need is 2 vCPU and 8 GB of memory, the minimum instance size is `m5.xlarge`. For size flexibility, include `m5.xlarge`, `m5.2xlarge`, `m5.4xlarge`, `m5.8xlarge`, `m5.12xlarge`, `m5.16xlarge`, and `m5.24xlarge`.
+ **Graviton flexibility** – In addition to size, you can diversify with Graviton instances. Graviton instances are powered by AWS Graviton2 processors that deliver the best price performance for cloud workloads in Amazon EC2. For example, with the minimum instance size of `m5.xlarge`, you can include `m6g.xlarge`, `m6g.2xlarge`, `m6g.4xlarge`, `m6g.8xlarge`, and `m6g.16xlarge` for Graviton flexibility.
+ **Generation flexibility** – Similar to Graviton and size flexibility, instances in previous generation families share the same hardware specifications. This results in a similar cost and performance profile with an increase in the total accessible Amazon EC2 pool. For generation flexibility, include `m4.xlarge`, `m4.2xlarge`, `m4.10xlarge`, and `m4.16xlarge`.
+ **Family and variant flexibility**
  + **Capacity** – To optimize for capacity, we recommend instance flexibility across instance families. Common instances from different instance families have deeper instance pools that can assist with meeting capacity requirements. However, instances from different families will have different vCPU to memory ratios. This results in under-utilization if the expected application container is sized for a different instance. For example, with `m5.xlarge`, include compute-optimized instances such as `c5` or memory-optimized instances such as `r5` for instance family flexibility.
  + **Cost** – To optimize for cost, we recommend instance flexibility across variants. These instances have the same memory and vCPU ratio as the initial instance. The tradeoff with variant flexibility is that these instances have smaller capacity pools which might result in limited additional capacity or higher Spot interruptions. With `m5.xlarge` for example, include AMD-based instances (`m5a`), SSD-based instances (`m5d`) or network-optimized instances (`m5n`) for instance variant flexibility.
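Putting size, Graviton, and generation flexibility together, the following sketch builds a task instance fleet configuration in the shape that the EMR API's `InstanceFleetConfig` expects. The instance types, weights, target capacity, and timeout values are illustrative assumptions, not a recommended template; because it lists more than five instance types, it also sets a Spot allocation strategy.

```python
# A sketch (under assumed instance types and weights) of a diversified
# Spot task instance fleet built around a 2 vCPU / 8 GB minimum
# executor, so the m5.xlarge base counts as one unit of capacity.

def diversified_task_fleet(target_spot_capacity):
    """Build an InstanceFleetConfig-style dict for the EMR API."""
    families = {
        "m5": ["xlarge", "2xlarge", "4xlarge", "8xlarge"],   # size flexibility
        "m6g": ["xlarge", "2xlarge", "4xlarge", "8xlarge"],  # Graviton flexibility
        "m4": ["xlarge", "2xlarge", "4xlarge", "10xlarge"],  # generation flexibility
    }
    vcpus = {"xlarge": 4, "2xlarge": 8, "4xlarge": 16, "8xlarge": 32, "10xlarge": 40}
    type_configs = [
        {
            "InstanceType": f"{family}.{size}",
            # WeightedCapacity in units of the 4-vCPU xlarge base.
            "WeightedCapacity": vcpus[size] // 4,
        }
        for family, sizes in families.items()
        for size in sizes
    ]
    return {
        "Name": "diversified-task-fleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": type_configs,
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 30,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }

fleet = diversified_task_fleet(target_spot_capacity=40)
print(len(fleet["InstanceTypeConfigs"]))  # 12 instance types across 3 families
```

Because the weights are expressed in minimum-executor units, Amazon EMR can fulfill the target capacity from any mix of these pools without over- or under-provisioning containers.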

# Configuring Amazon EMR cluster instance types and best practices for Spot Instances
Guidelines and best practices

Use the guidance in this section to help you determine the instance types, purchasing options, and amount of storage to provision for each node type in an EMR cluster.

## What instance type should you use?


There are several ways to add Amazon EC2 instances to a cluster. The method you should choose depends on whether you use the instance groups configuration or the instance fleets configuration for the cluster.
+ **Instance Groups**
  + Manually add instances of the same type to existing core and task instance groups.
  + Manually add a task instance group, which can use a different instance type.
  + Set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. For more information, see [Use Amazon EMR cluster scaling to adjust for changing workloads](emr-scale-on-demand.md).
+ **Instance Fleets**
  + Add a single task instance fleet.
  + Change the target capacity for On-Demand and Spot Instances for existing core and task instance fleets. For more information, see [Planning and configuring instance fleets for your Amazon EMR cluster](emr-instance-fleet.md).
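For instance fleets, changing the target capacities is done with the `modify-instance-fleet` CLI command or the corresponding EMR API action. The snippet below is a minimal sketch that only builds the modification payload; the cluster and fleet IDs are placeholders.

```python
# Build the --instance-fleet argument for `aws emr modify-instance-fleet`,
# which changes a running fleet's On-Demand and Spot target capacities.
# The IDs below are placeholders, not real resources.
import json

def resize_fleet_request(fleet_id, on_demand_units, spot_units):
    """Shape of the InstanceFleet modification payload."""
    return {
        "InstanceFleetId": fleet_id,
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
    }

# Example invocation:
#   aws emr modify-instance-fleet --cluster-id j-XXXXXXXX \
#       --instance-fleet '<the JSON printed below>'
print(json.dumps(resize_fleet_request("if-XXXXXXXX", 2, 8)))
```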

One way to plan the instances of your cluster is to run a test cluster with a representative sample set of data and monitor the utilization of the nodes in the cluster. For more information, see [View and monitor an Amazon EMR cluster as it performs work](emr-manage-view.md). Another way is to calculate the capacity of the instances you are considering and compare that value against the size of your data.

In general, the primary node type, which assigns tasks, doesn't require an EC2 instance with much processing power; Amazon EC2 instances for the core node type, which process tasks and store data in HDFS, need both processing power and storage capacity; Amazon EC2 instances for the task node type, which don't store data, need only processing power. For guidelines about available Amazon EC2 instances and their configuration, see [Configure Amazon EC2 instance types for use with Amazon EMR](emr-plan-ec2-instances.md).

 The following guidelines apply to most Amazon EMR clusters. 
+ There is a vCPU limit for the total number of On-Demand Amazon EC2 instances that you run on an AWS account per AWS Region. For more information about the vCPU limit and how to request a limit increase for your account, see [On-Demand Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html) in the *Amazon EC2 User Guide for Linux Instances*. 
+ The primary node does not typically have large computational requirements. For clusters with a large number of nodes, or for clusters with applications that are specifically deployed on the primary node (JupyterHub, Hue, etc.), a larger primary node may be required and can help improve cluster performance. For example, consider using an `m5.xlarge` instance for small clusters (50 or fewer nodes), and increasing to a larger instance type for larger clusters.
+ The computational needs of the core and task nodes depend on the type of processing your application performs. Many jobs can be run on general purpose instance types, which offer balanced performance in terms of CPU, disk space, and input/output. Computation-intensive clusters may benefit from running on High CPU instances, which have proportionally more CPU than RAM. Database and memory-caching applications may benefit from running on High Memory instances. Network-intensive and CPU-intensive applications like parsing, NLP, and machine learning may benefit from running on cluster compute instances, which provide proportionally high CPU resources and increased network performance.
+ If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your job flow's varying capacity requirements. 
+ The amount of data you can process depends on the capacity of your core nodes and the size of your data as input, during processing, and as output. The input, intermediate, and output datasets all reside on the cluster during processing. 

## When should you use Spot Instances?


When you launch a cluster in Amazon EMR, you can choose to launch primary, core, or task instances on Spot Instances. Because each node type plays a different role in the cluster, launching each on Spot Instances has different implications. You can't change an instance purchasing option while a cluster is running. To change between On-Demand and Spot Instances for the primary and core nodes, you must terminate the cluster and launch a new one. For task nodes, you can launch a new task instance group or instance fleet and remove the old one.

**Topics**
+ [Amazon EMR settings to prevent job failure because of task node Spot Instance termination](#emr-plan-spot-YARN)
+ [Primary node on a Spot Instance](#emr-dev-master-instance-group-spot)
+ [Core nodes on Spot Instances](#emr-dev-core-instance-group-spot)
+ [Task nodes on Spot Instances](#emr-dev-task-instance-group-spot)
+ [Instance configurations for application scenarios](#emr-plan-spot-scenarios)

### Amazon EMR settings to prevent job failure because of task node Spot Instance termination


Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch.) Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with the `CORE` label. Manually modifying related properties in the `yarn-site` and `capacity-scheduler` configuration classifications, or directly in associated XML files, could break or change this behavior.

Amazon EMR configures the following properties and values by default. Use caution when configuring these properties.

**Note**  
Beginning with the Amazon EMR 6.x release series, the YARN node labels feature is disabled by default. By default, application primary processes can run on both core and task nodes. You can enable the YARN node labels feature by configuring the following properties:   
`yarn.node-labels.enabled: true`
`yarn.node-labels.am.default-node-label-expression: 'CORE'`
+ **yarn-site (yarn-site.xml) On All Nodes**
  + `yarn.node-labels.enabled: true`
  + `yarn.node-labels.am.default-node-label-expression: 'CORE'`
  + `yarn.node-labels.fs-store.root-dir: '/apps/yarn/nodelabels'`
  + `yarn.node-labels.configuration-type: 'distributed'`
+ **yarn-site (yarn-site.xml) On Primary And Core Nodes**
  + `yarn.nodemanager.node-labels.provider: 'config'`
  + `yarn.nodemanager.node-labels.provider.configured-node-partition: 'CORE'`
+ **capacity-scheduler (capacity-scheduler.xml) On All Nodes**
  + `yarn.scheduler.capacity.root.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity: 100`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels: '*'`
  + `yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity: 100`
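To re-enable this behavior on an EMR 6.x cluster, you can supply the two properties from the note above in a `yarn-site` configuration classification at cluster creation. The following is a minimal sketch of that classification, serialized the way the `--configurations` argument of `aws emr create-cluster` expects it.

```python
# Sketch of a configuration classification that re-enables YARN node
# labels on Amazon EMR 6.x, using the property names listed above.
import json

yarn_node_labels_config = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.node-labels.enabled": "true",
            "yarn.node-labels.am.default-node-label-expression": "CORE",
        },
    }
]

# Pass as: aws emr create-cluster ... --configurations '<JSON below>'
print(json.dumps(yarn_node_labels_config, indent=2))
```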

### Primary node on a Spot Instance


The primary node controls and directs the cluster. When it terminates, the cluster ends, so you should only launch the primary node as a Spot Instance if you are running a cluster where sudden termination is acceptable. This might be the case if you are testing a new application, have a cluster that periodically persists data to an external store such as Amazon S3, or are running a cluster where cost is more important than ensuring the cluster's completion. 

When you launch the primary instance group as a Spot Instance, the cluster does not start until that Spot Instance request is fulfilled. This is something to consider when selecting your maximum Spot price.

You can only add a Spot Instance primary node when you launch the cluster. You can't add or remove primary nodes from a running cluster. 

Typically, you would only run the primary node as a Spot Instance if you are running the entire cluster (all instance groups) as Spot Instances. 

### Core nodes on Spot Instances


Core nodes process data and store information using HDFS. Terminating a core instance risks data loss. For this reason, you should only run core nodes on Spot Instances when partial HDFS data loss is tolerable.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. In other words, if you request six Amazon EC2 instances, and only five are available at or below your maximum Spot price, the instance group won't launch. Amazon EMR continues to wait until all six Amazon EC2 instances are available or until you terminate the cluster. You can change the number of Spot Instances in a core instance group to add capacity to a running cluster. For more information about working with instance groups, and how Spot Instances work with instance fleets, see [Create an Amazon EMR cluster with instance fleets or uniform instance groups](emr-instance-group-configuration.md).

### Task nodes on Spot Instances


The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot price has risen above your maximum Spot price, no data is lost and the effect on your cluster is minimal.

When you launch one or more task instance groups as Spot Instances, Amazon EMR provisions as many task nodes as it can, using your maximum Spot price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at or below your maximum Spot price, Amazon EMR launches the instance group with five nodes, adding the sixth later if possible. 

Launching task instance groups as Spot Instances is a strategic way to expand the capacity of your cluster while minimizing costs. If you launch your primary and core instance groups as On-Demand Instances, their capacity is guaranteed for the run of the cluster. You can add task instances to your task instance groups as needed, to handle peak traffic or speed up data processing. 

You can add or remove task nodes using the console, AWS CLI, or API. You can also add additional task groups, but you cannot remove a task group after it is created. 
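As a concrete illustration of adding a task group from the CLI, the following sketch builds the JSON for `aws emr add-instance-groups`. The instance type and count are assumptions; for a Spot group, omitting a bid price caps the maximum Spot price at the On-Demand price.

```python
# Hypothetical Spot task instance group to add to a running cluster.
import json

spot_task_group = {
    "InstanceGroupType": "TASK",
    "InstanceType": "m5.xlarge",
    "InstanceCount": 6,
    "Market": "SPOT",  # no BidPrice: max Spot price defaults to On-Demand
}

# Example invocation:
#   aws emr add-instance-groups --cluster-id j-XXXXXXXX \
#       --instance-groups '<the JSON printed below>'
print(json.dumps([spot_task_group]))
```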

### Instance configurations for application scenarios


The following table is a quick reference to node type purchasing options and configurations that are usually appropriate for various application scenarios. Choose the link to view more information about each scenario type.


| Application scenario | Primary node purchasing option | Core nodes purchasing option | Task nodes purchasing option | 
| --- | --- | --- | --- | 
| [Long-running clusters and data warehouses](#emr-dev-when-use-spot-data-warehouses) | On-Demand | On-Demand or instance-fleet mix | Spot or instance-fleet mix | 
| [Cost-driven workloads](#emr-dev-when-use-spot-cost-driven) | Spot | Spot | Spot | 
| [Data-critical workloads](#emr-dev-when-use-spot-data-critical) | On-Demand | On-Demand | Spot or instance-fleet mix | 
| [Application testing](#emr-dev-when-use-spot-application-testing) | Spot | Spot | Spot | 

 There are several scenarios in which Spot Instances are useful for running an Amazon EMR cluster. 

#### Long-running clusters and data warehouses


If you are running a persistent Amazon EMR cluster that has a predictable variation in computational capacity, such as a data warehouse, you can handle peak demand at lower cost with Spot Instances. You can launch your primary and core instance groups as On-Demand Instances to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements.

#### Cost-driven workloads


If you are running transient clusters for which lower cost is more important than the time to completion, and losing partial work is acceptable, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to benefit from the largest cost savings.

#### Data-critical workloads


If you are running a cluster for which lower cost is more important than time to completion, but losing partial work is not acceptable, launch the primary and core instance groups as On-Demand Instances and supplement with one or more task instance groups of Spot Instances. Running the primary and core instance groups as On-Demand Instances ensures that your data is persisted in HDFS and that the cluster is protected from termination due to Spot market fluctuations, while providing cost savings that accrue from running the task instance groups as Spot Instances.

#### Application testing


When you are testing a new application in order to prepare it for launch in a production environment, you can run the entire cluster (primary, core, and task instance groups) as Spot Instances to reduce your testing costs.

## Calculating the required HDFS capacity of a cluster


 The amount of HDFS storage available to your cluster depends on the following factors:
+ The number of Amazon EC2 instances used for core nodes.
+ The capacity of the Amazon EC2 instance store for the instance type used. For more information about instance store volumes, see [Amazon EC2 instance store](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) in the *Amazon EC2 User Guide*.
+ The number and size of Amazon EBS volumes attached to core nodes.
+ A replication factor, which accounts for how each data block is stored in HDFS for RAID-like redundancy. By default, the replication factor is three for a cluster of 10 or more core nodes, two for a cluster of 4-9 core nodes, and one for a cluster of three or fewer nodes.

To calculate the HDFS capacity of a cluster, for each core node, add the instance store volume capacity to the Amazon EBS storage capacity (if used). Multiply the result by the number of core nodes, and then divide the total by the replication factor based on the number of core nodes. For example, a cluster with 10 core nodes of type i2.xlarge, which have 800 GB of instance storage without any attached Amazon EBS volumes, has a total of approximately 2,666 GB available for HDFS (10 nodes x 800 GB ÷ 3 replication factor).
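The calculation above can be expressed as a small helper, using the default replication factors listed earlier (3 for 10 or more core nodes, 2 for 4–9, 1 for 3 or fewer):

```python
# Approximate HDFS capacity: per-node storage times node count,
# divided by the default replication factor for the cluster size.

def default_replication_factor(core_nodes):
    """Default HDFS replication factor that Amazon EMR selects."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes, instance_store_gb, ebs_gb=0):
    """Approximate usable HDFS capacity in GB."""
    per_node = instance_store_gb + ebs_gb
    return core_nodes * per_node / default_replication_factor(core_nodes)

# 10 i2.xlarge core nodes with 800 GB of instance storage each:
print(int(hdfs_capacity_gb(10, 800)))  # 2666
```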

 If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage in the following ways: 
+ Creating a cluster with additional Amazon EBS volumes or adding instance groups with attached Amazon EBS volumes to an existing cluster
+ Adding more core nodes
+ Choosing an Amazon EC2 instance type with greater storage capacity
+ Using data compression
+ Changing the Hadoop configuration settings to reduce the replication factor

Reducing the replication factor should be used with caution as it reduces the redundancy of HDFS data and the ability of the cluster to recover from lost or corrupted HDFS blocks. 