

# Instance storage options and behavior in Amazon EMR
<a name="emr-plan-storage"></a>

## Overview
<a name="emr-plan-storage-ebs-storage-overview"></a>

Instance store and Amazon EBS volume storage are used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications might "spill" to the local file system.

Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. Amazon EBS volumes attached to Amazon EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so you shouldn't expect data to persist. Although the data is ephemeral, data in HDFS might be replicated, depending on the number and specialization of nodes in the cluster. When you add Amazon EBS storage volumes, they are mounted as additional volumes. They are not part of the boot volume. YARN is configured to use all the additional volumes, but you are responsible for allocating the additional volumes as local storage (for local log files, for example).

## Considerations
<a name="emr-plan-storage-ebs-storage-considerations"></a>

Keep in mind these additional considerations when you use Amazon EBS with EMR clusters:
+ You can't snapshot an Amazon EBS volume and then restore it within Amazon EMR. To create reusable custom configurations, use a custom AMI (available in Amazon EMR version 5.7.0 and later). For more information, see [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md).
+ An encrypted Amazon EBS root device volume is supported only when using a custom AMI. For more information, see [Creating a custom AMI with an encrypted Amazon EBS root device volume](emr-custom-ami.md#emr-custom-ami-encrypted). 
+ If you apply tags using the Amazon EMR API, those operations are applied to EBS volumes.
+ There is a limit of 25 volumes per instance.
+ The Amazon EBS volumes on core nodes cannot be less than 5 GB.
+ Amazon EBS has a fixed limit of 2,500 EBS volumes per instance launch request. This limit also applies to Amazon EMR on EC2 clusters. We recommend that you launch clusters with a total number of EBS volumes within this limit, and then scale up the cluster manually or with Amazon EMR managed scaling as needed. To learn more about the EBS volume limit, see [Service quotas](https://docs.aws.amazon.com/general/latest/gr/ebs-service.html#limits_ebs:~:text=Amazon%20EBS%20has,exceeding%20the%20limit.).
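
The numeric limits in the list above can be checked before you launch a cluster. The following is a minimal sketch, not an official validation tool. `GroupPlan` and `check_ebs_plan` are hypothetical names introduced for illustration, and the sketch assumes that the per-instance limit counts only the additional volumes you specify and that the 2,500-volume limit is compared against the total number of volumes in the planned cluster.

```python
from dataclasses import dataclass

MAX_VOLUMES_PER_INSTANCE = 25   # per-instance limit from the list above
MIN_CORE_VOLUME_GB = 5          # minimum EBS volume size on core nodes
MAX_VOLUMES_PER_LAUNCH = 2500   # EBS volumes per instance launch request


@dataclass
class GroupPlan:
    """Hypothetical description of one instance group in a planned cluster."""
    role: str                   # "MASTER", "CORE", or "TASK"
    instance_count: int
    volumes_per_instance: int
    volume_size_gb: int


def check_ebs_plan(groups):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    total_volumes = 0
    for g in groups:
        total_volumes += g.instance_count * g.volumes_per_instance
        if g.volumes_per_instance > MAX_VOLUMES_PER_INSTANCE:
            problems.append(
                f"{g.role}: {g.volumes_per_instance} volumes per instance "
                f"exceeds the limit of {MAX_VOLUMES_PER_INSTANCE}"
            )
        if (g.role == "CORE" and g.volumes_per_instance > 0
                and g.volume_size_gb < MIN_CORE_VOLUME_GB):
            problems.append(
                f"CORE: volume size {g.volume_size_gb} GB is below "
                f"the minimum of {MIN_CORE_VOLUME_GB} GB"
            )
    # Assumption: the launch-request limit is applied to the whole plan.
    if total_volumes > MAX_VOLUMES_PER_LAUNCH:
        problems.append(
            f"plan needs {total_volumes} volumes, above the limit of "
            f"{MAX_VOLUMES_PER_LAUNCH} per launch request"
        )
    return problems
```

If the check reports that the total exceeds 2,500, one option consistent with the recommendation above is to launch with fewer instances and scale up afterward.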

## Default Amazon EBS storage for instances
<a name="emr-plan-storage-ebs-storage-default"></a>

For EC2 instances that have EBS-only storage, Amazon EMR allocates Amazon EBS gp2 or gp3 storage volumes to instances. When you create a cluster with Amazon EMR releases 5.22.0 and higher, the default amount of Amazon EBS storage increases relative to the size of the instance.

We split any increased storage across multiple volumes. This gives increased IOPS performance and, in turn, increased performance for some standardized workloads. If you want to use a different Amazon EBS instance storage configuration, you can specify this when you create an EMR cluster or add nodes to an existing cluster. You can use Amazon EBS gp2 or gp3 volumes as root volumes, and add gp2 or gp3 volumes as additional volumes. For more information, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

The following table identifies the default number of Amazon EBS gp2 storage volumes, sizes, and total sizes per instance type. For information about gp2 volumes compared to gp3, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md).


**Default Amazon EBS gp2 storage volumes and size by instance type for Amazon EMR 5.22.0 and higher**  

| Instance size | Number of volumes | Volume size (GiB) | Total size (GiB) | 
| --- | --- | --- | --- | 
|  \*.large  |  1  |  32  |  32  | 
|  \*.xlarge  |  2  |  32  |  64  | 
|  \*.2xlarge  |  4  |  32  |  128  | 
|  \*.4xlarge  |  4  |  64  |  256  | 
|  \*.8xlarge  |  4  |  128  |  512  | 
|  \*.9xlarge  |  4  |  144  |  576  | 
|  \*.10xlarge  |  4  |  160  |  640  | 
|  \*.12xlarge  |  4  |  192  |  768  | 
|  \*.16xlarge  |  4  |  256  |  1024  | 
|  \*.18xlarge  |  4  |  288  |  1152  | 
|  \*.24xlarge  |  4  |  384  |  1536  | 
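
For capacity planning scripts, the table above can be expressed as a simple lookup. This is a sketch that transcribes the table values. The names `DEFAULT_GP2_LAYOUT` and `default_gp2_storage` are hypothetical, and the result only applies to instance types with EBS-only storage on Amazon EMR 5.22.0 and higher.

```python
# Size suffix -> (number of volumes, GiB per volume), transcribed from the table.
DEFAULT_GP2_LAYOUT = {
    "large": (1, 32),
    "xlarge": (2, 32),
    "2xlarge": (4, 32),
    "4xlarge": (4, 64),
    "8xlarge": (4, 128),
    "9xlarge": (4, 144),
    "10xlarge": (4, 160),
    "12xlarge": (4, 192),
    "16xlarge": (4, 256),
    "18xlarge": (4, 288),
    "24xlarge": (4, 384),
}


def default_gp2_storage(instance_type):
    """Return (volume_count, volume_gib, total_gib), or None if the size is not listed."""
    # "m5.4xlarge" -> "4xlarge"; the instance family is ignored.
    _, _, size = instance_type.rpartition(".")
    layout = DEFAULT_GP2_LAYOUT.get(size)
    if layout is None:
        return None
    count, gib = layout
    return count, gib, count * gib
```

For example, `default_gp2_storage("m5.4xlarge")` returns `(4, 64, 256)`, which matches the \*.4xlarge row.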

## Default Amazon EBS root volume for instances
<a name="emr-plan-storage-ebs-root-volume"></a>

With Amazon EMR releases 6.15 and higher, Amazon EMR automatically attaches an Amazon EBS General Purpose SSD (gp3) as the root device for its AMIs to enhance performance. With earlier releases, Amazon EMR attaches EBS General Purpose SSD (gp2) as the root device.


|  | 6.15 and higher | 6.14 and lower | 
| --- | --- | --- | 
| Default root volume type | gp3 | gp2 | 
| Default size |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  | 
| Default IOPS |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 
| Default throughput |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)  |   | 
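
The release cutoff described above can be captured in a small helper. This is a sketch that assumes release labels follow the `emr-x.y.z` pattern. `default_root_volume_type` is a hypothetical name, not part of any AWS SDK.

```python
import re


def default_root_volume_type(release_label):
    """Map a release label such as 'emr-6.15.0' to the default root volume type."""
    match = re.match(r"(?:emr-)?(\d+)\.(\d+)", release_label)
    if match is None:
        raise ValueError(f"unrecognized release label: {release_label!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Releases 6.15 and higher default to gp3; earlier releases default to gp2.
    return "gp3" if (major, minor) >= (6, 15) else "gp2"
```

Comparing `(major, minor)` as a tuple avoids the string-comparison trap where `"6.9"` would sort after `"6.15"`.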

For information on how to customize the Amazon EBS root device volume, see [Specifying additional EBS storage volumes](#emr-plan-storage-additional-ebs-volumes).

## Specifying additional EBS storage volumes
<a name="emr-plan-storage-additional-ebs-volumes"></a>

When you configure instance types in Amazon EMR, you can specify additional EBS volumes to add capacity beyond the instance store (if present) and the default EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price, so you can tailor your storage to the analytic and business needs of your applications. For example, some applications might need to spill to disk while others can safely work in-memory or with Amazon S3.
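
As an illustration of specifying additional volumes programmatically, the following sketch builds one instance group entry with gp3 volumes. It assumes the field names of the Amazon EMR `RunJobFlow` instance group structure (`EbsConfiguration`, `EbsBlockDeviceConfigs`, `VolumeSpecification`, `VolumesPerInstance`); verify them against the API reference before you rely on them. The function name and its default values are hypothetical.

```python
def core_group_with_gp3(instance_type, instance_count, size_gb,
                        volumes_per_instance=1, iops=3000, throughput=125):
    """Build one core instance group entry with additional gp3 volumes."""
    return {
        "Name": "Core nodes",
        "InstanceRole": "CORE",
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {
                        "VolumeType": "gp3",
                        "SizeInGB": size_gb,
                        "Iops": iops,              # assumed gp3 baseline default
                        "Throughput": throughput,  # MiB/s, assumed gp3 baseline default
                    },
                    # Each instance in the group gets this many volumes.
                    "VolumesPerInstance": volumes_per_instance,
                }
            ],
        },
    }
```

A dictionary like this would typically be passed in the `InstanceGroups` list of the `Instances` argument when you create the cluster, for example through the boto3 `run_job_flow` call.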

You can only attach Amazon EBS volumes to instances at cluster startup time and when you add an extra task node instance group. If an instance in an Amazon EMR cluster fails, then both the instance and attached Amazon EBS volumes are replaced with new volumes. Consequently, if you manually detach an Amazon EBS volume, Amazon EMR treats that as a failure and replaces both instance storage (if applicable) and the volume stores.

Amazon EMR doesn’t allow you to modify your volume type from gp2 to gp3 for an existing EMR cluster. To use gp3 for your workloads, launch a new EMR cluster. In addition, we don't recommend that you update the throughput and IOPS on a cluster that is in use or that is being provisioned, because Amazon EMR uses the throughput and IOPS values you specify at cluster launch time for any new instance that it adds during cluster scale-up. For more information, see [Comparing Amazon EBS volume types gp2 and gp3](emr-plan-storage-compare-volume-types.md) and [Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types](emr-plan-storage-gp3-migration-selection.md).

**Important**  
To use a gp3 volume with your EMR cluster, you must launch a new cluster.

# Comparing Amazon EBS volume types gp2 and gp3
<a name="emr-plan-storage-compare-volume-types"></a>

Here is a comparison of the specifications and cost of gp2 and gp3 volumes in the US East (N. Virginia) Region. For the most up-to-date information, see the [Amazon EBS General Purpose Volumes](https://aws.amazon.com/ebs/general-purpose/) product page and the [Amazon EBS Pricing Page](https://aws.amazon.com/ebs/pricing/).


| Volume type | gp3 | gp2 | 
| --- | --- | --- | 
| Volume size | 1 GiB – 16 TiB | 1 GiB – 16 TiB | 
| Default/Baseline IOPS | 3000 | 3 IOPS/GiB (minimum 100 IOPS) to a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst up to 3,000 IOPS. | 
| Max IOPS/volume | 16,000 | 16,000 | 
| Default/Baseline throughput | 125 MiB/s | Throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size. | 
| Max throughput/volume | 1,000 MiB/s | 250 MiB/s | 
| Price | \$0.08/GiB-month; 3,000 IOPS free and \$0.005/provisioned IOPS-month over 3,000; 125 MiB/s free and \$0.04/provisioned MiB/s-month over 125 MiB/s | \$0.10/GiB-month | 
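
To see how the price row plays out for a given volume, the following sketch computes an approximate monthly cost per volume. The constants are assumptions: US East (N. Virginia) prices of 0.08 USD per GiB-month for gp3, 0.005 USD per provisioned IOPS-month over 3,000, 0.04 USD per provisioned MiB/s-month over 125 MiB/s, and 0.10 USD per GiB-month for gp2. Check the pricing page for current values. The function names are hypothetical.

```python
# Assumed US East (N. Virginia) prices in USD; confirm on the pricing page.
GP2_PER_GIB_MONTH = 0.10
GP3_PER_GIB_MONTH = 0.08
GP3_PER_EXTRA_IOPS_MONTH = 0.005
GP3_PER_EXTRA_MIBPS_MONTH = 0.04
GP3_FREE_IOPS = 3000
GP3_FREE_MIBPS = 125


def gp2_monthly_cost(size_gib):
    """gp2 is priced on size only."""
    return size_gib * GP2_PER_GIB_MONTH


def gp3_monthly_cost(size_gib, iops=GP3_FREE_IOPS, throughput_mibps=GP3_FREE_MIBPS):
    """gp3 is priced on size plus any IOPS and throughput above the free baseline."""
    extra_iops = max(0, iops - GP3_FREE_IOPS)
    extra_mibps = max(0, throughput_mibps - GP3_FREE_MIBPS)
    return (size_gib * GP3_PER_GIB_MONTH
            + extra_iops * GP3_PER_EXTRA_IOPS_MONTH
            + extra_mibps * GP3_PER_EXTRA_MIBPS_MONTH)
```

Under these assumed prices, a 500 GiB volume costs about 50 USD per month on gp2 and about 40 USD on gp3 at baseline performance. Provisioning that gp3 volume at 4,000 IOPS and 250 MiB/s adds about 5 USD for IOPS and 5 USD for throughput, which brings it level with gp2.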

# Selecting IOPS and throughput when migrating to gp3 Amazon EBS volume types
<a name="emr-plan-storage-gp3-migration-selection"></a>

When you provision a gp2 volume, you must choose the volume size based on the IOPS and throughput you need, because both scale in proportion to size. With gp3, you don't have to provision a bigger volume to get higher performance. You can choose the size and performance you want according to application need. Selecting the right size and the right performance parameters (IOPS and throughput) can give you the greatest cost reduction without affecting performance.

Here is a table to help you select gp3 configuration options:


| Volume size | IOPS | Throughput | 
| --- | --- | --- | 
| 1–170 GiB | 3000 | 125 MiB/s | 
| 170–334 GiB | 3000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise use a higher value based on usage, up to a maximum of 250 MiB/s\*. | 
| 334–1000 GiB | 3000 | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise use a higher value based on usage, up to a maximum of 250 MiB/s\*. | 
| 1000+ GiB | Match gp2 IOPS (size in GiB × 3) or the maximum IOPS driven by the current gp2 volume | 125 MiB/s if the chosen EC2 instance type supports 125 MiB/s or less; otherwise use a higher value based on usage, up to a maximum of 250 MiB/s\*. | 

\*gp3 can provide throughput of up to 2,000 MiB/s. Because gp2 provides a maximum of 250 MiB/s throughput, you might not need to go beyond this limit when you use gp3. gp3 volumes deliver a consistent baseline throughput of 125 MiB/s, which is included in the price of storage. You can provision additional throughput (up to a maximum of 2,000 MiB/s) for an additional cost at a ratio of 0.25 MiB/s per provisioned IOPS. Maximum throughput can be provisioned at 8,000 IOPS or higher and 16 GiB or larger (8,000 IOPS × 0.25 MiB/s per IOPS = 2,000 MiB/s).
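
The IOPS column of the table and the throughput ratio in the footnote reduce to a few lines of arithmetic. The following sketch uses hypothetical function names. It takes the gp2 baseline rule from the comparison table (3 IOPS per GiB, minimum 100, maximum 16,000) and the ratio and ceiling from the footnote. The ceiling is left as a parameter in case the limit that applies to you differs.

```python
def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, with a floor of 100 and a cap of 16,000."""
    return max(100, min(16000, 3 * size_gib))


def suggest_gp3_iops(size_gib):
    """3,000 IOPS up to 1,000 GiB; for larger volumes, match the gp2 baseline."""
    return 3000 if size_gib <= 1000 else gp2_baseline_iops(size_gib)


def max_gp3_throughput(iops, ratio=0.25, ceiling=2000):
    """Highest throughput (MiB/s) that can be provisioned for a given IOPS value."""
    return min(ceiling, iops * ratio)
```

For example, a 2,000 GiB gp2 volume has a baseline of 6,000 IOPS, so `suggest_gp3_iops(2000)` returns 6000. At the gp3 default of 3,000 IOPS, `max_gp3_throughput(3000)` gives 750 MiB/s, which is well above the 250 MiB/s that gp2 can deliver.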