

# What is Amazon EMR?
<a name="emr-what-is-emr"></a>

Amazon EMR, which was previously called Amazon Elastic MapReduce, is a managed cluster platform that simplifies running big data frameworks, such as [Apache Hadoop](https://aws.amazon.com/elasticmapreduce/details/hadoop) and [Apache Spark](https://aws.amazon.com/elasticmapreduce/details/spark), on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. 

If you are a first-time user of Amazon EMR, we recommend that you begin by reading the following, in addition to this section:
+ [Amazon EMR](https://aws.amazon.com/elasticmapreduce/) – This service page provides Amazon EMR highlights, product details, and pricing information.
+ [Tutorial: Getting started with Amazon EMR](emr-gs.md) – This tutorial gets you started using Amazon EMR quickly.

**Topics**
+ [Understanding how to create and work with Amazon EMR clusters](emr-overview.md)
+ [Benefits of using Amazon EMR](emr-overview-benefits.md)
+ [Amazon EMR architecture and service layers](emr-overview-arch.md)

# Understanding how to create and work with Amazon EMR clusters
<a name="emr-overview"></a>

This topic provides an overview of Amazon EMR clusters, including how to submit work to a cluster, how that work is processed, and the various states that the cluster goes through during processing. 

**Topics**
+ [Getting familiar with clusters and nodes](#emr-overview-clusters)
+ [Submitting work to a cluster](#emr-work-cluster)
+ [Processing data](#emr-overview-data-processing)
+ [Understanding the cluster lifecycle](#emr-overview-cluster-lifecycle)

## Getting familiar with clusters and nodes
<a name="emr-overview-clusters"></a>

The central component of Amazon EMR is the *cluster*. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a *node*. Each node has a role within the cluster, referred to as the *node type*. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.

 The node types in Amazon EMR are as follows: 
+ **Primary node**: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The primary node tracks the status of tasks and monitors the health of the cluster. Every cluster has a primary node, and it's possible to create a single-node cluster with only the primary node.
+ **Core node**: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
+ **Task node**: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
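These node types map directly to the instance groups that you define when you create a cluster programmatically. As a minimal sketch, assuming the shape of the boto3 `run_job_flow` request (the instance types and counts below are placeholder values, not recommendations), the instance-group portion might look like this:

```python
# Sketch of the instance-group layout in a boto3 run_job_flow request.
# Instance types and counts are placeholder values, not recommendations.
instance_groups = [
    {
        "Name": "Primary",
        "InstanceRole": "MASTER",   # the API uses MASTER for the primary node
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,         # every cluster has exactly one primary node
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",     # runs tasks and stores HDFS data
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",     # runs tasks only; stores no HDFS data
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,         # the task group is optional
    },
]
```

Dropping the core and task groups, leaving a single `MASTER` entry with an `InstanceCount` of 1, describes the single-node cluster mentioned above.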

## Submitting work to a cluster
<a name="emr-work-cluster"></a>

When you run a cluster on Amazon EMR, you can specify the work for the cluster to do in several ways: 
+ Provide the entire definition of the work as steps that you specify when you create the cluster. This approach is typical for clusters that process a set amount of data and then terminate when processing is complete. 
+ Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more jobs. For more information, see [Submit work to an Amazon EMR cluster](emr-work-with-steps.md). 
+ Create a cluster, connect to the primary node and other nodes as required using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively. For more information, see the [Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/). 
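For the console, API, and CLI options above, a step is a small, self-describing structure. The following sketch builds one step definition of the kind passed to the boto3 `add_job_flow_steps` call. The S3 script path is a hypothetical placeholder, and `command-runner.jar` is the generic launcher that Amazon EMR provides for commands such as `spark-submit`:

```python
# Sketch of one step definition, as passed to add_job_flow_steps.
# The S3 script path is a hypothetical placeholder.
step = {
    "Name": "Example Spark job",
    "ActionOnFailure": "CONTINUE",    # or CANCEL_AND_WAIT, TERMINATE_CLUSTER
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command launcher
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://amzn-s3-demo-bucket/scripts/process.py",
        ],
    },
}
```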

## Processing data
<a name="emr-overview-data-processing"></a>

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or queries directly to installed applications, or you can run *steps* in the cluster.

### Submitting jobs directly to applications
<a name="emr-overview-submitting-jobs"></a>

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster. To do this, you typically connect to the primary node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see [Connect to an Amazon EMR cluster](emr-connect-master-node.md).

### Running steps to process data
<a name="emr-overview-steps"></a>

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

 The following is an example process using four steps: 

1. Submit an input dataset for processing.

1. Process the output of the first step by using a Pig program.

1. Process a second input dataset by using a Hive program.

1. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

 Steps are run in the following sequence: 

1. A request is submitted to begin processing steps.

1. The state of all steps is set to **PENDING**.

1. When the first step in the sequence starts, its state changes to **RUNNING**. The other steps remain in the **PENDING** state.

1. After the first step completes, its state changes to **COMPLETED**.

1. The next step in the sequence starts, and its state changes to **RUNNING**. When it completes, its state changes to **COMPLETED**.

1. This pattern repeats for each step until they all complete and processing ends.
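The sequence above amounts to running steps one at a time, each moving from **PENDING** through **RUNNING** to **COMPLETED**. A small simulation of the happy path (the state names match the ones in this section; the step names are made up):

```python
def run_steps(step_names):
    """Simulate the default step lifecycle: PENDING -> RUNNING -> COMPLETED."""
    states = {name: "PENDING" for name in step_names}  # all steps start as PENDING
    for name in step_names:                            # steps run one at a time, in order
        states[name] = "RUNNING"
        # ... the step's actual data processing would happen here ...
        states[name] = "COMPLETED"
    return states

final = run_steps(["ingest", "pig-transform", "hive-transform", "write-output"])
# every step finishes in the COMPLETED state
```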

The following diagram represents the step sequence and change of state for the steps as they are processed. 

![\[Sequence diagram for Amazon EMR showing the different cluster step states.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/step-sequence.png)


If a step fails during processing, its state changes to **FAILED**. You can determine what happens next for each step. By default, if a step fails, any remaining steps in the sequence are set to **CANCELLED** and do not run. You can also choose to ignore the failure and allow remaining steps to proceed, or to terminate the cluster immediately.
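The default failure behavior can be sketched the same way: once a step fails, the steps after it are cancelled instead of run. This simulation uses made-up step names and models only the default cancel-remaining policy:

```python
def run_steps_with_failure(step_names, failing_step):
    """Simulate the default policy: after a FAILED step, remaining steps are CANCELLED."""
    states = {name: "PENDING" for name in step_names}
    failed = False
    for name in step_names:
        if failed:
            states[name] = "CANCELLED"  # default: steps after a failure never run
        elif name == failing_step:
            states[name] = "FAILED"
            failed = True
        else:
            states[name] = "COMPLETED"
    return states

final = run_steps_with_failure(["ingest", "transform", "write-output"], "transform")
# final == {"ingest": "COMPLETED", "transform": "FAILED", "write-output": "CANCELLED"}
```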

The following diagram represents the step sequence and default change of state when a step fails during processing. 

![\[Sequence diagram for Amazon EMR showing what happens to subsequent steps when a preceding cluster step fails.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/step-sequence-failed.png)


## Understanding the cluster lifecycle
<a name="emr-overview-cluster-lifecycle"></a>

 A successful Amazon EMR cluster follows this process: 

1. Amazon EMR first provisions the EC2 instances in the cluster according to your specifications. For more information, see [Configure Amazon EMR cluster hardware and networking](emr-plan-instances.md). For all instances, Amazon EMR uses the default AMI for Amazon EMR or a custom Amazon Linux AMI that you specify. For more information, see [Using a custom AMI to provide more flexibility for Amazon EMR cluster configuration](emr-custom-ami.md). During this phase, the cluster state is `STARTING`.

1. Amazon EMR runs *bootstrap actions* that you specify on each instance. You can use bootstrap actions to install custom applications and perform customizations that you require. For more information, see [Create bootstrap actions to install additional software with an Amazon EMR cluster](emr-plan-bootstrap.md). During this phase, the cluster state is `BOOTSTRAPPING`. 

1. Amazon EMR installs the native applications that you specify when you create the cluster, such as Hive, Hadoop, Spark, and so on.

1. After bootstrap actions are successfully completed and native applications are installed, the cluster state is `RUNNING`. At this point, you can connect to cluster instances, and the cluster sequentially runs any steps that you specified when you created the cluster. You can submit additional steps, which run after any previous steps complete. For more information, see [Submit work to an Amazon EMR cluster](emr-work-with-steps.md). 

1. After steps run successfully, the cluster goes into a `WAITING` state. If a cluster is configured to auto-terminate after the last step is complete, it goes into a `TERMINATING` state and then into the `TERMINATED` state. If the cluster is configured to wait, you must manually shut it down when you no longer need it. After you manually shut down the cluster, it goes into the `TERMINATING` state and then into the `TERMINATED` state.

A failure during the cluster lifecycle causes Amazon EMR to terminate the cluster and all of its instances unless you enable termination protection. If a cluster terminates because of a failure, any data stored on the cluster is deleted, and the cluster state is set to `TERMINATED_WITH_ERRORS`. If you enabled termination protection, you can retrieve data from your cluster, and then remove termination protection and terminate the cluster. For more information, see [Using termination protection to protect your Amazon EMR clusters from accidental shut down](UsingEMR_TerminationProtection.md). 
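Condensed to its states, a successful run is a simple linear sequence. The list below is a sketch of that sequence, not the service's complete state model (for example, it omits the failure path to `TERMINATED_WITH_ERRORS` described above):

```python
# The ordered states of a successful cluster run, as described above.
SUCCESSFUL_LIFECYCLE = [
    "STARTING",       # EC2 instances are being provisioned
    "BOOTSTRAPPING",  # bootstrap actions run on each instance
    "RUNNING",        # applications are installed and steps run sequentially
    "WAITING",        # all submitted steps have completed
    "TERMINATING",    # auto-termination began, or you shut the cluster down
    "TERMINATED",     # all instances are gone; data stored on the cluster is deleted
]
```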

The following diagram represents the lifecycle of a cluster, and how each stage of the lifecycle maps to a particular cluster state. 

![\[Diagram for Amazon EMR showing the cluster lifecycle, and how each stage of the lifecycle maps to a particular cluster state.\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/emr-cluster-lifecycle.png)


# Benefits of using Amazon EMR
<a name="emr-overview-benefits"></a>

There are many benefits to using Amazon EMR. These include the flexibility offered through AWS and the cost savings available versus building your own on-premises resources. This section provides an overview of these benefits and links to additional information to help you explore further.

**Topics**
+ [Cost savings](#emr-benefits-cost)
+ [AWS integration](#emr-benefits-integration)
+ [Deployment](#emr-benefits-deployment)
+ [Scalability and flexibility](#emr-benefits-scalability)
+ [Reliability](#emr-benefits-reliability)
+ [Security](#emr-benefits-security)
+ [Monitoring](#emr-benefits-monitoring)
+ [Management interfaces](#emr-what-tools)

## Cost savings
<a name="emr-benefits-cost"></a>

Amazon EMR pricing depends on the instance type and number of Amazon EC2 instances that you deploy and the Region in which you launch your cluster. On-Demand pricing offers low rates, but you can reduce the cost even further by purchasing Reserved Instances or Spot Instances. Spot Instances can offer significant savings, in some cases as low as a tenth of the On-Demand price.

**Note**  
If you use Amazon S3, Amazon Kinesis, or DynamoDB with your EMR cluster, there are additional charges for those services that are billed separately from your Amazon EMR usage.

**Note**  
When you set up an Amazon EMR cluster in a private subnet, we recommend that you also set up [VPC endpoints for Amazon S3](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html). If your EMR cluster is in a private subnet without VPC endpoints for Amazon S3, you will incur additional NAT gateway charges that are associated with S3 traffic because the traffic between your EMR cluster and S3 will not stay within your VPC.

For more information about pricing options and details, see [Amazon EMR pricing](https://aws.amazon.com/elasticmapreduce/pricing/).
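To see how the pricing dimensions combine, the sketch below estimates the cost of a short cluster run. Every rate in it is a hypothetical placeholder; check the Amazon EMR pricing page for real numbers:

```python
# All rates below are hypothetical placeholders, not published prices.
EC2_ON_DEMAND_RATE = 0.20  # USD per instance-hour (placeholder)
EC2_SPOT_RATE = 0.06       # USD per instance-hour (placeholder; Spot prices vary)
EMR_RATE = 0.05            # Amazon EMR fee per instance-hour (placeholder)

def cluster_cost(instances, hours, ec2_rate, emr_rate=EMR_RATE):
    """EC2 charge plus the per-instance-hour Amazon EMR fee for one cluster run."""
    return instances * hours * (ec2_rate + emr_rate)

on_demand = cluster_cost(instances=10, hours=4, ec2_rate=EC2_ON_DEMAND_RATE)  # 10.0
spot = cluster_cost(instances=10, hours=4, ec2_rate=EC2_SPOT_RATE)            # about 4.4
```

Charges for separately billed services, such as Amazon S3, DynamoDB, or a NAT gateway, would come on top of this.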

## AWS integration
<a name="emr-benefits-integration"></a>

Amazon EMR integrates with other AWS services to provide capabilities and functionality related to networking, storage, security, and so on, for your cluster. The following list provides several examples of this integration:
+ Amazon EC2 for the instances that comprise the nodes in the cluster
+ Amazon Virtual Private Cloud (Amazon VPC) to configure the virtual network in which you launch your instances
+ Amazon S3 to store input and output data
+ Amazon CloudWatch to monitor cluster performance and configure alarms
+ AWS Identity and Access Management (IAM) to configure permissions
+ AWS CloudTrail to audit requests made to the service
+ AWS Data Pipeline to schedule and start your clusters
+ AWS Lake Formation to discover, catalog, and secure data in an Amazon S3 data lake

## Deployment
<a name="emr-benefits-deployment"></a>

Your EMR cluster consists of EC2 instances, which perform the work that you submit to your cluster. When you launch your cluster, Amazon EMR configures the instances with the applications that you choose, such as Apache Hadoop or Spark. Choose the instance size and type that best suits the processing needs for your cluster: batch processing, low-latency queries, streaming data, or large data storage. For more information about the instance types available for Amazon EMR, see [Configure Amazon EMR cluster hardware and networking](emr-plan-instances.md).

Amazon EMR offers a variety of ways to configure software on your cluster. For example, you can install an Amazon EMR release with a chosen set of applications that can include versatile frameworks, such as Hadoop, and applications, such as Hive, Pig, or Spark. You can also install one of several MapR distributions. Amazon EMR uses Amazon Linux, so you can also install software on your cluster manually using the yum package manager or from the source. For more information, see [Configure applications when you launch your Amazon EMR cluster](emr-plan-software.md).
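One common way to configure that software at launch is through configuration classifications. The sketch below shows their general shape as passed at cluster creation; the classification and property chosen here are illustrative examples, not required settings:

```python
# Sketch of the Configurations structure accepted at cluster creation.
# The classification and property below are illustrative examples only.
configurations = [
    {
        "Classification": "spark-defaults",  # targets Spark's spark-defaults.conf
        "Properties": {
            "spark.executor.memory": "4g",   # placeholder value
        },
    },
]
```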

## Scalability and flexibility
<a name="emr-benefits-scalability"></a>

Amazon EMR provides flexibility to scale your cluster up or down as your computing needs change. You can resize your cluster to add instances for peak workloads and remove instances to control costs when peak workloads subside. For more information, see [Manually resize a running Amazon EMR cluster](emr-manage-resize.md).

 Amazon EMR also provides the option to run multiple instance groups so that you can use On-Demand Instances in one group for guaranteed processing power together with Spot Instances in another group to have your jobs completed faster and at lower costs. You can also mix different instance types to take advantage of better pricing for one Spot Instance type over another. For more information, see [When should you use Spot Instances?](emr-plan-instances-guidelines.md#emr-plan-spot-instances). 

Additionally, Amazon EMR provides the flexibility to use several file systems for your input, output, and intermediate data. For example, you might choose the Hadoop Distributed File System (HDFS), which runs on the primary and core nodes of your cluster, for processing data that you do not need to store beyond your cluster's lifecycle. You might choose the EMR File System (EMRFS) to use Amazon S3 as a data layer for applications running on your cluster so that you can separate your compute and storage, and persist data outside of the lifecycle of your cluster. EMRFS provides the added benefit of allowing you to scale your compute and storage needs independently: you scale compute by resizing your cluster, and you scale storage by using Amazon S3. For more information, see [Working with storage and file systems with Amazon EMR](emr-plan-file-systems.md).

## Reliability
<a name="emr-benefits-reliability"></a>

Amazon EMR monitors nodes in your cluster and automatically terminates and replaces an instance in case of failure.

Amazon EMR provides configuration options that control whether your cluster is terminated automatically or manually. If you configure your cluster to terminate automatically, it shuts down after all of its steps complete. This is referred to as a *transient cluster*. However, you can configure the cluster to continue running after processing completes so that you can terminate it manually when you no longer need it. Or, you can create a cluster, interact with the installed applications directly, and then manually terminate the cluster when you no longer need it. The clusters in these examples are referred to as *long-running clusters*. 

Additionally, you can configure termination protection to prevent instances in your cluster from being terminated due to errors or issues during processing. When termination protection is enabled, you can recover data from instances before termination. The default settings for these options differ depending on whether you launch your cluster by using the console, CLI, or API. For more information, see [Using termination protection to protect your Amazon EMR clusters from accidental shut down](UsingEMR_TerminationProtection.md).

## Security
<a name="emr-benefits-security"></a>

Amazon EMR leverages other AWS services, such as IAM and Amazon VPC, and features such as Amazon EC2 key pairs, to help you secure your clusters and data.

### IAM
<a name="emr-benefits-iam"></a>

Amazon EMR integrates with IAM to manage permissions. You define permissions using IAM policies, which you attach to IAM users or groups. The permissions that you define in the policy determine the actions that those users or group members can perform and the resources that they can access. For more information, see [How Amazon EMR works with IAM](security_iam_service-with-iam.md).

Additionally, Amazon EMR uses IAM roles for the Amazon EMR service itself and the EC2 instance profile for the instances. These roles grant permissions for the service and instances to access other AWS services on your behalf. There is a default role for the Amazon EMR service and a default role for the EC2 instance profile. The default roles use AWS managed policies, which are created for you automatically the first time you launch an EMR cluster from the console and choose default permissions. You can also create the default IAM roles from the AWS CLI. If you want to manage the permissions instead of AWS, you can choose custom roles for the service and instance profile. For more information, see [Configure IAM service roles for Amazon EMR permissions to AWS services and resources](emr-iam-roles.md).

### Security groups
<a name="emr-benefits-security-groups"></a>

Amazon EMR uses security groups to control inbound and outbound traffic to your EC2 instances. When you launch your cluster, Amazon EMR uses a security group for your primary instance and a security group to be shared by your core/task instances. Amazon EMR configures the security group rules to ensure communication among the instances in the cluster. Optionally, you can configure additional security groups and assign them to your primary and core/task instances for more advanced rules. For more information, see [Control network traffic with security groups for your Amazon EMR cluster](emr-security-groups.md).

### Encryption
<a name="emr-benefits-encryption"></a>

Amazon EMR supports optional Amazon S3 server-side and client-side encryption with EMRFS to help protect the data that you store in Amazon S3. With server-side encryption, Amazon S3 encrypts your data after you upload it.

With client-side encryption, the encryption and decryption process occurs in the EMRFS client on your EMR cluster. You manage the root key for client-side encryption using either the AWS Key Management Service (AWS KMS) or your own key management system.

For more information, see [Specifying Amazon S3 encryption using EMRFS properties](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-emrfs-encryption.html).

### Amazon VPC
<a name="emr-benefits-vpc"></a>

Amazon EMR supports launching clusters in a virtual private cloud (VPC) in Amazon VPC. A VPC is an isolated, virtual network in AWS that provides the ability to control advanced aspects of network configuration and access. For more information, see [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md).

### AWS CloudTrail
<a name="emr-benefits-cloudtrail"></a>

Amazon EMR integrates with CloudTrail to log information about requests made by or on behalf of your AWS account. With this information, you can track who is accessing your cluster when, and the IP address from which they made the request. For more information, see [Logging AWS EMR API calls using AWS CloudTrail](logging-using-cloudtrail.md).

### Amazon EC2 key pairs
<a name="emr-benefits-key-pairs"></a>

You can monitor and interact with your cluster by forming a secure connection between your remote computer and the primary node. You use the Secure Shell (SSH) network protocol for this connection or use Kerberos for authentication. If you use SSH, an Amazon EC2 key pair is required. For more information, see [Use an EC2 key pair for SSH credentials for Amazon EMR](emr-plan-access-ssh.md).

## Monitoring
<a name="emr-benefits-monitoring"></a>

You can use the Amazon EMR management interfaces and log files to troubleshoot cluster issues, such as failures or errors. Amazon EMR provides the ability to archive log files in Amazon S3 so you can store logs and troubleshoot issues even after your cluster terminates. Amazon EMR also provides an optional debugging tool in the Amazon EMR console to browse the log files based on steps, jobs, and tasks. For more information, see [Configure Amazon EMR cluster logging and debugging](emr-plan-debugging.md).

Amazon EMR integrates with CloudWatch to track performance metrics for the cluster and jobs within the cluster. You can configure alarms based on a variety of metrics such as whether the cluster is idle or the percentage of storage used. For more information, see [Monitoring Amazon EMR metrics with CloudWatch](UsingEMR_ViewingMetrics.md).

## Management interfaces
<a name="emr-what-tools"></a>

 There are several ways you can interact with Amazon EMR: 
+ **Console** — A graphical user interface that you can use to launch and manage clusters. With it, you fill out web forms to specify the details of clusters to launch, view the details of existing clusters, debug, and terminate clusters. Using the console is the easiest way to get started with Amazon EMR; no programming knowledge is required. The console is available online at [https://console.aws.amazon.com/elasticmapreduce/home](https://console.aws.amazon.com/elasticmapreduce/home). 
+ **AWS Command Line Interface (AWS CLI)** — A client application you run on your local machine to connect to Amazon EMR and create and manage clusters. The AWS CLI contains a feature-rich set of commands specific to Amazon EMR. With it, you can write scripts that automate the process of launching and managing clusters. If you prefer working from a command line, using the AWS CLI is the best option. For more information, see [Amazon EMR](https://docs.aws.amazon.com/cli/latest/reference/emr/index.html) in the *AWS CLI Command Reference*.
+ **Software Development Kit (SDK)** — SDKs provide functions that call Amazon EMR to create and manage clusters. With them, you can write applications that automate the process of creating and managing clusters. Using an SDK is the best option to extend or customize the functionality of Amazon EMR. Amazon EMR is currently available in the following SDKs: Go, Java, .NET (C\# and VB.NET), Node.js, PHP, Python, and Ruby. For more information about these SDKs, see [Tools for AWS](https://aws.amazon.com/tools/) and [Amazon EMR sample code & libraries](https://docs.aws.amazon.com/code-library/latest/ug/emr_code_examples.html). 
+ **Web Service API** — A low-level interface that you can use to call the web service directly, using JSON. Using the API is the best option to create a custom SDK that calls Amazon EMR. For more information, see the [Amazon EMR API Reference](https://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html). 

# Amazon EMR architecture and service layers
<a name="emr-overview-arch"></a>

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster. This section provides an overview of the layers and the components of each.

**Topics**
+ [Storage](#emr-arch-storage)
+ [Cluster resource management](#emr-arch-resource-management)
+ [Data processing frameworks](#emr-arch-processing-frameworks)
+ [Applications and programs](#emr-arch-applications)

## Storage
<a name="emr-arch-storage"></a>

The storage layer includes the different file systems that are used with your cluster. There are several different types of storage options as follows.

### Hadoop Distributed File System (HDFS)
<a name="emr-storage-hdfs"></a>

Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O. 

For more information, see [Instance storage options and behavior in Amazon EMR](emr-plan-storage.md) in this guide or go to [HDFS User Guide](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html) on the Apache Hadoop website.

### EMR File System (EMRFS)
<a name="emr-storage-emrfs"></a>

Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data, and intermediate results are stored in HDFS.

### Local file system
<a name="emr-storage-lfs"></a>

The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a block of preattached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.

## Cluster resource management
<a name="emr-arch-resource-management"></a>

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks. However, there are other frameworks and applications that are offered in Amazon EMR that do not use YARN as a resource manager. Amazon EMR also has an agent on each node that administers YARN components, keeps the cluster healthy, and communicates with Amazon EMR.

Because Spot Instances are often used to run task nodes, Amazon EMR has default functionality for scheduling YARN jobs so that running jobs do not fail when task nodes running on Spot Instances are terminated. Amazon EMR does this by allowing application master processes to run only on core nodes. The application master process controls running jobs and needs to stay alive for the life of the job.

Amazon EMR release 5.19.0 and later uses the built-in [YARN node labels](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html) feature to achieve this. (Earlier versions used a code patch.) Properties in the `yarn-site` and `capacity-scheduler` configuration classifications are configured by default so that the YARN capacity-scheduler and fair-scheduler take advantage of node labels. Amazon EMR automatically labels core nodes with the `CORE` label, and sets properties so that application masters are scheduled only on nodes with the `CORE` label. Manually modifying related properties in the `yarn-site` and `capacity-scheduler` configuration classifications, or directly in associated XML files, can break this feature.

## Data processing frameworks
<a name="emr-arch-processing-frameworks"></a>

The data processing framework layer is the engine used to process and analyze data. There are many frameworks available that run on YARN or have their own resource management. Different frameworks are available for different kinds of processing needs, such as batch, interactive, in-memory, streaming, and so on. The framework that you choose depends on your use case. This impacts the languages and interfaces available from the application layer, which is the layer used to interact with the data you want to process. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark. 

### Hadoop MapReduce
<a name="emr-processing-framework-mapreduce"></a>

Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results. The Reduce function combines the intermediate results, applies additional algorithms, and produces the final output. There are multiple frameworks available for MapReduce, such as Hive, which automatically generates Map and Reduce programs.
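The division of labor is easy to see in miniature. This pure-Python word count, the canonical MapReduce example, mimics the two phases locally; a real Hadoop job distributes the same logic across the nodes of a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit an intermediate (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine all intermediate pairs that share a key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["the quick fox", "the lazy dog"]))
# counts["the"] == 2; every other word appears once
```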

For more information, go to [How map and reduce operations are actually carried out](http://wiki.apache.org/hadoop2/HadoopMapReduce) on the Apache Hadoop Wiki website.

### Apache Spark
<a name="emr-processing-framework-spark"></a>

Spark is a cluster framework and programming model for processing big data workloads. Like Hadoop MapReduce, Spark is an open-source, distributed processing system but uses directed acyclic graphs for execution plans and in-memory caching for datasets. When you run Spark on Amazon EMR, you can use EMRFS to directly access your data in Amazon S3. Spark supports multiple interactive query modules such as SparkSQL.

For more information, see [Apache Spark on Amazon EMR clusters](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) in the *Amazon EMR Release Guide*.

## Applications and programs
<a name="emr-arch-applications"></a>

Amazon EMR supports many applications such as Hive, Pig, and the Spark Streaming library to provide capabilities such as using higher-level languages to create processing workloads, leveraging machine learning algorithms, making stream processing applications, and building data warehouses. In addition, Amazon EMR also supports open-source projects that have their own cluster management functionality instead of using YARN.

You use various libraries and languages to interact with the applications that you run in Amazon EMR. For example, you can use Java, Hive, or Pig with MapReduce, or use Spark Streaming, Spark SQL, MLlib, and GraphX with Spark.

For more information, see the [Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/).