

# Submit work to an Amazon EMR cluster


This section describes the methods that you can use to submit work to an Amazon EMR cluster. To submit work, you can add steps, or you can interactively submit Hadoop jobs to the primary node.

Consider the following rules of step behavior when you submit steps to a cluster:
+ A step ID can contain up to 256 characters.
+ You can have up to 256 PENDING and RUNNING steps in a cluster.
+ Even if you have 256 active steps running on a cluster, you can interactively submit jobs to the primary node. You can submit an unlimited number of steps over the lifetime of a long-running cluster, but only 256 steps can be RUNNING or PENDING at any given time.
+ With Amazon EMR versions 4.8.0 and later, except version 5.0.0, you can cancel pending steps. For more information, see [Cancel steps when you submit work to an Amazon EMR cluster](emr-cancel-steps.md).
+ With Amazon EMR versions 5.28.0 and later, you can cancel both pending and running steps. You can also choose to run multiple steps in parallel to improve cluster utilization and save cost. For more information, see [Considerations for running multiple steps in parallel when you submit work to Amazon EMR](emr-concurrent-steps.md).

**Note**  
For the best performance, we recommend that you store custom bootstrap actions, scripts, and other files that you want to use with Amazon EMR in an Amazon S3 bucket that is in the same AWS Region as your cluster.

**Topics**
+ [Adding steps to a cluster with the Amazon EMR Management Console](emr-add-steps-console.md)
+ [Adding steps to an Amazon EMR cluster with the AWS CLI](add-step-cli.md)
+ [Considerations for running multiple steps in parallel when you submit work to Amazon EMR](emr-concurrent-steps.md)
+ [Viewing steps after submitting work to an Amazon EMR cluster](emr-view-steps.md)
+ [Cancel steps when you submit work to an Amazon EMR cluster](emr-cancel-steps.md)

# Adding steps to a cluster with the Amazon EMR Management Console

Use the following procedures to add steps to a cluster with the AWS Management Console. For detailed information about how to submit steps for specific big data applications, see the following sections of the *[Amazon EMR Release Guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html)*:
+ [Submit a custom JAR step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-launch-custom-jar-cli.html)
+ [Submit a Hadoop streaming step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/CLI_CreateStreaming.html)
+ [Submit a Spark step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html)
+ [Submit a Pig step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-pig-launch.html#ConsoleCreatingaPigJob)
+ [Run a command or script as a step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-launch-custom-jar-cli.html)
+ [Pass values into steps to run Hive scripts](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html#emr-hive-additional-features)

## Add steps during cluster creation

From the AWS Management Console, you can add steps when you create a cluster.

------
#### [ Console ]

**To add steps when you create a cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Steps**, choose **Add step**. Enter appropriate values in the fields in the **Add step** dialog. For information on formatting your step arguments, see [Add step arguments](#emr-add-steps-console-arguments). Options differ depending on the step type. To add your step and exit the dialog, select **Add step**.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------

## Add steps to a running cluster

With the AWS Management Console, you can add steps to a cluster with the auto-terminate option disabled. 

------
#### [ Console ]

**To add steps to a running cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update.

1. On the **Steps** tab on the cluster details page, select **Add step**. To clone an existing step, choose the **Actions** dropdown menu and select **Clone step**.

1. Enter appropriate values in the fields in the **Add step** dialog. Options differ depending on the step type. To add your step and exit the dialog, choose **Add step**.

------

## Modify the step concurrency level in a running cluster

With the AWS Management Console, you can modify the step concurrency level in a running cluster. 

**Note**  
You can only run multiple steps in parallel with Amazon EMR version 5.28.0 and later. 

------
#### [ Console ]

**To modify step concurrency in a running cluster with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and select the cluster that you want to update. The cluster must be running to change its concurrency attribute.

1. On the **Steps** tab on the cluster details page, find the **Attributes** section. Select **Edit** to change the concurrency. Enter a value between 1 and 256.

------

## Add step arguments


When you use the AWS Management Console to add a step to your cluster, you can specify arguments for that step in the **Arguments** field. You must separate arguments with whitespace and surround string arguments that consist of characters *and* whitespace with quotation marks.

**Example: Correct arguments**  
The following example arguments are formatted correctly for the AWS Management Console, with quotation marks around the final string argument.  

```
bash -c "aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh ."
```
You can also put each argument on a separate line for readability as shown in the following example.  

```
bash 
-c 
"aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh ."
```

**Example: Incorrect arguments**  
The following example arguments are improperly formatted for the AWS Management Console. Notice that the final string argument, `aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh .`, contains whitespace and is not surrounded by quotation marks.  

```
bash -c aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh .
```
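
As a rough way to see why the quotation marks matter, the following Python sketch splits both example strings with the standard library `shlex.split` function. It assumes that POSIX-style quoting approximates how the **Arguments** field separates arguments, which this guide doesn't state explicitly, so treat it as an illustration and not as a description of the console's parser.

```python
import shlex

# Assumption: POSIX-style quoting approximates how the Arguments field is split.
correct = 'bash -c "aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh ."'
incorrect = 'bash -c aws s3 cp s3://amzn-s3-demo-bucket/my-script.sh .'

correct_args = shlex.split(correct)
incorrect_args = shlex.split(incorrect)

# Three arguments: the quoted command stays together as one string.
print(len(correct_args), correct_args)

# Seven arguments: "bash -c" would treat only "aws" as the command string.
print(len(incorrect_args), incorrect_args)
```

With the quoted form, `bash -c` receives the whole `aws s3 cp` command as a single argument. Without quotation marks, the command is broken into separate arguments.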

# Adding steps to an Amazon EMR cluster with the AWS CLI

The following procedures demonstrate how to add steps to a newly created cluster and to a running cluster with the AWS CLI. Both procedures use the `--steps` option to add steps to the cluster. 

**To add steps during cluster creation**
+ Type the following command to create a cluster and add an Apache Spark step. Make sure to replace *`myKey`* with the name of your Amazon EC2 key pair.

  ```
  aws emr create-cluster --name "Test cluster" \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --steps '[{"Args":["spark-submit","--deploy-mode","cluster","--class","org.apache.spark.examples.SparkPi","/usr/lib/spark/examples/jars/spark-examples.jar","5"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]'
  ```
**Note**  
The list of arguments changes depending on the type of step.

  By default, the step concurrency level is `1`. You can set the step concurrency level with the `StepConcurrencyLevel` parameter when you create a cluster. 

  The output is a cluster identifier similar to the following. 

  ```
  {
      "ClusterId": "j-2AXXXXXXGAPLF"
  }
  ```

**To add a step to a running cluster**
+ Type the following command to add a step to a running cluster. Replace `j-2AXXXXXXGAPLF` with your own cluster ID.

  ```
  aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
  --steps '[{"Args":["spark-submit","--deploy-mode","cluster","--class","org.apache.spark.examples.SparkPi","/usr/lib/spark/examples/jars/spark-examples.jar","5"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]'
  ```

   The output is a step identifier similar to the following.

  ```
  {
      "StepIds": [
          "s-Y9XXXXXXAPMD"
      ]
  }
  ```
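
The `--steps` JSON in the preceding commands is easy to mistype by hand. As a minimal sketch, the following Python builds the same Spark step definition with the standard `json` module so that the quoting is generated for you. The helper name `build_spark_step` is hypothetical and not part of any AWS tool, and the sketch omits the empty `Properties` field shown in the CLI examples.

```python
import json


def build_spark_step(name, main_class, jar_path, app_args):
    """Return a step definition shaped like one entry of the --steps JSON list."""
    return {
        "Name": name,
        "Type": "CUSTOM_JAR",
        "ActionOnFailure": "CONTINUE",
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "--class", main_class, jar_path, *app_args,
        ],
    }


step = build_spark_step(
    name="Spark application",
    main_class="org.apache.spark.examples.SparkPi",
    jar_path="/usr/lib/spark/examples/jars/spark-examples.jar",
    app_args=["5"],
)

# Serialize a one-element list, which is the shape that --steps expects.
steps_json = json.dumps([step])
print(steps_json)
```

You can pass the printed string as the value of `--steps`, wrapped in single quotes as in the preceding commands.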

**To modify the StepConcurrencyLevel in a running cluster**

1. In a running cluster, you can modify the `StepConcurrencyLevel` with the `ModifyCluster` API. For example, type the following command to increase the `StepConcurrencyLevel` to `10`. Replace `j-2AXXXXXXGAPLF` with your cluster ID.

   ```
   aws emr modify-cluster --cluster-id j-2AXXXXXXGAPLF --step-concurrency-level 10
   ```

1. The output is similar to the following.

   ```
   {
       "StepConcurrencyLevel": 10
   }
   ```
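
If you change the concurrency level from code instead of the CLI, a sketch along the following lines might help. It assumes that you pass in an SDK client that exposes a `modify_cluster` method with `ClusterId` and `StepConcurrencyLevel` parameters, as the AWS SDK for Python (Boto3) EMR client does. The function name `set_step_concurrency` is hypothetical, and the 1 to 256 range check mirrors the range that the console accepts.

```python
def set_step_concurrency(emr_client, cluster_id, level):
    """Validate the level locally, then call ModifyCluster through the client."""
    # The console accepts values between 1 and 256; reject anything else early.
    if not 1 <= level <= 256:
        raise ValueError(f"step concurrency level must be 1-256, got {level}")

    response = emr_client.modify_cluster(
        ClusterId=cluster_id,
        StepConcurrencyLevel=level,
    )
    # The response echoes the level that is now in effect.
    return response["StepConcurrencyLevel"]


# Example usage (requires boto3 and AWS credentials, so it is left commented):
# import boto3
# set_step_concurrency(boto3.client("emr"), "j-2AXXXXXXGAPLF", 10)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub object before you run it against a real cluster.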

For more information on using Amazon EMR commands in the AWS CLI, see the [AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/emr).

# Considerations for running multiple steps in parallel when you submit work to Amazon EMR

Running multiple steps in parallel when you submit work to Amazon EMR requires preliminary decisions about resource planning and expectations regarding cluster behavior. Consider the following points:
+ Steps running in parallel may complete in any order, but pending steps in the queue transition to the running state in the order in which they were submitted.
+ When you select a step concurrency level for your cluster, you must consider whether the primary node instance type meets the memory requirements of user workloads. The main step executor process runs on the primary node for each step. Running multiple steps in parallel requires more memory and CPU from the primary node than running one step at a time. 
+ To achieve complex scheduling and resource management of concurrent steps, you can use YARN scheduling features such as `FairScheduler` or `CapacityScheduler`. For example, you can use `FairScheduler` with a `queueMaxAppsDefault` set to prevent more than a certain number of jobs from running at a time. 
+ The step concurrency level is subject to the configurations of resource managers. For example, if YARN is configured with only a parallelism of `5`, then you can only have five YARN applications running in parallel even if the `StepConcurrencyLevel` is set to `10`. For more information about configuring resource managers, see [Configure applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html) in the *Amazon EMR Release Guide*.
+ You cannot add a step with an `ActionOnFailure` other than CONTINUE while the step concurrency level of the cluster is greater than 1.
+ If the step concurrency level of a cluster is greater than one, the step `ActionOnFailure` feature does not activate.
+ If a cluster has step concurrency level `1` but has multiple running steps, `TERMINATE_CLUSTER ActionOnFailure` may activate, but `CANCEL_AND_WAIT ActionOnFailure` will not. This edge case arises when the cluster step concurrency level was greater than one but was lowered while multiple steps were running.
+ You can use EMR automatic scaling to scale up and down based on the YARN resources to prevent resource contention. For more information, see [Using automatic scaling with a custom policy for instance groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html) in the *Amazon EMR Management Guide*.
+ When you decrease the step concurrency level, EMR allows any running steps to complete before reducing the number of steps. If the resources are exhausted because the cluster is running too many concurrent steps, we recommend manually canceling any running steps to free up resources.
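
The `ActionOnFailure` restriction in the preceding list can be checked before you submit steps. The following Python sketch flags any step whose `ActionOnFailure` isn't `CONTINUE` when the concurrency level is greater than 1. The function name `steps_blocked_by_concurrency` is hypothetical, and the step dictionaries are assumed to use the same keys as the `--steps` JSON.

```python
def steps_blocked_by_concurrency(steps, step_concurrency_level):
    """Return the names of steps that can't be added at this concurrency level."""
    if step_concurrency_level <= 1:
        # With one step at a time, any ActionOnFailure value is allowed.
        return []
    # A step with no explicit ActionOnFailure is flagged too, so that the
    # caller sets the value deliberately instead of relying on a default.
    return [
        step.get("Name", "<unnamed>")
        for step in steps
        if step.get("ActionOnFailure") != "CONTINUE"
    ]


candidate_steps = [
    {"Name": "Spark application", "ActionOnFailure": "CONTINUE"},
    {"Name": "Cleanup", "ActionOnFailure": "TERMINATE_CLUSTER"},
]

print(steps_blocked_by_concurrency(candidate_steps, 10))  # ['Cleanup']
print(steps_blocked_by_concurrency(candidate_steps, 1))   # []
```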

# Viewing steps after submitting work to an Amazon EMR cluster


You can see up to 10,000 steps that Amazon EMR completed within the last seven days. You can also view 1,000 steps that Amazon EMR completed at any time. This total includes both user-submitted and system steps.

If you submit new steps once the cluster reaches the 1,000 step record limit, Amazon EMR deletes the inactive user-submitted steps whose statuses have been COMPLETED, CANCELLED, or FAILED for more than seven days. If you submit steps beyond the 10,000 step record limit, Amazon EMR deletes the inactive user-submitted step records regardless of their inactive duration. Amazon EMR doesn't remove these records from the log files. Amazon EMR removes them from the AWS console, and they aren't returned when you use the AWS CLI or API to retrieve cluster information. System step records are never removed.

The step information you can view depends on the mechanism used to retrieve cluster information. The following table indicates the step information returned by each of the available options. 

 


| Option | DescribeJobFlow or --describe --jobflow | ListSteps or list-steps | 
| --- | --- | --- | 
| SDK | 256 steps | Up to 10,000 steps | 
| Amazon EMR CLI | 256 steps | NA | 
| AWS CLI | NA | Up to 10,000 steps | 
| API | 256 steps | Up to 10,000 steps | 
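
As a minimal sketch of retrieving step records through `ListSteps`, the following Python counts the steps that are currently `PENDING` or `RUNNING`, which you might compare against the limit of 256 active steps. It assumes an SDK client with a `list_steps` method that accepts `ClusterId`, `StepStates`, and `Marker` and returns `Steps` plus an optional `Marker` pagination token, as the Boto3 EMR client does. The function name `count_active_steps` is hypothetical.

```python
def count_active_steps(emr_client, cluster_id):
    """Count PENDING and RUNNING steps, following the Marker pagination token."""
    count = 0
    request = {"ClusterId": cluster_id, "StepStates": ["PENDING", "RUNNING"]}
    while True:
        page = emr_client.list_steps(**request)
        count += len(page.get("Steps", []))
        marker = page.get("Marker")
        if not marker:
            # No token means this was the last page of results.
            return count
        request["Marker"] = marker


# Example usage (requires boto3 and AWS credentials, so it is left commented):
# import boto3
# active = count_active_steps(boto3.client("emr"), "j-2AXXXXXXGAPLF")
# print(f"{active} of 256 active step slots in use")
```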

# Cancel steps when you submit work to an Amazon EMR cluster


You can cancel pending and running steps from the AWS Management Console, the AWS CLI, or the Amazon EMR API when you submit work to your cluster.

------
#### [ Console ]

**To cancel steps with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then select the cluster that you want to update.

1. On the **Steps** tab on the cluster details page, select the check box next to the step that you want to cancel. Choose the **Actions** dropdown menu and then select **Cancel steps**.

1. In the **Cancel the step** dialog, choose to either cancel the step and wait for it to exit, or cancel the step and force it to exit. Then choose **Confirm**.

1. The status of the steps in the **Steps** table changes to `CANCELLED`. 

------
#### [ CLI ]

**To cancel steps with the AWS CLI**
+ Use the `aws emr cancel-steps` command, specifying the cluster and steps to cancel. The following example demonstrates an AWS CLI command to cancel two steps.

  ```
  aws emr cancel-steps --cluster-id j-2QUAXXXXXXXXX \
  --step-ids s-3M8DXXXXXXXXX s-3M8DXXXXXXXXX \
  --step-cancellation-option SEND_INTERRUPT
  ```

With Amazon EMR version 5.28.0 and later, you can choose one of the following two cancellation options for the `StepCancellationOption` parameter when you cancel steps. 
+ `SEND_INTERRUPT` – This is the default option. When a step cancellation request is received, EMR sends a `SIGTERM` signal to the step. Add a `SIGTERM` signal handler to your step logic to catch this signal and terminate descendant step processes or wait for them to complete.
+ `TERMINATE_PROCESS` – When this option is selected, EMR sends a `SIGKILL` signal to the step and all its descendant processes, which terminates them immediately.

------
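
The same cancellation can be requested through an SDK. The following Python sketch assumes a client with a `cancel_steps` method that accepts `ClusterId`, `StepIds`, and `StepCancellationOption` and returns a `CancelStepsInfoList`, as the Boto3 EMR client does. The function name `request_step_cancellation` is hypothetical.

```python
VALID_CANCELLATION_OPTIONS = ("SEND_INTERRUPT", "TERMINATE_PROCESS")


def request_step_cancellation(emr_client, cluster_id, step_ids,
                              option="SEND_INTERRUPT"):
    """Ask EMR to cancel the given steps and return a status for each step ID."""
    if option not in VALID_CANCELLATION_OPTIONS:
        raise ValueError(f"unknown StepCancellationOption: {option}")

    response = emr_client.cancel_steps(
        ClusterId=cluster_id,
        StepIds=list(step_ids),
        StepCancellationOption=option,
    )
    # Each entry reports whether the cancellation request was accepted.
    return {
        info["StepId"]: info["Status"]
        for info in response["CancelStepsInfoList"]
    }


# Example usage (requires boto3 and AWS credentials, so it is left commented):
# import boto3
# request_step_cancellation(boto3.client("emr"), "j-2QUAXXXXXXXXX",
#                           ["s-3M8DXXXXXXXXX"], option="TERMINATE_PROCESS")
```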

**Considerations for canceling steps**
+ Canceling a running or pending step removes that step from the active step count.
+ Canceling a running step does not allow a pending step to start running, assuming no change to `stepConcurrencyLevel`.
+ Canceling a running step does not trigger the step `ActionOnFailure`.
+ For EMR 5.32.0 and later, the `SEND_INTERRUPT` `StepCancellationOption` sends a `SIGTERM` signal to the step child process. Watch for this signal, clean up, and shut down gracefully. The `TERMINATE_PROCESS` `StepCancellationOption` sends a `SIGKILL` signal to the step child process and all of its descendant processes; however, asynchronous processes are not affected.