

# Configure applications


To override the default configurations for an application, you can supply a configuration object. You can either use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties correspond to the application settings you want to change. You can specify multiple classifications for multiple applications in a single JSON object.

**Warning**  
Amazon EMR Describe and List API operations emit custom and configurable settings, which are used as a part of Amazon EMR job flows, in plaintext. To provide sensitive information, such as passwords, in these settings, see [Store sensitive configuration data in AWS Secrets Manager](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/storing-sensitive-data.html).

The configuration classifications that are available vary by Amazon EMR release version. For a list of configuration classifications that are supported in a particular release version, refer to the page for that release version under [About Amazon EMR Releases](emr-release-components.md).

The following is an example JSON file for a list of configurations.

```
[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
```
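
Before you pass a file like this to Amazon EMR, it can be worth validating it locally. The following sketch (assuming the list above is saved as `configurations.json` and that `jq` is installed) checks that the file parses and lists the classifications it configures:

```
# Save the example configuration list shown above.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
EOF

# jq exits with a nonzero status if the JSON is malformed,
# and otherwise prints each classification name.
jq -r '.[].Classification' configurations.json
# → core-site
#   mapred-site
```

Catching a JSON syntax error locally is quicker than waiting for a cluster launch to fail.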

A configuration classification often maps to an application-specific configuration file. For example, the `hive-site` classification maps to settings in the `hive-site.xml` configuration file for Hive. An exception to this is the no-longer-supported bootstrap action `configure-daemons`, which was used to set environment parameters such as `--namenode-heap-size`. Options like this are subsumed into the `hadoop-env` and `yarn-env` classifications with their own nested export classifications. If any classification ends in `env`, use the `export` sub-classification.

Another exception is `s3get`, which was used to place a custom `EncryptionMaterialsProvider` object on each node in a cluster for use in client-side encryption. An option was added to the `emrfs-site` classification for this purpose.

The following is an example of the `hadoop-env` classification.

```
[
  {
    "Classification": "hadoop-env",
    "Properties": {
      
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        },
        "Configurations": [
          
        ]
      }
    ]
  }
]
```

The following is an example of the `yarn-env` classification.

```
[
  {
    "Classification": "yarn-env",
    "Properties": {
      
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "YARN_RESOURCEMANAGER_OPTS": "-Xdebug -Xrunjdwp:transport=dt_socket"
        },
        "Configurations": [
          
        ]
      }
    ]
  }
]
```

The following settings do not belong to a configuration file but are used by Amazon EMR to configure multiple settings on your behalf.


**Settings curated by Amazon EMR**  

| Application | Release label classification | Valid properties | When to use | 
| --- | --- | --- | --- | 
| Spark | spark | maximizeResourceAllocation | Configure executors to utilize the maximum resources of each node. | 
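
For example, to have Amazon EMR configure Spark executors to use each node's maximum resources, you supply the `spark` classification in the same format as the other examples in this topic. Note that the value is the string `"true"`:

```
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```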

**Topics**
+ [Configure applications when you create a cluster](emr-configure-apps-create-cluster.md)
+ [Reconfigure an instance group in a running cluster](emr-configure-apps-running-cluster.md)
+ [Store sensitive configuration data in AWS Secrets Manager](storing-sensitive-data.md)
+ [Configure applications to use a specific Java Virtual Machine](configuring-java8.md)

# Configure applications when you create a cluster


When you create a cluster, you can override the default configurations for applications using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. 

To override the default configuration for an application, you specify custom values in a configuration classification. A configuration classification corresponds to a configuration XML file for an application, such as `hive-site.xml`. 

Configuration classifications vary by Amazon EMR release version. For a list of configuration classifications that are available in a specific release version, see the release detail page for that version. For example, [Amazon EMR release 6.4.0](emr-640-release.md#emr-640-class).

## Supply a configuration in the console when you create a cluster


To supply a configuration, navigate to the **Create cluster** page and expand **Software settings**. You can then enter the configuration directly in JSON or in the shorthand syntax demonstrated in shadow text in the console. Alternatively, you can provide an Amazon S3 URI for a file that contains a JSON `Configurations` object.
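
The console's shadow text demonstrates a shorthand of roughly this shape (shown here with the `core-site` example from earlier in this topic; the exact syntax the console accepts is what its shadow text displays):

```
classification=core-site,properties=[hadoop.security.groups.cache.secs=250]
```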

To supply a configuration for an instance group, choose a cluster in your list of clusters, then choose the **Configurations** tab. In the **Instance group configurations** table, choose the instance group to edit, then choose **Reconfigure**.

## Supply a configuration using the AWS CLI when you create a cluster


You can provide a configuration to **create-cluster** by supplying a path to a JSON file stored locally or in Amazon S3. The following example assumes that you are using default roles for Amazon EMR and that the roles have been created. If you need to create the roles, run `aws emr create-default-roles` first.

If your configuration is in your local directory, you can use the following example command.

```
aws emr create-cluster --use-default-roles --release-label emr-7.12.0 --applications Name=Hive \
--instance-type m5.xlarge --instance-count 3 --configurations file://./configurations.json
```

If your configuration is in an Amazon S3 path, you'll need to set up the following workaround before passing the Amazon S3 path to the `create-cluster` command.

```
#!/bin/sh
# Assume the ConfigurationS3Path is not public and that it is in the same AWS account as the EMR cluster
ConfigurationS3Path="s3://amzn-s3-demo-bucket/config.json"
# Get a presigned HTTP URL for the S3 path
ConfigurationURL=$(aws s3 presign "$ConfigurationS3Path" --expires-in 300)
# Fetch the presigned URL and minify the JSON so that it spans only a single line
Configurations=$(curl -s "$ConfigurationURL" | jq -c .)
aws emr create-cluster --use-default-roles --release-label emr-5.34.0 --instance-type m5.xlarge --instance-count 2 --applications Name=Hadoop Name=Spark --configurations "$Configurations"
```
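
The `jq -c .` step matters because `create-cluster` expects the inline `--configurations` value on a single line. The following stand-alone sketch shows the minification step on its own, without any AWS calls:

```
# Pretty-printed JSON, as you would typically store it in Amazon S3.
Pretty='[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  }
]'

# jq -c compacts it onto a single line suitable for --configurations.
Configurations=$(printf '%s' "$Pretty" | jq -c .)
echo "$Configurations"
# → [{"Classification":"core-site","Properties":{"hadoop.security.groups.cache.secs":"250"}}]
```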

## Supply a configuration using the Java SDK when you create a cluster


The following program excerpt shows how to supply a configuration using the AWS SDK for Java.

```
Application hive = new Application().withName("Hive");

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");

Configuration myHiveConfig = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Create cluster with ReleaseLabel")
    .withReleaseLabel("emr-5.20.0")
    .withApplications(hive)
    .withConfigurations(myHiveConfig)
    .withServiceRole("EMR_DefaultRole")
    .withJobFlowRole("EMR_EC2_DefaultRole")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("myEc2Key")
        .withInstanceCount(3)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m4.large")
        .withSlaveInstanceType("m4.large")
    );
```

# Reconfigure an instance group in a running cluster


With Amazon EMR version 5.21.0 and later, you can reconfigure cluster applications and specify additional configuration classifications for each instance group in a running cluster. To do so, you can use the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.

When you update an application configuration for an instance group in the new Amazon EMR console, the console attempts to merge the new configuration with the existing configuration to create a new, active configuration. In the unusual case where Amazon EMR can't merge the configuration, the console alerts you. 

After you submit a reconfiguration request for an instance group, Amazon EMR assigns a version number to the new configuration specification. You can track the version number of a configuration, or the state of an instance group, by viewing CloudWatch events. For more information, see [Monitor CloudWatch Events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html).

**Note**  
You can only override, and not delete, cluster configurations that were specified during cluster creation. If there are differences between the existing configuration and the file that you supply, Amazon EMR resets manually modified configurations, such as configurations that you have modified while connected to your cluster using SSH, to the cluster defaults for the specified instance group. 

## Considerations when you reconfigure an instance group


**Reconfiguration actions**  
When you submit a reconfiguration request using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK, Amazon EMR checks the existing on-cluster configuration file. If there are differences between the existing configuration and the file that you supply, Amazon EMR initiates reconfiguration actions, restarts some applications, and resets any manually modified configurations, such as configurations that you have modified while connected to your cluster using SSH, to the cluster defaults for the specified instance group.   
Amazon EMR performs some default actions during every instance group reconfiguration. These default actions might conflict with cluster customizations that you have made, and result in reconfiguration failures. For information about how to troubleshoot reconfiguration failures, see [Troubleshoot instance group reconfiguration](#emr-configure-apps-running-cluster-troubleshoot).
Amazon EMR also initiates reconfiguration actions for the configuration classifications that you specify in your request. For a complete list of these actions, see the Configuration Classifications section for the version of Amazon EMR that you use. For example, [6.2.0 Configuration Classifications](emr-620-release.md#emr-620-class).  
The Amazon EMR Release Guide only lists reconfiguration actions starting with Amazon EMR versions 5.32.0 and 6.2.0.

**Service disruption**  
Amazon EMR follows a rolling process to reconfigure instances in the Task and Core instance groups. Only 10 percent of the instances in an instance group are modified and restarted at a time. This process takes longer to finish but reduces the chance of potential application failure in a running cluster.   
To run YARN jobs during a YARN restart, you can either create an Amazon EMR cluster with multiple master nodes or set `yarn.resourcemanager.recovery.enabled` to `true` in your `yarn-site` configuration classification. For more information about using multiple master nodes, see [High availability YARN ResourceManager](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-YARN).

**Application validation**  
Amazon EMR checks that each application on the cluster is running after the reconfiguration restart process. If any application is unavailable, the overall reconfiguration operation fails. If a reconfiguration operation fails, Amazon EMR reverts the configuration parameters to the previous working version.  
To avoid reconfiguration failure, we recommend that you only install applications on your cluster that you plan to use. We also recommend that you make sure all cluster applications are healthy and running before you submit a reconfiguration request.

**Types of reconfiguration**  
You can reconfigure an instance group in one of two ways:  
+ **Overwrite**. The default reconfiguration method, and the only one available in Amazon EMR releases earlier than 5.35.0 and 6.6.0. This method indiscriminately overwrites any on-cluster configuration files with the newly submitted configuration set, erasing any changes to configuration files made outside the reconfiguration API.
+ **Merge**. Supported for Amazon EMR releases 5.35.0 and 6.6.0 and later, except in the Amazon EMR console, which does not support it. This method merges the newly submitted configurations with the configurations that already exist on the cluster: it only adds or modifies the configurations that you submit, and it preserves existing configurations.
Amazon EMR continues to overwrite some essential Hadoop configurations that it needs to keep the service running correctly.

**Limitations**

When you reconfigure an instance group in a running cluster, consider the following limitations:
+ Non-YARN applications can fail during restart or cause cluster issues, especially if the applications aren't configured properly. Clusters approaching maximum memory and CPU usage may run into issues after the restart process. This is especially true for the master instance group.
+ You can't submit a reconfiguration request when an instance group is being resized. If a reconfiguration is initiated while an instance group is resizing, reconfiguration cannot start until the instance group has completed resizing, and vice versa. 
+ After reconfiguring an instance group, Amazon EMR restarts the applications to allow the new configurations to take effect. Job failure or other unexpected application behavior might occur if the applications are in use during reconfiguration. 
+ If a reconfiguration for an instance group fails, Amazon EMR reverts the configuration parameters to the previous working version. If the reversion process also fails, you must submit a new `ModifyInstanceGroups` request to recover the instance group from the `SUSPENDED` state.
+ Reconfiguration requests for Phoenix configuration classifications are only supported in Amazon EMR version 5.23.0 and later, and are not supported in Amazon EMR version 5.21.0 or 5.22.0. 
+ Reconfiguration requests for HBase configuration classifications are only supported in Amazon EMR version 5.30.0 and later, and are not supported in Amazon EMR versions 5.23.0 through 5.29.0. 
+ Amazon EMR supports application reconfiguration requests on an Amazon EMR cluster with multiple primary nodes only in Amazon EMR versions 5.27.0 and later.
+ Reconfiguring `hdfs-encryption-zones` classification or any of the Hadoop KMS configuration classifications is not supported on an Amazon EMR cluster with multiple primary nodes.
+ Amazon EMR currently doesn't support certain reconfiguration requests for the capacity scheduler that require restarting the YARN ResourceManager. For example, you cannot completely remove a queue.

## Reconfigure an instance group in the console


**Note**  
The Amazon EMR console does not support **Merge** type reconfigurations.

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. In the cluster list under **Name**, choose the active cluster that you want to reconfigure.

1. Open the cluster details page for the cluster, and go to the **Configurations** tab. 

1. In the **Filter** drop-down list, select the instance group that you want to reconfigure. 

1. In the **Reconfigure** drop-down menu, choose either **Edit in table** or **Edit in JSON file**.
   + **Edit in table** - In the configuration classification table, edit the property and value for existing configurations, or choose **Add configuration** to supply additional configuration classifications. 
   + **Edit in JSON file** - Enter the configuration directly in JSON, or use the shorthand syntax demonstrated in shadow text. Alternatively, provide an Amazon S3 URI for a file that contains a JSON `Configurations` object.
**Note**  
The **Source** column in the configuration classification table indicates whether the configuration is supplied when you create a cluster, or when you specify additional configurations for this instance group. You can edit the configurations for an instance group from both sources. You cannot delete initial cluster configurations, but you can override them for an instance group.   
You can also add or edit nested configuration classifications directly in the table. For example, to supply an additional `export` sub-classification of `hadoop-env`, add a `hadoop.export` configuration classification in the table. Then, provide a specific property and value for this classification. 

1. (Optional) Select **Apply this configuration to all active instance groups**.

1. Save the changes.

## Reconfigure an instance group using the CLI


Use the **modify-instance-groups** command to specify a new configuration for an instance group in a running cluster.

**Note**  
In the following examples, replace *<j-2AL4XXXXXX5T9>* with your cluster ID, and replace *<ig-1xxxxxxx9>* with your instance group ID.

**Example – Replace a configuration for an instance group**  
The following example references a configuration JSON file called `instanceGroups.json` to edit the property of the YARN NodeManager disk health checker for an instance group.  

1. Prepare your configuration classification, and save it as `instanceGroups.json` in the same directory where you will run the command.

   ```
   [
      {
         "InstanceGroupId":"<ig-1xxxxxxx9>",
         "Configurations":[
            {
               "Classification":"yarn-site",
               "Properties":{
                  "yarn.nodemanager.disk-health-checker.enable":"true",
                  "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
               },
               "Configurations":[]
            }
         ]
      }
   ]
   ```

1. Run the following command.

   ```
   aws emr modify-instance-groups --cluster-id <j-2AL4XXXXXX5T9> \
   --instance-groups file://instanceGroups.json
   ```

**Example – Add a configuration to an instance group**  
If you want to add a configuration to an instance group, you must include all previously specified configurations for that instance group in your new `ModifyInstanceGroups` request. Otherwise, the previously specified configurations are removed.  
The following example adds a property for the YARN NodeManager virtual memory checker. The configuration also includes previously specified values for the YARN NodeManager disk health checker so that the values won't be overwritten.  

1. Prepare the following contents in `instanceGroups.json` and save it in the same directory where you will run the command.

   ```
   [
      {
         "InstanceGroupId":"<ig-1xxxxxxx9>",
         "Configurations":[
            {
               "Classification":"yarn-site",
               "Properties":{
                  "yarn.nodemanager.disk-health-checker.enable":"true",
                  "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0",
                  "yarn.nodemanager.vmem-check-enabled":"true",
                  "yarn.nodemanager.vmem-pmem-ratio":"3.0"
               },
               "Configurations":[]
            }
         ]
      }
   ]
   ```

1. Run the following command.

   ```
   aws emr modify-instance-groups --cluster-id <j-2AL4XXXXXX5T9> \
   --instance-groups file://instanceGroups.json
   ```
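
Maintaining the combined file by hand is error-prone. As an illustrative sketch (the file names here are hypothetical), you can use `jq` to fold the new properties into the previously submitted classification so that an **Overwrite** request doesn't drop earlier settings:

```
# previous.json: the classifications you last submitted for the instance group.
cat > previous.json <<'EOF'
[
   {
      "Classification":"yarn-site",
      "Properties":{
         "yarn.nodemanager.disk-health-checker.enable":"true",
         "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
      },
      "Configurations":[]
   }
]
EOF

# Add the new virtual memory checker properties to the existing
# yarn-site classification without retyping the old ones.
jq '(.[] | select(.Classification == "yarn-site") | .Properties) += {
      "yarn.nodemanager.vmem-check-enabled": "true",
      "yarn.nodemanager.vmem-pmem-ratio": "3.0"
    }' previous.json > merged-configs.json

# The merged classification now carries all four properties.
jq '.[0].Properties | length' merged-configs.json
# → 4
```

You would then wrap the merged classifications in the `InstanceGroupId` structure shown above before passing the file to `modify-instance-groups`.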

**Example – Add a configuration to an instance group with Merge type reconfiguration**  
When you use the default **Overwrite** reconfiguration method to add a configuration, you must include all previously specified configurations for that instance group in your new `ModifyInstanceGroups` request. Otherwise, the **Overwrite** removes the configurations that you previously specified. You don't need to do this with **Merge** reconfiguration. Instead, your request only needs to include the new configurations.  
The following example adds a property for the YARN NodeManager virtual memory checker. Because this is a **Merge** type reconfiguration, it does not overwrite previously specified values for the YARN NodeManager disk health checker.  

1. Prepare the following contents in `instanceGroups.json` and save it in the same directory where you will run the command.

   ```
   [
      {
         "InstanceGroupId":"<ig-1xxxxxxx9>",
         "ReconfigurationType":"MERGE",
         "Configurations":[
            {
               "Classification":"yarn-site",
               "Properties":{
                  "yarn.nodemanager.vmem-check-enabled":"true",
                  "yarn.nodemanager.vmem-pmem-ratio":"3.0"
               },
               "Configurations":[]
            }
         ]
      }
   ]
   ```

1. Run the following command.

   ```
   aws emr modify-instance-groups --cluster-id <j-2AL4XXXXXX5T9> \
   --instance-groups file://instanceGroups.json
   ```

**Example – Delete a configuration for an instance group**  
To delete a configuration for an instance group, submit a new reconfiguration request that excludes the previous configuration.   
You can only override the initial *cluster* configuration. You cannot delete it.
For example, to delete the configuration for the YARN NodeManager disk health checker from the previous example, submit a new `instanceGroups.json` with the following contents.   

```
[
   {
      "InstanceGroupId":"<ig-1xxxxxxx9>",
      "Configurations":[
         {
            "Classification":"yarn-site",
            "Properties":{
               "yarn.nodemanager.vmem-check-enabled":"true",
               "yarn.nodemanager.vmem-pmem-ratio":"3.0"
            },
            "Configurations":[]
         }
      ]
   }
]
```
To delete all of the configurations in your last reconfiguration request, submit a reconfiguration request with an empty array of configurations. For example,  

```
[
   {
      "InstanceGroupId":"<ig-1xxxxxxx9>",
      "Configurations":[]
   }
]
```

**Example – Reconfigure and resize an instance group in one request**  
The following example JSON demonstrates how to reconfigure and resize an instance group in the same request.  

```
[
   {
      "InstanceGroupId":"<ig-1xxxxxxx9>",
      "InstanceCount":5,
      "EC2InstanceIdsToTerminate":["i-123"],
      "ForceShutdown":true,
      "ShrinkPolicy":{
         "DecommissionTimeout":10,
         "InstanceResizePolicy":{
            "InstancesToTerminate":["i-123"],
            "InstancesToProtect":["i-345"],
            "InstanceTerminationTimeout":20
         }
      },
      "Configurations":[
         {
            "Classification":"yarn-site",
            "Configurations":[],
            "Properties":{
               "yarn.nodemanager.disk-health-checker.enable":"true",
               "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage":"100.0"
            }
         }
      ]
   }
]
```

## Reconfigure an instance group using the Java SDK


**Note**  
In the following examples, replace *<j-2AL4XXXXXX5T9>* with your cluster ID, and replace *<ig-1xxxxxxx9>* with your instance group ID.

The following code snippet provides a new configuration for an instance group using the AWS SDK for Java.

```
AWSCredentials credentials = new BasicAWSCredentials("access-key", "secret-key");
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);

Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");
        
Configuration configuration = new Configuration()
    .withClassification("hive-site")
    .withProperties(hiveProperties);
    
InstanceGroupModifyConfig igConfig = new InstanceGroupModifyConfig()
    .withInstanceGroupId("<ig-1xxxxxxx9>")
    .withReconfigurationType("MERGE")
    .withConfigurations(configuration);

ModifyInstanceGroupsRequest migRequest = new ModifyInstanceGroupsRequest()
    .withClusterId("<j-2AL4XXXXXX5T9>")
    .withInstanceGroups(igConfig);

emr.modifyInstanceGroups(migRequest);
```

The following code snippet deletes a previously specified configuration for an instance group by supplying an empty array of configurations.

```
List<Configuration> configurations = new ArrayList<Configuration>();

InstanceGroupModifyConfig igConfig = new InstanceGroupModifyConfig()
    .withInstanceGroupId("<ig-1xxxxxxx9>")
    .withConfigurations(configurations);

ModifyInstanceGroupsRequest migRequest = new ModifyInstanceGroupsRequest()
    .withClusterId("<j-2AL4XXXXXX5T9>")
    .withInstanceGroups(igConfig);

emr.modifyInstanceGroups(migRequest);
```

## Troubleshoot instance group reconfiguration

If the reconfiguration process for an instance group fails, Amazon EMR reverts the reconfiguration and logs a failure message using an Amazon CloudWatch event. The event provides a brief summary of the reconfiguration failure. It lists the instances for which reconfiguration has failed and corresponding failure messages. The following is an example failure message.

```
The reconfiguration operation for instance group ig-1xxxxxxx9 in Amazon EMR cluster j-2AL4XXXXXX5T9 (ExampleClusterName) 
failed at 2021-01-01 00:00 UTC and took 2 minutes to fail. Failed configuration version is example12345. 
Failure message: Instance i-xxxxxxx1, i-xxxxxxx2, i-xxxxxxx3 failed with message "This is an example failure message".
```

To gather more data about a reconfiguration failure, you can check the node provisioning logs. Doing so is particularly useful when you receive a message like the following.

```
i-xxxxxxx1 failed with message "Unable to complete transaction and some changes were applied."
```

------
#### [ On the node ]

**To access node provisioning logs by connecting to a node**

1. Use SSH to connect to the node on which reconfiguration has failed. For instructions, see [Connect to your Linux instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html) in the *Amazon EC2 User Guide for Linux Instances*.

1. Navigate to the following directory, which contains the node provisioning log files.

   ```
   /mnt/var/log/provision-node/
   ```

1. Open the `reports` subdirectory and search for the node provisioning report for your reconfiguration. The `reports` directory organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process.

   The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

1. You can examine a report using a file viewer like `zless`, as in the following example.

   ```
   zless 202104061715.yaml.gz
   ```

------
#### [ Amazon S3 ]

**To access node provisioning logs using Amazon S3**

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Open the Amazon S3 bucket that you specified when you configured the cluster to archive log files.

1. Navigate to the following folder, which contains the node provisioning log files:

   ```
   amzn-s3-demo-bucket/elasticmapreduce/<cluster id>/node/<instance id>/provision-node/
   ```

1. Open the `reports` folder and search for the node provisioning report for your reconfiguration. The `reports` folder organizes logs by reconfiguration version number, universally unique identifier (UUID), Amazon EC2 instance IP address, and timestamp. Each report is a compressed YAML file that contains detailed information about the reconfiguration process.

   The following is an example report file name and path.

   ```
   /reports/2/ca598xxx-cxxx-4xxx-bxxx-6dbxxxxxxxxx/ip-10-73-xxx-xxx.ec2.internal/202104061715.yaml.gz
   ```

1. To view a log file, you can download it from Amazon S3 to your local machine as a text file. For instructions, see [Downloading an object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html).

------

Each log file contains a detailed provisioning report for the associated reconfiguration. To find error message information, you can search for the `err` log level of a report. Report format depends on the version of Amazon EMR on your cluster. 

The following example shows error information for Amazon EMR release versions earlier than 5.32.0 and 6.2.0.

```
- !ruby/object:Puppet::Util::Log
      level: !ruby/sym err
      tags: 
        - err
      message: "Example detailed error message."
      source: Puppet
      time: 2021-01-01 00:00:00.000000 +00:00
```

Amazon EMR release versions 5.32.0 and 6.2.0 and later use the following format instead.

```
- level: err
  message: 'Example detailed error message.'
  source: Puppet
  tags:
  - err
  time: '2021-01-01 00:00:00.000000 +00:00'
  file: 
  line:
```
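
Because each report is gzip-compressed, you can search it without decompressing it first. The following sketch builds a tiny sample file in the newer format (the entries are made up) and pulls out the error-level records with `zgrep`:

```
# A minimal sample report in the 5.32.0/6.2.0+ format (contents are illustrative).
cat > sample-report.yaml <<'EOF'
- level: notice
  message: 'Applied configuration successfully.'
  source: Puppet
- level: err
  message: 'Example detailed error message.'
  source: Puppet
EOF
gzip -f sample-report.yaml

# Print each error-level entry together with the message line that follows it.
zgrep -A 1 'level: err' sample-report.yaml.gz
```

The same `zgrep` invocation works against a real report file such as `202104061715.yaml.gz`.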

# Store sensitive configuration data in AWS Secrets Manager


The Amazon EMR describe and list API operations that emit custom configuration data (such as `DescribeCluster` and `ListInstanceGroups`) do so in plaintext. Amazon EMR integrates with AWS Secrets Manager so that you can store your data in Secrets Manager and use the secret ARN in your configurations. This way, you don't pass sensitive configuration data to Amazon EMR in plaintext and expose it to external APIs. If you indicate that a key-value pair contains an ARN for a secret stored in Secrets Manager, Amazon EMR retrieves this secret when it sends configuration data to the cluster. Amazon EMR doesn't send the annotation when it uses external APIs to display the configuration.

## Create a secret


To create a secret, follow the steps in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the *AWS Secrets Manager User Guide*. In **Step 3**, you must choose the **Plaintext** field to enter your sensitive value.

Note that while Secrets Manager allows a secret to contain up to 65,536 bytes, Amazon EMR limits the combined length of the property key (excluding the annotation) and the retrieved secret value to 1,024 characters.
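
If you want to check that limit before launching, a quick shell sketch works (the key and value below are placeholders, not real credentials):

```
# Combined length of the property key (without the EMR.secret@ annotation)
# and the secret value must stay within 1,024 characters.
key="presto.s3.secret-key"
secret_value="example-retrieved-secret-value"
combined=$(( ${#key} + ${#secret_value} ))
echo "combined length: $combined"
[ "$combined" -le 1024 ] && echo "within the 1,024-character limit"
```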

## Grant Amazon EMR access to retrieve the secret


Amazon EMR uses an IAM service role to provision and manage clusters for you. The service role for Amazon EMR defines the allowable actions for Amazon EMR when it provisions resources and performs service-level tasks that aren’t performed in the context of an Amazon EC2 instance running within a cluster. For more information about service roles, see [Service role for Amazon EMR (EMR role)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html) and [Customize IAM roles](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles-custom.html).

To allow Amazon EMR to retrieve the secret value from Secrets Manager, add the following policy statement to your Amazon EMR role when you launch your cluster.

```
{
   "Sid":"AllowSecretsRetrieval",
   "Effect":"Allow",
   "Action":"secretsmanager:GetSecretValue",
   "Resource":[
      "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
   ]
}
```

If you create the secret with a customer-managed AWS KMS key, you must also add `kms:Decrypt` permission to the Amazon EMR role for the key that you use. For more information, see [Authentication and access control for AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access.html) in the *AWS Secrets Manager User Guide*.

## Use the secret in a configuration classification


You can add the `EMR.secret@` annotation to any configuration property to indicate that its key-value pair contains an ARN for a secret stored in Secrets Manager.

The following example shows how to provide a secret ARN in a configuration classification:

```
{
   "Classification":"core-site",
   "Properties":{
      "presto.s3.access-key":"<sensitive-access-key>",
      "EMR.secret@presto.s3.secret-key":"arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
   }
}
```

When you create your cluster and submit your annotated configuration, Amazon EMR validates the configuration properties. If your configuration is valid, Amazon EMR strips the annotation from the configuration and retrieves the secret from Secrets Manager to create the actual configuration before applying it to the cluster:

```
{
   "Classification":"core-site",
   "Properties":{
      "presto.s3.access-key":"<sensitive-access-key>",
      "presto.s3.secret-key":"<my-secret-key-retrieved-from-Secrets-Manager>"
   }
}
```

When you call an action like `DescribeCluster`, Amazon EMR returns the current application configuration on the cluster. If an application configuration property is marked as containing a secret ARN, then the application configuration returned by the `DescribeCluster` call contains the ARN and not the secret value. This ensures that the secret value is only visible on the cluster:

```
{
   "Classification":"core-site",
   "Properties":{
      "presto.s3.access-key":"<sensitive-access-key>",
      "presto.s3.secret-key":"arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
   }
}
```
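The transformation that Amazon EMR applies can be illustrated with a short sketch. The resolver below is a stand-in for the actual Secrets Manager lookup, not Amazon EMR's implementation:

```python
# Illustration only: strip the "EMR.secret@" annotation from property
# keys and substitute the resolved secret value. The resolver here is a
# placeholder for the real Secrets Manager lookup.
ANNOTATION = "EMR.secret@"

def resolve_properties(properties: dict, resolve_secret) -> dict:
    """Return properties with annotations stripped and secrets resolved."""
    resolved = {}
    for key, value in properties.items():
        if key.startswith(ANNOTATION):
            resolved[key[len(ANNOTATION):]] = resolve_secret(value)
        else:
            resolved[key] = value
    return resolved

fake_store = {"arn:aws:secretsmanager:us-east-1:111122223333:secret:MySecret": "s3cr3t"}
props = {
    "presto.s3.access-key": "AKIAEXAMPLE",
    "EMR.secret@presto.s3.secret-key": "arn:aws:secretsmanager:us-east-1:111122223333:secret:MySecret",
}
print(resolve_properties(props, fake_store.get))
# {'presto.s3.access-key': 'AKIAEXAMPLE', 'presto.s3.secret-key': 's3cr3t'}
```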

## Update the secret value


Amazon EMR retrieves the secret value from an annotated configuration whenever the attached instance group is starting, reconfiguring, or resizing. You can use Secrets Manager to modify the value of a secret used in the configuration of a running cluster. When you do, submit a reconfiguration request to each instance group that should receive the updated value. For more information about how to reconfigure an instance group, and considerations for doing so, see [Reconfigure an instance group in a running cluster](emr-configure-apps-running-cluster.md).
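As a sketch, the reconfiguration request for one instance group might be assembled like the following; the cluster ID, instance group ID, and secret ARN are placeholders. You would pass the resulting dictionary to the EMR `ModifyInstanceGroups` API, for example through `boto3.client("emr").modify_instance_groups(**request)`:

```python
# Sketch: build a ModifyInstanceGroups request that resubmits the
# annotated configuration so the instance group re-fetches the updated
# secret value. All IDs and the ARN below are placeholders.
request = {
    "ClusterId": "j-XXXXXXXXXXXXX",
    "InstanceGroups": [
        {
            "InstanceGroupId": "ig-XXXXXXXXXXXXX",
            "Configurations": [
                {
                    "Classification": "core-site",
                    "Properties": {
                        "EMR.secret@presto.s3.secret-key": (
                            "arn:aws:secretsmanager:us-east-1:111122223333:secret:MySecret"
                        )
                    },
                }
            ],
        }
    ],
}
# To submit: boto3.client("emr").modify_instance_groups(**request)
print(request["InstanceGroups"][0]["Configurations"][0]["Classification"])
```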

# Configure applications to use a specific Java Virtual Machine


Amazon EMR releases have different default Java Virtual Machine (JVM) versions. This page explains the JVM support for different releases and applications.

## Considerations


+ For information about the supported Java versions for applications, see the application pages in the [Amazon EMR Release Guide](emr-release-components.md).
+ Amazon EMR supports only one runtime version per cluster, and doesn't support running different nodes or applications on different runtime versions in the same cluster.
+ For Amazon EMR 7.x, the default Java Virtual Machine (JVM) is Java 17 for applications that support Java 17, with the exception of Apache Livy. For more information about the supported JDK versions for applications, see the corresponding release page in the Amazon EMR Release Guide.
+ Starting with Amazon EMR 7.1.0, Flink supports Java 17 and is set to it by default. To use a different Java runtime version, override the settings in `flink-conf`. For more information about configuring Flink to use Java 8 or Java 11, see [Configure Flink to run with Java 11](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/flink-configure.html#flink-configure-java11).
+ For Amazon EMR 5.x and 6.x, the default Java Virtual Machine (JVM) is Java 8.
  + For Amazon EMR releases 6.12.0 and higher, some applications also support Java 11 and 17. 
  + For Amazon EMR releases 6.9.0 and higher, Trino uses Java 17 by default. For more information about Java 17 with Trino, see [Trino updates to Java 17](https://trino.io/blog/2022/07/14/trino-updates-to-java-17.html) on the Trino blog.

Keep in mind the following application-specific considerations when you choose your runtime version:


**Application-specific Java configuration notes**  

| Application | Java configuration notes | 
| --- | --- | 
| Spark | To run Spark with a non-default Java version, you must configure both Spark and Hadoop. For examples, see [Override the JVM](#configuring-java8-override). [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/configuring-java8.html) | 
| Spark RAPIDS | You can run RAPIDS with the configured Java version for Spark. | 
| Iceberg | You can run Iceberg with the configured Java version of the application that is using it. | 
| Delta | You can run Delta with the configured Java version of the application that is using it. | 
| Hudi | You can run Hudi with the configured Java version of the application that is using it. | 
| Hadoop | To update the JVM for Hadoop, modify `hadoop-env`. For examples, see [Override the JVM](#configuring-java8-override). | 
| Hive | To set the Java version to 11 or 17 for Hive, configure the Hadoop JVM setting to the Java version that you want to use.  | 
| HBase | To update the JVM for HBase, modify `hbase-env`. By default, Amazon EMR sets the HBase JVM based on the JVM configuration for Hadoop unless you override the settings in `hbase-env`. For examples, see [Override the JVM](#configuring-java8-override). | 
| Flink | To update the JVM for Flink, modify `flink-conf`. By default, Amazon EMR sets the Flink JVM based on the JVM configuration for Hadoop unless you override the settings in `flink-conf`. For more information, see [Configure Flink to run with Java 11](flink-configure.md#flink-configure-java11). | 
| Oozie | To configure Oozie to run on Java 11 or 17, configure the Oozie server and the Oozie Launcher AM, and change your client-side executable and job configurations. You can also configure `EmbeddedOozieServer` to run on Java 17. For more information, see [Configure Java version for Oozie](oozie-java.md). | 
| Pig | Pig only supports Java 8. You can't use Java 11 or 17 with Hadoop and run Pig on the same cluster. | 

## Override the JVM


To override the JVM setting for an Amazon EMR release (for example, to use Java 17 with a cluster that uses Amazon EMR release 6.12.0), supply the `JAVA_HOME` setting to its environment classification, which is `application-env` (for example, `hadoop-env` or `spark-env`) for all applications except Flink. For Flink, the environment classification is `flink-conf`. For steps to configure the Java runtime with Flink, see [Configure Flink to run with Java 11](flink-configure.md#flink-configure-java11).

**Topics**
+ [Override the JVM setting with Apache Spark](#configuring-java8-override-spark)
+ [Override the JVM setting with Apache HBase](#configuring-java8-override-hbase)
+ [Override the JVM setting with Apache Hadoop and Hive](#configuring-java8-override-hadoop)

### Override the JVM setting with Apache Spark

When you use Spark with Amazon EMR releases 6.12 and higher, you can set the environment so that the executors use Java 11 or 17. When you use Spark with Amazon EMR releases lower than 5.x and write a driver for submission in cluster mode, the driver uses Java 7; however, you can set the environment so that the executors use Java 8.

To override the JVM for Spark, set `JAVA_HOME` in the `spark-env` classification. In this example, Hadoop is set to the same Java version, but that isn't required.

```
[
    {
        "Classification": "hadoop-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    }
]
```

As a best practice for Hadoop on Amazon EMR, keep the JVM version consistent across all Hadoop components.

The following example shows how to add the required configuration parameters for Amazon EMR 7.0.0 to ensure consistent Java version usage across all components.

```
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.JAVA_HOME": "/usr/lib/jvm/java-1.8.0",
      "spark.yarn.appMasterEnv.JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
    }
  },
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
        }
      }
    ],
    "Properties": {}
  }
]
```

### Override the JVM setting with Apache HBase

To configure HBase to use Java 11, you can set the following configuration when you launch the cluster.

```
[
    {
        "Classification": "hbase-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/jre-11",
                    "HBASE_OPTS": "-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Dsun.net.inetaddr.ttl=5"
                },
                "Configurations": []
            }
        ]
    }
]
```

### Override the JVM setting with Apache Hadoop and Hive

The following example shows how to set the JVM to version 17 for Hadoop and Hive.

```
[
    {
        "Classification": "hadoop-env", 
            "Configurations": [
                {
                    "Classification": "export", 
                    "Configurations": [], 
                    "Properties": {
                        "JAVA_HOME": "/usr/lib/jvm/jre-17"
                    }
                }
        ], 
        "Properties": {}
    }
]
```
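The nested env-classification pattern used in the examples above can also be generated programmatically, for example when you script cluster launches. The following is a minimal sketch; the function name is hypothetical:

```python
import json

def env_classification(classification: str, java_home: str) -> dict:
    """Build an Amazon EMR env classification that exports JAVA_HOME.

    Hypothetical helper: produces the nested export structure shown in
    the examples above for classifications such as hadoop-env or spark-env.
    """
    return {
        "Classification": classification,
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {"JAVA_HOME": java_home},
            }
        ],
        "Properties": {},
    }

print(json.dumps(env_classification("hadoop-env", "/usr/lib/jvm/jre-17"), indent=4))
```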

## Service ports


The following are YARN and HDFS service ports. These settings reflect Hadoop defaults. Other application services are hosted at default ports unless otherwise documented. For more information, see the application's project documentation.


**Port settings for YARN and HDFS**  

| Setting | Hostname/Port | 
| --- | --- | 
| `fs.default.name` | default (`hdfs://emrDeterminedIP:8020`) | 
| `dfs.datanode.address` | default (`0.0.0.0:50010`) | 
| `dfs.datanode.http.address` | default (`0.0.0.0:50075`) | 
| `dfs.datanode.https.address` | default (`0.0.0.0:50475`) | 
| `dfs.datanode.ipc.address` | default (`0.0.0.0:50020`) | 
| `dfs.http.address` | default (`0.0.0.0:50070`) | 
| `dfs.https.address` | default (`0.0.0.0:50470`) | 
| `dfs.secondary.http.address` | default (`0.0.0.0:50090`) | 
| `yarn.nodemanager.address` | default (`${yarn.nodemanager.hostname}:0`) | 
| `yarn.nodemanager.localizer.address` | default (`${yarn.nodemanager.hostname}:8040`) | 
| `yarn.nodemanager.webapp.address` | default (`${yarn.nodemanager.hostname}:8042`) | 
| `yarn.resourcemanager.address` | default (`${yarn.resourcemanager.hostname}:8032`) | 
| `yarn.resourcemanager.admin.address` | default (`${yarn.resourcemanager.hostname}:8033`) | 
| `yarn.resourcemanager.resource-tracker.address` | default (`${yarn.resourcemanager.hostname}:8031`) | 
| `yarn.resourcemanager.scheduler.address` | default (`${yarn.resourcemanager.hostname}:8030`) | 
| `yarn.resourcemanager.webapp.address` | default (`${yarn.resourcemanager.hostname}:8088`) | 
| `yarn.web-proxy.address` | default (`no-value`) | 
| `yarn.resourcemanager.hostname` | `emrDeterminedIP` | 

**Note**  
The term *emrDeterminedIP* is an IP address that the Amazon EMR control plane generates. In newer versions, this convention has been removed, except for the `yarn.resourcemanager.hostname` and `fs.default.name` settings.
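From a cluster node, you can confirm that a daemon is listening on its expected port with a quick TCP socket check. The following is a minimal sketch; the port shown is the YARN ResourceManager web UI default from the table above:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: on the primary node, check the ResourceManager web UI port.
print(port_open("localhost", 8088))
```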

## Application users


Applications run processes as their own user. For example, Hive JVMs run as user `hive`, MapReduce JVMs run as `mapred`, and so on. This is demonstrated in the following process status example.

```
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hive      6452  0.2  0.7 853684 218520 ?       Sl   16:32   0:13 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop
hive      6557  0.2  0.6 849508 202396 ?       Sl   16:32   0:09 /usr/lib/jvm/java-openjdk/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=/usr/lib/hadoop/l
hbase     6716  0.1  1.0 1755516 336600 ?      Sl   Jun21   2:20 /usr/lib/jvm/java-openjdk/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hbase     6871  0.0  0.7 1672196 237648 ?      Sl   Jun21   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_thrift -XX:OnOutOfMemoryError=kill -9 %p -Xmx1024m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/
hdfs      7491  0.4  1.0 1719476 309820 ?      Sl   16:32   0:22 /usr/lib/jvm/java-openjdk/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-ip-10-71-203-213.log -Dhadoo
yarn      8524  0.1  0.6 1626164 211300 ?      Sl   16:33   0:05 /usr/lib/jvm/java-openjdk/bin/java -Dproc_proxyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-
yarn      8646  1.0  1.2 1876916 385308 ?      Sl   16:33   0:46 /usr/lib/jvm/java-openjdk/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-y
mapred    9265  0.2  0.8 1666628 260484 ?      Sl   16:33   0:12 /usr/lib/jvm/java-openjdk/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop
```
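To see which users own the application processes on a node, you can group a `ps` listing by user. The following is a minimal sketch that parses output shaped like the listing above; the sample text is abbreviated:

```python
from collections import Counter

def users_from_ps(ps_output: str) -> Counter:
    """Count processes per user from `ps aux`-style output (header skipped)."""
    lines = ps_output.strip().splitlines()[1:]  # skip the header row
    return Counter(line.split()[0] for line in lines if line.split())

sample = """USER       PID %CPU %MEM COMMAND
hive      6452  0.2  0.7 /usr/lib/jvm/java-openjdk/bin/java
hive      6557  0.2  0.6 /usr/lib/jvm/java-openjdk/bin/java
hdfs      7491  0.4  1.0 /usr/lib/jvm/java-openjdk/bin/java
yarn      8646  1.0  1.2 /usr/lib/jvm/java-openjdk/bin/java
"""
print(users_from_ps(sample))  # Counter({'hive': 2, 'hdfs': 1, 'yarn': 1})
```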