

# Improve Hive performance
<a name="emr-hive-s3-performance"></a>

Amazon EMR offers features to help optimize performance when using Hive to query, read, and write data stored in Amazon S3.

S3 Select can improve query performance for CSV and JSON files in some applications by “pushing down” processing to Amazon S3.

The EMRFS S3 optimized committer is an alternative to the [OutputCommitter](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/OutputCommitter.html) class that eliminates list and rename operations, improving performance when writing files to Amazon S3 with EMRFS.

**Topics**
+ [Enabling Hive EMRFS S3 optimized committer](hive-optimized-committer.md)
+ [Using S3 Select with Hive to improve performance](emr-hive-s3select.md)
+ [MSCK Optimization](emr-msck-optimization.md)

# Enabling Hive EMRFS S3 optimized committer
<a name="hive-optimized-committer"></a>

The Hive EMRFS S3 optimized committer is an alternative way for EMR Hive to write files for insert queries when using EMRFS. The committer improves application performance by eliminating list and rename operations performed on Amazon S3. The feature is available beginning with Amazon EMR 5.34 and Amazon EMR 6.5.
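To make the difference concrete, here is a conceptual sketch in Python, with local files standing in for S3 objects and all names illustrative rather than Hive internals. The default OutputCommitter pattern writes to a temporary attempt path and renames it into place on commit; on Amazon S3 that "rename" is a copy plus a delete, preceded by listing the attempt directory. The optimized pattern writes directly to the final location instead.

```python
# Conceptual sketch only: local files stand in for S3 objects,
# and the paths/names are illustrative, not Hive internals.
import os
import tempfile

def rename_commit(table_dir: str, data: bytes) -> str:
    """Default pattern: write to a temporary attempt path, then rename."""
    tmp = os.path.join(table_dir, "_temporary", "attempt_0", "000000_0")
    os.makedirs(os.path.dirname(tmp), exist_ok=True)
    with open(tmp, "wb") as f:
        f.write(data)
    final = os.path.join(table_dir, "000000_0")
    os.rename(tmp, final)  # cheap locally, but a copy + delete on S3
    return final

def direct_commit(table_dir: str, data: bytes) -> str:
    """Optimized pattern: write straight to the final location."""
    os.makedirs(table_dir, exist_ok=True)
    final = os.path.join(table_dir, "000000_0")
    with open(final, "wb") as f:
        f.write(data)  # no list or rename needed at commit time
    return final

with tempfile.TemporaryDirectory() as d:
    renamed = rename_commit(os.path.join(d, "t1"), b"rows")
    direct = direct_commit(os.path.join(d, "t2"), b"rows")
    # Both paths end up with identical contents in the final location.
    same_contents = open(renamed, "rb").read() == open(direct, "rb").read()
```

Both approaches produce the same final file; the optimized path simply skips the extra S3 operations at commit time.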

## Enabling the committer
<a name="enabling-hive-committer"></a>

To enable EMR Hive to use `HiveEMRFSOptimizedCommitter` to commit data by default for all Hive managed and external tables, use the following `hive-site` configuration on EMR 6.5.0 or EMR 5.34.0 clusters.

```
[
   {
      "classification": "hive-site",
      "properties": {
         "hive.blobstore.use.output-committer": "true"
      }
   }
]
```

**Note**  
Do not turn this feature on when `hive.exec.parallel` is set to `true`.

## Limitations
<a name="hive-committer-limitations"></a>

The following limitations apply to the Hive EMRFS S3 optimized committer:
+ Enabling Hive to merge small files automatically is not supported. The default Hive commit logic will be used even when the optimized committer is enabled.
+ Hive ACID tables are not supported. The default Hive commit logic will be used even when the optimized committer is enabled.
+ The naming convention for written files changes from Hive's `<task_id>_<attempt_id>_<copy_n>` to `<task_id>_<attempt_id>_<copy_n>_<query_id>`. For example, a file named 

  `s3://warehouse/table/partition=1/000000_0` becomes `s3://warehouse/table/partition=1/000000_0-hadoop_20210714130459_ba7c23ec-5695-4947-9d98-8a40ef759222-1`. The `query_id` is a combination of the username, time stamp, and a UUID.
+ When custom partitions are located on different file systems (HDFS and Amazon S3), this feature is automatically disabled and the default Hive commit logic is used.

# Using S3 Select with Hive to improve performance
<a name="emr-hive-s3select"></a>

**Important**  
Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use the feature as usual. [Learn more](https://aws.amazon.com/blogs/storage/how-to-optimize-querying-your-data-in-amazon-s3/)

With Amazon EMR release version 5.18.0 and later, you can use [S3 Select](https://aws.amazon.com/blogs/aws/s3-glacier-select/) with Hive on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object. For Amazon EMR, the computational work of filtering large datasets for processing is "pushed down" from the cluster to Amazon S3, which can improve performance in some applications and reduce the amount of data transferred between Amazon EMR and Amazon S3.

S3 Select is supported with Hive tables based on CSV and JSON files. To use it, set the `s3select.filter` configuration variable to `true` during your Hive session. For more information and examples, see [Specifying S3 Select in your code](#emr-hive-s3select-specify).

## Is S3 Select right for my application?
<a name="emr-hive-s3select-apps"></a>

We recommend that you benchmark your applications with and without S3 Select to determine whether it is suitable for your application.

Use the following guidelines to determine if your application is a candidate for using S3 Select:
+ Your query filters out more than half of the original dataset.
+ Your query filter predicates use columns that have a data type supported by Amazon S3 Select. For more information, see [Data types](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-data-types.html) in the *Amazon Simple Storage Service User Guide*.
+ Your network connection between Amazon S3 and the Amazon EMR cluster has good transfer speed and available bandwidth. Amazon S3 does not compress HTTP responses, so the response size is likely to increase for compressed input files.
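The first guideline can be sketched as back-of-the-envelope arithmetic. All sizes and the selectivity figure below are hypothetical; real savings depend on your data, compression, and network.

```python
# Hypothetical estimate of bytes moved with and without pushdown.
def bytes_transferred(object_size: int, selectivity: float,
                      use_s3_select: bool) -> int:
    """selectivity: fraction of the data the filter keeps (0.0 to 1.0)."""
    if use_s3_select:
        # S3 returns only the matching subset. Responses are not
        # compressed, so compressed inputs can expand on the wire.
        return int(object_size * selectivity)
    # Without pushdown, the whole object travels to the cluster.
    return object_size

one_gb = 1_000_000_000
without = bytes_transferred(one_gb, 0.2, use_s3_select=False)
with_select = bytes_transferred(one_gb, 0.2, use_s3_select=True)
# Keeping 20% of the rows moves roughly a fifth of the bytes.
```

The further the selectivity drops below one half, the more the reduced transfer is likely to outweigh the per-request overhead of pushing the filter down.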

## Considerations and limitations
<a name="emr-hive-s3select-considerations"></a>
+ Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported. 
+ The `AllowQuotedRecordDelimiters` property is not supported. If this property is specified, the query fails.
+ Only CSV and JSON files in UTF-8 format are supported. Multi-line CSVs and JSON are not supported.
+ Only uncompressed, gzip, or bzip2 files are supported.
+ Comment characters in the last line are not supported.
+ Empty lines at the end of a file are not processed.
+ Hive on Amazon EMR supports the primitive data types that S3 Select supports. For more information, see [Data types](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-data-types.html) in the *Amazon Simple Storage Service User Guide*.

## Specifying S3 Select in your code
<a name="emr-hive-s3select-specify"></a>

To use S3 Select in your Hive table, create the table by specifying `com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat` as the `INPUTFORMAT` class name, and specify a value for the `s3select.format` property using the `TBLPROPERTIES` clause.

By default, S3 Select is disabled when you run queries. Enable S3 Select by setting `s3select.filter` to `true` in your Hive session. The following examples demonstrate how to specify S3 Select when creating a table from underlying CSV and JSON files, and then how to query the table with a simple SELECT statement.

**Example CREATE TABLE statement for CSV-based table**  

```
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/mycsvfile/'
TBLPROPERTIES (
  "s3select.format" = "csv",
  "s3select.headerInfo" = "ignore"
);
```

**Example CREATE TABLE statement for JSON-based table**  

```
CREATE TABLE mys3selecttable (
col1 string,
col2 int,
col3 boolean
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
INPUTFORMAT
  'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://path/to/json/'
TBLPROPERTIES (
  "s3select.format" = "json"
);
```

**Example SELECT statement**  

```
SET s3select.filter=true;
SELECT * FROM mys3selecttable WHERE col2 > 10;
```

# MSCK Optimization
<a name="emr-msck-optimization"></a>

Hive stores a list of partitions for each table in its metastore. However, when partitions are directly added to or removed from the file system, the Hive metastore is unaware of these changes. The [MSCK command](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)) updates the partition metadata in the Hive metastore for partitions that were directly added to or removed from the file system. The syntax for the command is:

```
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
```

Hive implements this command as follows:

1. Hive retrieves all the partitions for the table from the metastore. From the partition paths that do not exist in the file system, it then creates a list of partitions to drop from the metastore.

1. Hive gathers the partition paths present in the file system, compares them with the list of partitions from the metastore, and generates a list of partitions that need to be added to the metastore.

1. Hive updates the metastore using `ADD`, `DROP`, or `SYNC` mode.
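The steps above amount to two set differences. The following Python sketch is illustrative only; the names are not Hive internals.

```python
# Illustrative sketch of MSCK partition reconciliation as set differences.
def reconcile(metastore_partitions, filesystem_partitions):
    ms = set(metastore_partitions)
    fs = set(filesystem_partitions)
    to_drop = sorted(ms - fs)  # in the metastore, gone from the file system
    to_add = sorted(fs - ms)   # on the file system, missing from the metastore
    return to_add, to_drop

to_add, to_drop = reconcile(
    ["dt=2021-01-01", "dt=2021-01-02"],   # metastore's view
    ["dt=2021-01-02", "dt=2021-01-03"],   # paths found on the file system
)
# ADD mode applies only to_add, DROP mode only to_drop,
# and SYNC mode applies both.
```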

**Note**  
When there are many partitions in the metastore, the step to check if a partition does not exist in the file system takes a long time to run because the file system's `exists` API call must be made for each partition.

In Amazon EMR 6.5.0, Hive introduced a flag called `hive.emr.optimize.msck.fs.check`. When enabled, this flag causes Hive to check for a partition's presence in the list of file system partition paths generated in step 2 above, instead of making a file system API call for each partition. In Amazon EMR 6.8.0, Hive enabled this optimization by default, so you no longer need to set `hive.emr.optimize.msck.fs.check`.
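A small sketch of why this matters, with call counts and names that are purely illustrative: checking each metastore partition with a per-path `exists` call costs one file system call per partition, while reusing the listing from step 2 costs nothing beyond the single list operation already performed.

```python
# Illustrative sketch: count file system calls for the two strategies.
class CountingFS:
    """Toy file system that counts exists() calls."""
    def __init__(self, paths):
        self.paths = set(paths)       # stands in for the step-2 listing
        self.exists_calls = 0

    def exists(self, path):
        self.exists_calls += 1
        return path in self.paths

fs = CountingFS([f"part={i}" for i in range(1000)])
metastore = [f"part={i}" for i in range(1500)]

# Pre-optimization behavior: one exists() call per metastore partition.
stale = [p for p in metastore if not fs.exists(p)]
calls_old = fs.exists_calls

# Optimized behavior: membership test against the already-listed paths,
# so no additional file system calls are made.
stale_fast = [p for p in metastore if p not in fs.paths]
calls_new = fs.exists_calls - calls_old
```

With many partitions, eliminating the per-partition call is what turns a long-running MSCK check into a single listing plus in-memory comparisons.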