

# Configure a location for Amazon EMR cluster output


The most common output format of an Amazon EMR cluster is text files, either compressed or uncompressed. These files are typically written to an Amazon S3 bucket. The bucket must exist before you launch the cluster, and you specify it as the output location when you launch the cluster.

For more information, see the following topics:

**Topics**
+ [Create and configure an Amazon S3 bucket](#create-s3-bucket-output)
+ [What formats can Amazon EMR return?](emr-plan-output-formats.md)
+ [How to write data to an Amazon S3 bucket you don't own with Amazon EMR](emr-s3-acls.md)
+ [Ways to compress the output of your Amazon EMR cluster](emr-plan-output-compression.md)

## Create and configure an Amazon S3 bucket


Amazon EMR uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as *buckets*. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements. For more information, see [Bucket Restrictions and Limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide*.

To create an Amazon S3 bucket, follow the instructions on the [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) page in the *Amazon Simple Storage Service User Guide*.

**Note**  
 If you enable logging in the **Create a Bucket** wizard, it enables only bucket access logs, not cluster logs. 

**Note**  
For more information about specifying Region-specific buckets, see [Buckets and Regions](https://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html) in the *Amazon Simple Storage Service Developer Guide* and [Available Region Endpoints for the AWS SDKs](https://aws.amazon.com/articles/available-region-endpoints-for-the-aws-sdks/).

After you create your bucket, you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access. We strongly recommend that you follow [Security Best Practices for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html) when configuring your bucket.

Required Amazon S3 buckets must exist before you can create a cluster, and you must upload to Amazon S3 any scripts or data that the cluster references. The following table shows example locations for data, scripts, and log files.


| Information | Example Location on Amazon S3 | 
| --- | --- | 
| script or program |  s3://amzn-s3-demo-bucket1/script/MapperScript.py  | 
| log files |  s3://amzn-s3-demo-bucket1/logs  | 
| input data |  s3://amzn-s3-demo-bucket1/input  | 
| output data |  s3://amzn-s3-demo-bucket1/output  | 

# What formats can Amazon EMR return?


The default output format for a cluster is text files with key-value pairs written to individual lines. This is the most commonly used output format.
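To make the format concrete, here is a plain Python sketch (not an EMR API) that emulates the tab-separated key-value lines Hadoop's default `TextOutputFormat` writes to a part file. The word counts are hypothetical:

```python
# Emulate the tab-separated key/value lines that Hadoop's default
# TextOutputFormat writes to a part file (hypothetical word counts).
counts = {"apple": 3, "banana": 5}

lines = [f"{key}\t{value}" for key, value in sorted(counts.items())]
part_file_text = "\n".join(lines) + "\n"
print(part_file_text, end="")
```

Each reducer writes one such part file (for example, `part-00000`) to the configured output location.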

If your output data needs to be written in a format other than the default text files, you can use the Hadoop interface `OutputFormat` to specify other output types. You can also create a subclass of the `FileOutputFormat` class to handle custom data types. For more information, see [http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html).

 If you are launching a Hive cluster, you can use a serializer/deserializer (SerDe) to output data from HDFS to a given format. For more information, see [https://cwiki.apache.org/confluence/display/Hive/SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe). 

# How to write data to an Amazon S3 bucket you don't own with Amazon EMR


 When you write a file to an Amazon Simple Storage Service (Amazon S3) bucket, by default, you are the only one able to read that file. The assumption is that you will write files to your own buckets, and this default setting protects the privacy of your files. 

 However, if you are running a cluster, and you want the output to write to the Amazon S3 bucket of another AWS user, and you want that other AWS user to be able to read that output, you must do two things: 
+  Have the other AWS user grant you write permissions for their Amazon S3 bucket. The cluster you launch runs under your AWS credentials, so any clusters you launch will also be able to write to that other AWS user's bucket. 
+  Set read permissions for the other AWS user on the files that you or the cluster write to the Amazon S3 bucket. The easiest way to set these read permissions is to use canned access control lists (ACLs), a set of pre-defined access policies defined by Amazon S3. 

For information about how another AWS user can grant you permission to write files to that user's Amazon S3 bucket, see [Editing bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EditingBucketPermissions.html) in the *Amazon Simple Storage Service User Guide*.

 For your cluster to use canned ACLs when it writes files to Amazon S3, set the `fs.s3.canned.acl` cluster configuration option to the canned ACL to use. The following table lists the currently defined canned ACLs. 


| Canned ACL | Description | 
| --- | --- | 
| AuthenticatedRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AuthenticatedUsers group grantee is granted Permission.Read access. | 
| BucketOwnerFullControl | Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object. | 
| BucketOwnerRead | Specifies that the owner of the bucket is granted Permission.Read. The owner of the bucket is not necessarily the same as the owner of the object. | 
| LogDeliveryWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.LogDelivery group grantee is granted Permission.Write access, so that access logs can be delivered. | 
| Private | Specifies that the owner is granted Permission.FullControl. | 
| PublicRead | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read access. | 
| PublicReadWrite | Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read and Permission.Write access. | 
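As a quick reference, these canned ACL names correspond to the `x-amz-acl` request header values that Amazon S3 defines. The mapping below is a plain Python sketch for illustration; the header strings are standard Amazon S3 values, not an EMR-specific API:

```python
# Canned ACL names (as used with fs.s3.canned.acl) mapped to the
# corresponding x-amz-acl header values defined by Amazon S3.
CANNED_ACLS = {
    "AuthenticatedRead": "authenticated-read",
    "BucketOwnerFullControl": "bucket-owner-full-control",
    "BucketOwnerRead": "bucket-owner-read",
    "LogDeliveryWrite": "log-delivery-write",
    "Private": "private",
    "PublicRead": "public-read",
    "PublicReadWrite": "public-read-write",
}

print(CANNED_ACLS["BucketOwnerFullControl"])  # bucket-owner-full-control
```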

 There are many ways to set the cluster configuration options, depending on the type of cluster you are running. The following procedures show how to set the option for common cases. 

**To write files using canned ACLs in Hive**
+  From the Hive command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL you want the cluster to set on files it writes to Amazon S3. To access the Hive command prompt, connect to the primary node using SSH and type `hive` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md).

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the `set` command is case sensitive and contains no quotation marks or spaces.

  ```
  hive> set fs.s3.canned.acl=BucketOwnerFullControl;   
  create table acl (n int) location 's3://amzn-s3-demo-bucket/acl/'; 
  insert overwrite table acl select count(*) from acl;
  ```

   The last two lines of the example create a table that is stored in Amazon S3 and write data to the table. 

**To write files using canned ACLs in Pig**
+  From the Pig command prompt, set the `fs.s3.canned.acl` configuration option to the canned ACL you want the cluster to set on files it writes to Amazon S3. To access the Pig command prompt, connect to the primary node using SSH and type `pig` at the command prompt. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](emr-connect-master-node-ssh.md).

   The following example sets the `fs.s3.canned.acl` configuration option to `BucketOwnerFullControl`, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the `set` command includes one space before the canned ACL name and contains no quotation marks.

  ```
  pig> set fs.s3.canned.acl BucketOwnerFullControl; 
  store some data into 's3://amzn-s3-demo-bucket/pig/acl';
  ```

**To write files using canned ACLs in a custom JAR**
+  Set the `fs.s3.canned.acl` configuration option using Hadoop with the `-D` flag, as shown in the following example.

  ```
  hadoop jar hadoop-examples.jar wordcount \
      -Dfs.s3.canned.acl=BucketOwnerFullControl \
      s3://amzn-s3-demo-bucket/input s3://amzn-s3-demo-bucket/output
  ```

# Ways to compress the output of your Amazon EMR cluster


There are different ways to compress output that results from data processing. The compression tools you use depend on properties of your data. Compression can improve performance when you transfer large amounts of data.

## Output data compression


Output data compression compresses the output of your Hadoop job. If you are using `TextOutputFormat`, the result is a gzip-compressed text file. If you are writing to SequenceFiles, the result is a SequenceFile that is compressed internally. You can enable this by setting the configuration setting `mapred.output.compress` to `true`.
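For intuition, a gzip-compressed part file contains the same text lines, wrapped in gzip framing, so any gzip reader can recover them. A plain Python sketch (the record lines are hypothetical):

```python
import gzip

# Hypothetical TextOutputFormat output: tab-separated key/value lines.
records = b"apple\t3\nbanana\t5\n"

# With mapred.output.compress=true, the part file is written
# gzip-compressed (for example, part-00000.gz)...
compressed = gzip.compress(records)

# ...and decompressing recovers the original lines unchanged.
assert gzip.decompress(compressed) == records
print("round trip ok")
```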

If you are running a streaming job, you can enable this by passing the streaming job the following argument.

```
-jobconf mapred.output.compress=true
```
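In a full hadoop-streaming invocation, this argument appears alongside the usual input, output, mapper, and reducer options. The following is a sketch only; the jar path, bucket name, and script names are placeholders, not values from this guide.

```
hadoop jar hadoop-streaming.jar \
    -input s3://amzn-s3-demo-bucket1/input \
    -output s3://amzn-s3-demo-bucket1/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=true
```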

You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby CLI client.

```
--bootstrap-actions s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"
```

Finally, if you are writing a custom JAR, you can enable output compression with the following line when creating your job.

```
FileOutputFormat.setCompressOutput(conf, true);
```

## Intermediate data compression


If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. The map output is compressed, and it is decompressed when it arrives at the core node. The configuration setting is `mapred.compress.map.output`. You can enable this in the same ways as output compression.

When writing a custom JAR, use the following call:

```
conf.setCompressMapOutput(true);
```

## Using the Snappy library with Amazon EMR


Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMI versions 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, see [http://code.google.com/p/snappy/](http://code.google.com/p/snappy/).