

# Tutorial: Working with Amazon DynamoDB and Apache Hive
<a name="EMRforDynamoDB.Tutorial"></a>

In this tutorial, you will launch an Amazon EMR cluster, and then use Apache Hive to process data stored in a DynamoDB table.

*Hive* is a data warehouse application for Hadoop that allows you to process and analyze data from multiple sources. Hive provides a SQL-like language, *HiveQL*, that lets you work with data stored locally in the Amazon EMR cluster or in an external data source (such as Amazon DynamoDB).

For more information, see the [Hive Tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial).

**Topics**
+ [Before you begin](#EMRforDynamoDB.Tutorial.BeforeYouBegin)
+ [Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)
+ [Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)
+ [Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)
+ [Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)
+ [Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)
+ [Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)
+ [Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

## Before you begin
<a name="EMRforDynamoDB.Tutorial.BeforeYouBegin"></a>

For this tutorial, you will need the following:
+ An AWS account. If you do not have one, see [Signing up for AWS](SettingUp.DynamoWebService.md#SettingUp.DynamoWebService.SignUpForAWS).
+ An SSH client (Secure Shell). You use the SSH client to connect to the leader node of the Amazon EMR cluster and run interactive commands. SSH clients are available by default on most Linux, Unix, and Mac OS X installations. Windows users can download and install the [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/) client, which has SSH support.

**Next step**  
[Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)

# Step 1: Create an Amazon EC2 key pair
<a name="EMRforDynamoDB.Tutorial.EC2KeyPair"></a>

In this step, you will create the Amazon EC2 key pair you need to connect to an Amazon EMR leader node and run Hive commands.

1. Sign in to the AWS Management Console and open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/).

1. Choose a region (for example, `US West (Oregon)`). This should be the same region in which your DynamoDB table is located.

1. In the navigation pane, choose **Key Pairs**.

1. Choose **Create Key Pair**. 

1. In **Key pair name**, type a name for your key pair (for example, `mykeypair`), and then choose **Create**. 

1. Download the private key file. The file name will end with `.pem` (such as `mykeypair.pem`). Keep this private key file in a safe place. You will need it to access any Amazon EMR cluster that you launch with this key pair. 
**Important**  
If you lose the key pair, you cannot connect to the leader node of your Amazon EMR cluster.

   For more information about key pairs, see [Amazon EC2 Key Pairs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in the *Amazon EC2 User Guide*. 

**Next step**  
[Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)

# Step 2: Launch an Amazon EMR cluster
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster"></a>

In this step, you will configure and launch an Amazon EMR cluster. Hive and a storage handler for DynamoDB will already be installed on the cluster.

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr/](https://console.aws.amazon.com/emr/).

1. Choose **Create Cluster**.

1. On the **Create Cluster - Quick Options** page, do the following:

   1. In **Cluster name**, type a name for your cluster (for example: `My EMR cluster`).

   1. In **EC2 key pair**, choose the key pair you created earlier.

   Leave the other settings at their defaults.

1. Choose **Create cluster**.

It will take several minutes to launch your cluster. You can use the **Cluster Details** page in the Amazon EMR console to monitor its progress.

When the status changes to `Waiting`, the cluster is ready for use.

## Cluster log files and Amazon S3
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster.LogFilesAndS3"></a>

An Amazon EMR cluster generates log files that contain information about the cluster status and debugging information. The default settings for **Create Cluster - Quick Options** include setting up Amazon EMR logging.

If one does not already exist, the AWS Management Console creates an Amazon S3 bucket. The bucket name is `aws-logs-account-id-region`, where `account-id` is your AWS account number and `region` is the region in which you launched the cluster (for example, `aws-logs-123456789012-us-west-2`).
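The bucket-naming convention above can be sketched as a small helper. This is only an illustration of the pattern; the account number below is the placeholder value used throughout this guide.

```python
def emr_log_bucket_name(account_id: str, region: str) -> str:
    """Build the default Amazon EMR log bucket name: aws-logs-<account-id>-<region>."""
    return f"aws-logs-{account_id}-{region}"

# Placeholder account number from this guide:
print(emr_log_bucket_name("123456789012", "us-west-2"))  # aws-logs-123456789012-us-west-2
```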

**Note**  
You can use the Amazon S3 console to view the log files. For more information, see [View Log Files](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view-web-log-files.html) in the *Amazon EMR Management Guide*.

You can use this bucket for purposes in addition to logging. For example, you can use the bucket as a location for storing a Hive script or as a destination when exporting data from Amazon DynamoDB to Amazon S3.

**Next step**  
[Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)

# Step 3: Connect to the Leader node
<a name="EMRforDynamoDB.Tutorial.ConnectToLeaderNode"></a>

When the status of your Amazon EMR cluster changes to `Waiting`, you will be able to connect to the leader node using SSH and perform command line operations.

1. In the Amazon EMR console, choose your cluster's name to view its status.

1. On the **Cluster Details** page, find the **Leader public DNS** field. This is the public DNS name for the leader node of your Amazon EMR cluster.

1. To the right of the DNS name, choose the **SSH** link.

1. Follow the instructions in **Connect to the Leader Node Using SSH**.

   Depending on your operating system, choose the **Windows** tab or the **Mac/Linux** tab, and follow the instructions for connecting to the leader node.

After you connect to the leader node using either SSH or PuTTY, you should see a command prompt similar to the following:

```
[hadoop@ip-192-0-2-0 ~]$ 
```

**Next step**  
[Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)

# Step 4: Load data into HDFS
<a name="EMRforDynamoDB.Tutorial.LoadDataIntoHDFS"></a>

In this step, you will copy a data file into Hadoop Distributed File System (HDFS), and then create an external Hive table that maps to the data file.

**Download the sample data**

1. Download the sample data archive (`features.zip`):

   ```
   wget https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/samples/features.zip
   ```

1. Extract the `features.txt` file from the archive:

   ```
   unzip features.zip
   ```

1. View the first few lines of the `features.txt` file:

   ```
   head features.txt
   ```

   The result should look similar to this:

   ```
   1535908|Big Run|Stream|WV|38.6370428|-80.8595469|794
   875609|Constable Hook|Cape|NJ|40.657881|-74.0990309|7
   1217998|Gooseberry Island|Island|RI|41.4534361|-71.3253284|10
   26603|Boone Moore Spring|Spring|AZ|34.0895692|-111.410065|3681
   1506738|Missouri Flat|Flat|WA|46.7634987|-117.0346113|2605
   1181348|Minnow Run|Stream|PA|40.0820178|-79.3800349|1558
   1288759|Hunting Creek|Stream|TN|36.343969|-83.8029682|1024
   533060|Big Charles Bayou|Bay|LA|29.6046517|-91.9828654|0
   829689|Greenwood Creek|Stream|NE|41.596086|-103.0499296|3671
   541692|Button Willow Island|Island|LA|31.9579389|-93.0648847|98
   ```

   The `features.txt` file contains a subset of data from the United States Board on Geographic Names ([http://geonames.usgs.gov/domestic/download_data.htm](http://geonames.usgs.gov/domestic/download_data.htm)). The fields in each line represent the following:
   + Feature ID (unique identifier)
   + Name
   + Class (lake, forest, stream, and so on)
   + State
   + Latitude (degrees)
   + Longitude (degrees)
   + Height (in feet)

1. At the command prompt, enter the following command:

   ```
   hive
   ```

   The command prompt changes to this: `hive>` 

1. Enter the following HiveQL statement to create a native Hive table:

   ```
   CREATE TABLE hive_features
       (feature_id             BIGINT,
       feature_name            STRING,
       feature_class           STRING,
       state_alpha             STRING,
       prim_lat_dec            DOUBLE,
       prim_long_dec           DOUBLE,
       elev_in_ft              BIGINT)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '\n';
   ```

1. Enter the following HiveQL statement to load the table with data:

   ```
   LOAD DATA
   LOCAL
   INPATH './features.txt'
   OVERWRITE
   INTO TABLE hive_features;
   ```

1. You now have a native Hive table populated with data from the `features.txt` file. To verify, enter the following HiveQL statement:

   ```
   SELECT state_alpha, COUNT(*)
   FROM hive_features
   GROUP BY state_alpha;
   ```

   The output should show a list of states and the number of geographic features in each.
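As a local sanity check, the same per-state count can be reproduced over the ten sample lines shown earlier with a short Python sketch. This is only an illustration of what the HiveQL `GROUP BY` computes, not part of the tutorial's cluster workflow.

```python
from collections import Counter

# The ten sample lines from features.txt shown above.
sample = """\
1535908|Big Run|Stream|WV|38.6370428|-80.8595469|794
875609|Constable Hook|Cape|NJ|40.657881|-74.0990309|7
1217998|Gooseberry Island|Island|RI|41.4534361|-71.3253284|10
26603|Boone Moore Spring|Spring|AZ|34.0895692|-111.410065|3681
1506738|Missouri Flat|Flat|WA|46.7634987|-117.0346113|2605
1181348|Minnow Run|Stream|PA|40.0820178|-79.3800349|1558
1288759|Hunting Creek|Stream|TN|36.343969|-83.8029682|1024
533060|Big Charles Bayou|Bay|LA|29.6046517|-91.9828654|0
829689|Greenwood Creek|Stream|NE|41.596086|-103.0499296|3671
541692|Button Willow Island|Island|LA|31.9579389|-93.0648847|98
"""

# Field 4 (index 3) is state_alpha; count features per state, as in
# SELECT state_alpha, COUNT(*) FROM hive_features GROUP BY state_alpha;
counts = Counter(line.split("|")[3] for line in sample.splitlines())
print(counts["LA"])  # 2
```

In the sample, only Louisiana (`LA`) appears twice; the full `features.txt` file produces much larger counts per state.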

**Next step**  
[Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)

# Step 5: Copy data to DynamoDB
<a name="EMRforDynamoDB.Tutorial.CopyDataToDDB"></a>

In this step, you will copy data from the Hive table (`hive_features`) to a new table in DynamoDB.

1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. Choose **Create Table**.

1. On the **Create DynamoDB table** page, do the following:

   1. In **Table name**, type **Features**.

   1. For **Primary key**, in the **Partition key** field, type **Id**. Set the data type to **Number**.

      Clear **Use Default Settings**. For **Provisioned Capacity**, type the following:
      + **Read Capacity Units**—`10`
      + **Write Capacity Units**—`10`

   Choose **Create**.

1. At the Hive prompt, enter the following HiveQL statement: 

   ```
   CREATE EXTERNAL TABLE ddb_features
       (feature_id   BIGINT,
       feature_name  STRING,
       feature_class STRING,
       state_alpha   STRING,
       prim_lat_dec  DOUBLE,
       prim_long_dec DOUBLE,
       elev_in_ft    BIGINT)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES(
       "dynamodb.table.name" = "Features",
       "dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
   );
   ```

   You have now established a mapping between Hive and the Features table in DynamoDB.

1. Enter the following HiveQL statement to import data to DynamoDB:

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT
       feature_id,
       feature_name,
       feature_class,
       state_alpha,
       prim_lat_dec,
       prim_long_dec,
       elev_in_ft
   FROM hive_features;
   ```

   Hive submits a MapReduce job, which is processed by your Amazon EMR cluster. The job takes several minutes to complete.

1. Verify that the data has been loaded into DynamoDB:

   1. In the DynamoDB console navigation pane, choose **Tables**.

   1. Choose the Features table, and then choose the **Items** tab to view the data.
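The `dynamodb.column.mapping` property pairs each Hive column with a DynamoDB attribute as a comma-separated list of `hive_column:DynamoDBAttribute` entries. The following Python sketch only illustrates how that string decomposes; the actual parsing happens inside the DynamoDB storage handler.

```python
# The mapping string from the CREATE EXTERNAL TABLE statement above.
mapping = ("feature_id:Id,feature_name:Name,feature_class:Class,"
           "state_alpha:State,prim_lat_dec:Latitude,"
           "prim_long_dec:Longitude,elev_in_ft:Elevation")

# Split each "hive_column:DynamoDBAttribute" pair into a dict entry.
column_map = dict(pair.split(":") for pair in mapping.split(","))
print(column_map["feature_id"])  # Id
```

Note that `feature_id` maps to `Id`, the partition key of the Features table, so every row written through `ddb_features` must supply it.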

**Next step**  
[Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)

# Step 6: Query the data in the DynamoDB table
<a name="EMRforDynamoDB.Tutorial.QueryDataInDynamoDB"></a>

In this step, you will use HiveQL to query the Features table in DynamoDB. Try the following Hive queries:

1. All of the feature types (`feature_class`) in alphabetical order:

   ```
   SELECT DISTINCT feature_class
   FROM ddb_features
   ORDER BY feature_class;
   ```

1. All of the lakes that begin with the letter "M":

   ```
   SELECT feature_name, state_alpha
   FROM ddb_features
   WHERE feature_class = 'Lake'
   AND feature_name LIKE 'M%'
   ORDER BY feature_name;
   ```

1. States with at least three features higher than a mile (5,280 feet):

   ```
   SELECT state_alpha, feature_class, COUNT(*)
   FROM ddb_features
   WHERE elev_in_ft > 5280
   GROUP by state_alpha, feature_class
   HAVING COUNT(*) >= 3
   ORDER BY state_alpha, feature_class;
   ```
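To see what the first query's `DISTINCT ... ORDER BY` semantics produce, here is a local Python sketch over the ten sample rows from Step 4. It is an illustration only; against the full table the query returns many more feature classes.

```python
# (feature_name, feature_class) pairs from the ten sample rows in Step 4.
rows = [
    ("Big Run", "Stream"), ("Constable Hook", "Cape"),
    ("Gooseberry Island", "Island"), ("Boone Moore Spring", "Spring"),
    ("Missouri Flat", "Flat"), ("Minnow Run", "Stream"),
    ("Hunting Creek", "Stream"), ("Big Charles Bayou", "Bay"),
    ("Greenwood Creek", "Stream"), ("Button Willow Island", "Island"),
]

# SELECT DISTINCT feature_class FROM ddb_features ORDER BY feature_class;
distinct_classes = sorted({feature_class for _, feature_class in rows})
print(distinct_classes)
# ['Bay', 'Cape', 'Flat', 'Island', 'Spring', 'Stream']
```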

**Next step**  
[Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

# Step 7: (Optional) clean up
<a name="EMRforDynamoDB.Tutorial.CleanUp"></a>

Now that you have completed the tutorial, you can continue reading this section to learn more about working with DynamoDB data in Amazon EMR. You might decide to keep your Amazon EMR cluster up and running while you do this.

If you don't need the cluster anymore, you should terminate it and remove any associated resources. This will help you avoid being charged for resources you don't need.

1. Terminate the Amazon EMR cluster:

   1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr/](https://console.aws.amazon.com/emr/).

   1. Choose the Amazon EMR cluster, choose **Terminate**, and then confirm.

1. Delete the Features table in DynamoDB:

   1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

   1. In the navigation pane, choose **Tables**.

   1. Choose the Features table. From the **Actions** menu, choose **Delete Table**.

1. Delete the Amazon S3 bucket containing the Amazon EMR log files:

   1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

   1. From the list of buckets, choose `aws-logs-account-id-region`, where *account-id* is your AWS account number and *region* is the region in which you launched the cluster.

   1. From the **Action** menu, choose **Delete**.